Designing Written Assessment of Student Learning
Carlo Magno and Jerome Ouano
Chapter 1
Assessment, Measurement, and Evaluation

Chapter Objectives
1. Describe assessment in the educational and classroom setting.
2. Identify ways on how assessment is conducted in the educational setting.
3. Explain how assessment is integrated with instruction and learning.
4. Distinguish the critical features of measurement, evaluation, and assessment.
5. Provide the uses of assessment results.
Lessons
1. Assessment in the Classroom Context
2. The Role of Measurement and Evaluation in Assessment
   The Nature of Measurement
   The Nature of Evaluation
   Forms of Evaluation
   Models of Evaluation
   Examples of Evaluation Studies
3. The Process of Assessment
   Forms of Assessment
   Components of Classroom Assessment
   Paradigm Shifts in the Practice of Assessment
   Uses of Assessment
Lesson 1: Assessment in the Classroom Context

To better understand the nature of classroom assessment, it is important to answer three questions: (1) What is assessment? (2) How is assessment conducted? and (3) When is assessment conducted?
It is customary in the educational setting that at the end of a quarter, trimester, or semester, students receive a grade. The grade reflects a combination of the different forms of assessment that both the teacher and the student have conducted. These grades are based on a variety of information that the student and teacher gathered in order to objectively come up with a value that is reflective of the student's performance. The grades also serve to measure how well the students have accomplished the learning goals intended for them in a particular subject, course, or training. The process of collecting various pieces of information to come up with overall information that reflects the attainment of goals and purposes is referred to as assessment (the details of this process will be explained in the next section). The process of assessment involves other concepts such as measurement, evaluation, and testing (the distinction among these concepts and how they are related will be explained in the succeeding section of the book).

The teacher and students use various sources in coming up with an overall assessment of the student's performance. A student's grade that is reflective of performance is a collective assessment from various sources such as recitation, quizzes, long tests, final exams, projects, final papers, performance assessments, and other sources. Different schools and teachers give certain weights to these identified criteria depending on their set goals for the subject or course. Some schools assign weights based on the nature of the subject area, some teachers base them on the objectives set, and others treat all the criteria with equal weights. There is no ideal weight for these various criteria because it will depend on the overall purpose of the learning and teaching process, the orientation of the teachers, and the goals of the school (a simple weighted-average computation is sketched below).

An overall assessment should come from a variety of sources to be able to effectively use the information in making decisions about the students. For example, in order to promote a student to the next grade or year level, or to move the student to the next course, the information taken about the student's performance should be based on multiple forms of assessment. The student should have been assessed in different areas of performance to make valid decisions such as promotion, deciding the top pupils and honors, and even failure and retention in the current level. These sources come from objective assessments of learning such as several quizzes, a series of recitations, performance assessments in different areas, and feedback. These forms of assessment are generally given in order to determine how well the students can demonstrate a sample of their skills.
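To make the idea of weighting concrete, the short sketch below computes an overall grade from several assessment sources. The component names and weights are hypothetical and not a prescribed scheme; schools and teachers set their own.

```python
# Illustrative sketch (hypothetical components and weights): combining several
# assessment sources into an overall grade using a weighted average.
components = {          # score obtained out of 100 for each source
    "quizzes": 85,
    "recitation": 90,
    "long_tests": 78,
    "final_exam": 82,
    "project": 95,
}
weights = {             # teacher-assigned weights; they must sum to 1.0
    "quizzes": 0.20,
    "recitation": 0.10,
    "long_tests": 0.25,
    "final_exam": 0.30,
    "project": 0.15,
}

overall = sum(components[name] * weights[name] for name in components)
print(f"Overall grade: {overall:.1f}")   # weighted average of the sources
```

Changing the weights (for example, giving the final exam a larger share) changes the overall grade even when the component scores stay the same, which is why the choice of weights should follow the goals of the subject or school.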
Assessment is integrated in all parts of the teaching and learning process. This means that assessment can take place before instruction, during instruction, and after instruction.

Before instruction, teachers can use assessment results as the basis for the objectives and instruction in their plans. These assessment results come from the achievement tests of students from the previous year, the grades of students from the previous year, assessment results from the previous lesson, or pretest results before instruction takes place. Knowing the assessment results from different sources prior to planning the lesson helps teachers decide on instruction that is better fit for the kind of learners they will handle, set objectives appropriate for the learners' developmental level, and think of better ways of assessing students to effectively measure the skills learned.

During instruction, there are many ways of assessing student performance. While class discussion is conducted, teachers can ask questions and students can answer them orally to assess whether students can recall, understand, apply, analyze, evaluate, and synthesize the facts presented. During instruction, teachers can also provide seatwork and worksheets on every unit of the lesson to determine if students have mastered the skill needed before moving to the next lesson. Assignments are also provided to reinforce student learning inside the classroom. Assessment done during instruction serves as formative assessment, which is meant to prepare students before they are finally assessed in major exams and tests.

When the students are ready to be assessed after instruction has taken place, they are assessed on the variety of skills they were trained in, which then serves as a summative form of assessment. Final assessments come in the form of final exams, long tests, and final performance assessments, which cover a larger scope of the lesson and require more complex skills to be demonstrated. Assessments conducted at the end of instruction are more structured and announced, and students need time to prepare.
Review Questions:
1. What are the other processes involved in assessment?
2. Why should there be several sources of information in order to come up with an overall assessment?
3. What are the different purposes of assessment when conducted before, during, and after instruction?
4. Why is assessment integrated in the teaching and learning process?
Activity #1
Ask a sample of students the following questions:
1. Why do you think assessment is needed in learning?
2. What are the different ways of assessing student learning in the courses you are taking?
Tabulate the answers and present the answers in class.
Lesson 2: The Role of Measurement and Evaluation in Assessment

The concept of assessment is broad in that it involves other processes such as measurement and evaluation. Assessment involves several measurement processes in order to arrive at quantified results. When assessment results are used to make decisions and come up with judgments, evaluation takes place.
[Figure: The relationship among measurement, assessment, and evaluation]
The Nature of Measurement

Measurement is an important part of assessment. Measurement has the features of quantification, abstraction, and further analysis that are typical in the process of science. Some assessment results come in the form of quantitative values that enable further analysis.

Obtaining evidence about different phenomena in the world can be based on measurement. A statement can be accepted as true or false if the event can be directly observed. In the educational setting, before saying that a student is "highly intelligent," there must be observable proof that the student is indeed "highly intelligent." The people involved in identifying whether a student is "highly gifted" have to gather accurate evidence to claim the student as such. When a characteristic such as "intelligence" is demonstrated through a judgment, a high test score, exemplary performance in cognitive tasks, or high grades, then measurement must have taken place. If measurement is carefully done, the process meets the requirements of scientific inquiry.

Objects per se are not measured; what are measured are the characteristics or traits of objects. These measurable characteristics or traits are referred to as variables. Examples of variables studied in the educational setting are intelligence, achievement, aptitude, interest, attitude, temperament, and others. Nunnally (1970) defined measurement as consisting of "rules for assigning numbers to objects in such a way as to represent quantities of attributes." Measurement is used to quantify characteristics of objects. Quantification of characteristics or attributes has the following advantages:
1. Quantifying characteristics or attributes determines the amount of that attribute present. If a student was placed at the 10th percentile rank on an achievement test, it means that the student has achieved less in reference to others. A student who got a perfect score on a quiz about the facts of Jose Rizal's life has remembered enough information about Jose Rizal.

2. Quantification facilitates accurate information. If a student gets a standard score of -2 on a standardized test (standard scores range from -3 to +3, where 0 is the mean), it means that the student is below average on that test. If a student got a stanine score of 8 on a standardized test (stanine scores range from 1 to 9, where 5 is the average), it means that the student is above average or has demonstrated superior ability on the trait measured by the standardized test.

3. Quantification allows objective comparison of groups. Suppose that male and female students were tested on their math ability using the same test for both groups. The mean of the males' math scores is 92.3 and the mean of the females' math scores is 81.4. It can be said that the males performed better on the math test than the females when the difference is tested for significance.

4. Quantification allows classification of groups. The common way of categorizing sections or classes is based on students' general average grade from the last school year. This is especially true if there are designated top sections within a level. In the process, students' grades are ranked from highest to lowest and the necessary cut-offs are made depending on the number of students that can be accommodated in a class.

5. Quantification makes the data available for further analysis. When data are quantified, teachers, guidance counselors, researchers, administrators, and other personnel can obtain different results to summarize and make inferences about the data. The data may be presented in charts, graphs, and tables showing means and percentages. The quantified data can be further analyzed using inferential statistics, such as when comparing groups, benchmarking, and assessing the effectiveness of an instructional program.

The process of measurement in the physical sciences (physics, chemistry, biology) is similar to that in education and the social sciences. Both use instruments or tools to arrive at measurement results. The only difference is the variables of interest being measured. In the physical sciences, measurement is more accurate and precise because physical data are directly observable and the variables involved are tangible to the senses. In education, psychology, and the behavioral sciences, the data are subject to measurement error and large variability because of individual differences and the inability to control variations in the measurement conditions. Nevertheless, in education, psychology, and the behavioral sciences, there are statistical procedures for estimating measurement error, such as reporting standard deviations, standard errors, and variances.

Measurement facilitates objectivity in observation. Through measurement, extreme differences in results are avoided, provided that there is uniformity in conditions and individual differences are controlled. This implies that when two persons measure a variable following the same conditions, they should be able to get consistent results. Although there may be slight differences (especially if the variable measured is psychological in nature), the results should be at least consistent. Repeating the measurement process several times and obtaining consistent results indicate that the procedure undertaken is objective.
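As an illustration of the derived scores mentioned above (standard scores, stanines, and percentile ranks), the sketch below converts a set of hypothetical raw scores into these quantities. The data and the simple stanine rule used here (stanine ≈ 2z + 5, clipped to the range 1-9) are assumptions for illustration only.

```python
# Illustrative sketch (assumed data): expressing raw quiz scores as z-scores,
# stanines, and percentile ranks, the kinds of derived scores discussed above.
import statistics

scores = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]   # hypothetical raw scores

mean = statistics.mean(scores)
sd = statistics.stdev(scores)

def z_score(x):
    # Standard score: distance from the mean in standard deviation units.
    return (x - mean) / sd

def stanine(x):
    # Stanines run from 1 to 9 with a mean of 5; each band is half an SD wide.
    return max(1, min(9, round(z_score(x) * 2 + 5)))

def percentile_rank(x):
    # Percentage of scores in the group that fall below x.
    return 100 * sum(s < x for s in scores) / len(scores)

for s in scores:
    print(f"raw={s:3d}  z={z_score(s):+.2f}  stanine={stanine(s)}  PR={percentile_rank(s):.0f}")
```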
The process of measurement involves abstraction. Before a variable is measured using an instrument, the variable's nature needs to be clarified and studied well. The variable needs to be defined conceptually and operationally to identify how it is going to be measured. Knowing the conceptual definition based on several references will show the theory or conceptual framework that fully explains the variable. The framework reveals whether the variable is composed of components or specific factors. These specific factors that comprise the variable then need to be measured. A characteristic that is composed of several factors or components is called a latent variable. The components are usually called factors, subscales, or manifest variables. An example of a latent variable is "achievement." Achievement is composed of factors that include different subject areas in school such as math, general science, English, and social studies. Once the variable is defined and its underlying factors are identified, the appropriate instrument that can measure achievement can be selected. When the instrument or measure for achievement is selected, it becomes easy to operationally define the variable. An operational definition includes the procedures on how a variable will be measured or made to occur. For example, "achievement" can be operationally defined as measured by the Graduate Record Examination (GRE), which is composed of verbal, quantitative, analytical, biology, mathematics, music, political science, and psychology components.

When a variable is composed of several factors, it is said to be multidimensional. A multidimensional variable requires an instrument with several subtests in order to directly measure the underlying factors. A variable that does not have underlying factors is said to be unidimensional. A unidimensional variable measures an isolated, unitary attribute. Examples of unidimensional measures are the Rosenberg Self-Esteem Scale and the Penn State Worry Questionnaire (PSWQ). Examples of multidimensional measures are various ability tests and personality tests that are composed of several factors. The 16 PF is a personality test that is composed of 16 components (reserved, more intelligent, affected by feelings, assertive, sober, conscientious, venturesome, tough-minded, suspicious, practical, shrewd, placid, experimenting, self-sufficient, controlled, and relaxed).
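To illustrate the idea of a multidimensional measure, the sketch below scores a hypothetical instrument by aggregating item responses into subscale (factor) scores and a composite total. The item-to-subscale mapping and the scoring rule are invented for illustration; real instruments define their own scoring keys.

```python
# Illustrative sketch (hypothetical instrument): scoring a multidimensional
# measure by summing item responses per subscale and forming a composite total.
from typing import Dict, List

# Assumed mapping of items to the factors of a hypothetical achievement test.
subscales: Dict[str, List[str]] = {
    "math":    ["item1", "item2", "item3"],
    "science": ["item4", "item5", "item6"],
    "english": ["item7", "item8", "item9"],
}

# One student's item scores (1 = correct, 0 = incorrect).
responses = {"item1": 1, "item2": 0, "item3": 1,
             "item4": 1, "item5": 1, "item6": 1,
             "item7": 0, "item8": 1, "item9": 0}

# Each subscale score is a manifest indicator of the latent "achievement" factor.
factor_scores = {name: sum(responses[item] for item in items)
                 for name, items in subscales.items()}
total = sum(factor_scores.values())   # overall (composite) achievement score

print(factor_scores)   # e.g. {'math': 2, 'science': 3, 'english': 1}
print("Total:", total)
```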
The common tools used to measure variables in the educational setting are tests, questionnaires, inventories, rubrics, checklists, surveys, and others. Tests are usually used to determine student achievement and aptitude and serve a variety of purposes such as entrance exams, placement tests, and diagnostic tests. Rubrics are used to assess the performance of students in presentations such as speeches, essays, songs, and dances. Questionnaires, inventories, and checklists are used to identify certain attributes of students such as their attitude toward studying, attitude toward math, feedback on the quality of food in the canteen, feedback on the quality of service during enrollment, and other aspects.

The Nature of Evaluation

Evaluation takes place when the necessary measurement and assessment have been done. In order to evaluate whether a student will be retained or promoted to the next level, different aspects of the student's performance, such as grades and conduct, are carefully assessed and measured. To evaluate whether a remedial program in math is effective, the students' improvement in math, the teachers' teaching performance, and the students' attitude change toward math should be carefully assessed. Different measures are used to assess different aspects of the remedial
program to come up with an evaluation. According to Scriven (1967), evaluation is "judging the worth or merit" of a case (e.g., a student), program, policy, process, event, or activity. These objective judgments derived from evaluation enable stakeholders (a person or group with a direct interest, involvement, or investment in the program) to make further decisions about the case (e.g., students), programs, policies, processes, events, and activities.

In order to come up with a good evaluation, Fitzpatrick, Sanders, and Worthen (2004) indicated that there should be standards for judging quality and for deciding whether those standards should be relative or absolute. The standards are applied to determine the value, quality, utility, effectiveness, or significance of the case evaluated. In evaluating whether a university has a good reputation and offers quality education, it should be comparable to a standard university that topped the world university rankings. The features of the university evaluated should be similar to those of the standard university selected. A standard can also be in the form of ideal objectives such as the ones set by the Philippine Accrediting Association of Schools, Colleges and Universities (PAASCU). A university is evaluated on whether it can meet the necessary standards set by the external evaluators.

Fitzpatrick, Sanders, and Worthen (2004) clarified the aims of evaluation in terms of its purpose, outcome, implication, setting of agenda, generalizability, and standards. The purpose of evaluation is to help those who hold a stake in whatever is being evaluated. Stakeholders consist of many groups such as students, teachers, administrators, and staff. The outcome of evaluation leads to judgments on whether a program is effective or not, whether to continue or stop a program, or whether to accept or reject a student in the school. The implication of evaluation is to describe programs, policies, organizations, products, and individuals. In setting the agenda for evaluation, the questions for evaluation come from many sources, including the stakeholders. In making generalizations, a good evaluation is specific to the context in which the evaluation object rests. The standards of a good evaluation are assessed in terms of accuracy, utility, feasibility, and propriety.

A good evaluation adheres to the four standards of accuracy, utility, feasibility, and propriety set by the Joint Committee on Standards for Educational Evaluation headed by Daniel Stufflebeam in 1975 at Western Michigan University's Evaluation Center. These four standards are now referred to as the 'Standards for Evaluation of Educational Programs, Projects, and Materials.' Table 1 presents the description of the four standards.
Table 1
Standards for Evaluation of Educational Programs, Projects, and Materials

Utility
Summary: Intended to ensure that an evaluation will serve the information needs of its intended users.
Components: Stakeholder identification, evaluator credibility, information scope and selection, values identification, report clarity, report timeliness and dissemination, evaluation impact

Feasibility
Summary: Intended to ensure that an evaluation will be realistic, prudent, diplomatic, and frugal.
Components: Practical procedures, political viability, cost effectiveness

Propriety
Summary: Intended to ensure that an evaluation will be conducted legally, ethically, and with due regard for the welfare of those involved in the evaluation as well as those affected by its results.
Components: Service orientation, formal agreements, rights of human subjects, human interaction, complete and fair assessment, disclosure of findings, conflict of interest, fiscal responsibility

Accuracy
Summary: Intended to ensure that an evaluation will reveal and convey technically adequate information about the features that determine the worth or merit of the program being evaluated.
Components: Program documentation, context analysis, described purposes and procedures, defensible information sources, valid information, reliable information, systematic information, analysis of quantitative information, analysis of qualitative information, justified conclusions, impartial reporting, metaevaluation
Forms of Evaluation

Owen (1999) classified evaluation according to its form. Evaluation can be proactive, clarificative, interactive, monitoring, or impact evaluation.

1. Proactive. This evaluation is conducted before a program begins. It assists stakeholders in making decisions on the type of program needed. It usually starts with a needs assessment to identify the needs of stakeholders that will be addressed by the program. A review of literature is conducted to determine best practices and to create benchmarks for the program, ensuring that all critical areas are addressed in the evaluation process.

2. Clarificative. This evaluation is conducted during program development. It focuses on the evaluation of all aspects of the program design. It determines the intended outcomes and how the program, as designed, will achieve them. Determining how the program will achieve its goals involves identifying the strategies that will be implemented.

3. Interactive. This evaluation is conducted while the program is being delivered. It focuses on improving the program. It identifies what the program is trying to achieve, whether the delivery is consistent with the plan, and how the program can be changed to make it more effective.

4. Monitoring. This evaluation is conducted when the program has settled. It aims to justify and fine-tune the program. It focuses on whether the outcomes of the program have been delivered to the intended stakeholders. It determines whether the program is reaching the target population, whether the implementation meets the benchmarks, and what can be changed in the program to make it more efficient.
5. Impact. This evaluation is conducted when the program is already established. It focuses on the outcomes. It evaluates whether the program was implemented as planned, whether the needs were served, whether goal attainment is attributable to the program, and whether the program is cost effective.

These forms of evaluation are appropriate at certain time frames and stages of a program. The illustration below shows when each form of evaluation is appropriate.

Planning and Development Phase: Proactive, Clarificative
Implementation (Program Duration): Interactive, Monitoring
Settled: Impact
Models of Evaluation

Evaluation is also classified according to the models and frameworks used. The classifications of the models of evaluation are objectives-oriented, management-oriented, consumer-oriented, expertise-oriented, participant-oriented, and theory-driven.

1. Objectives-oriented. This model of evaluation determines the extent to which the goals of the program are met. The information that results from this model of evaluation can be used to reformulate the purpose of the program evaluated, the activity itself, and the assessment procedures used to determine the purpose or objectives of the program. In this model, there should be a set of established program objectives, and measures are undertaken to evaluate which goals were met and which goals were not met. The data are compared with the goals. The specific models for the objectives-oriented approach are the Tylerian Evaluation Approach, Metfessel and Michael's Evaluation Paradigm, the Provus Discrepancy Evaluation Model, Hammond's Evaluation Cube, and the Logic Model (see Fitzpatrick, Sanders, & Worthen, 2004).

2. Management-oriented. This model is used to aid administrators, policy-makers, boards, and practitioners in making decisions about a program. The system is structured around inputs, processes, and outputs to aid in the process of conducting the evaluation. The major target of this type of evaluation is the decision-maker. This form of evaluation provides the information needed to decide on the status of a program. The specific models of this evaluation are the CIPP (Context, Input, Process, and Product) model by Stufflebeam, Alkin's UCLA Evaluation Model, and Patton's Utilization-focused Evaluation (see Fitzpatrick, Sanders, & Worthen, 2004).

3. Consumer-oriented. This model is useful in evaluating whether a product is feasible, marketable, and significant. A consumer-oriented evaluation can be undertaken to determine whether there will be many enrollees in a school to be built on a designated location, whether there will be takers of a proposed graduate program, and whether a course is producing students who are employable. Specific models for this evaluation are Scriven's Key Evaluation Checklist, Ken Komoski's EPIE Checklist, and the Morrisett and Stevens Curriculum Materials Analysis System (CMAS) (see Fitzpatrick, Sanders, & Worthen, 2004).
4. Expertise-oriented. This model of evaluation uses external experts to judge an institution's program, product, or activity. In the Philippine setting, the accreditation of schools is based on this model. A group of professional experts makes evaluations based on existing school documents. This group of experts should complement each other in producing a sound judgment of the school's standards. This model comes in the form of formal professional reviews (like accreditation), informal professional reviews, ad hoc panel reviews (like funding agency reviews and blue ribbon panels), ad hoc individual reviews, and educational connoisseurship (see Fitzpatrick, Sanders, & Worthen, 2004).

5. Participant-oriented. The primary concern of this model is to serve the needs of those who participate in the program, such as the students and teachers in the case of evaluating a course. This model depends on the values and perspectives of the recipients of an educational program. The specific models for this evaluation are Stake's Responsive Evaluation, Patton's Utilization-focused Evaluation, and Fetterman's Empowerment Evaluation (see Fitzpatrick, Sanders, & Worthen, 2004).

6. Program Theory. This evaluation is conducted when stakeholders and evaluators intend to understand both the merits of a program and how its transformational processes can be exploited to improve the intervention (Chen, 2005). The effectiveness of a program in a theory-driven evaluation takes into account the causal mechanism and its implementation processes. Chen (2005) identified three strengths of program theory evaluation: (1) it serves accountability and program improvement needs, (2) it establishes construct validity on the parts of the evaluation process, and (3) it increases internal validity. Program theory measures the effect of the program intervention on the outcome as mediated by determinants. For example, in a program that implemented instruction and training for public school students on proper waste disposal, the quality of the training is assessed. The determinants for the stakeholders are then identified, such as adaptability, learning strategies, patience, and self-determination. These factors are measured as determinants. The outcome measures are then identified, such as the reduction of wastes, improvement of waste disposal practices, attitude change, and rating of environmental sanitation. The effect of the intervention on the determinants is assessed, as well as the effect of the determinants on the outcome measures. The direct effect of the intervention on the outcome is also assessed. The model of this evaluation is illustrated below.

Figure 1. Implicit Theory for Proper Waste Disposal
Intervention: Quality of instruction and training
Determinants: Adaptability, learning strategies, patience, and self-determination
Outcome: Reduction of wastes, improvement of waste disposal practices, attitude change, and rating of environmental sanitation
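The mediated relationship in Figure 1 (intervention to determinants to outcome) can be examined with regression. The sketch below is a minimal illustration using hypothetical column names and a simple product-of-coefficients estimate of the indirect effect; it is not the specific procedure Chen (2005) prescribes, only one common way to operationalize the paths.

```python
# Illustrative sketch (hypothetical data file and column names): a regression-based
# check of the mediation paths in a program-theory evaluation.
import pandas as pd
import statsmodels.formula.api as smf

# Assumed data: one row per student with a treatment indicator ("intervention"),
# a determinant score (e.g., self-determination), and an outcome score.
data = pd.read_csv("waste_program.csv")

# Path a: effect of the intervention on the determinant
path_a = smf.ols("determinant ~ intervention", data=data).fit()

# Paths b and c': effect of the determinant on the outcome, controlling for the intervention
path_bc = smf.ols("outcome ~ determinant + intervention", data=data).fit()

# A simple product-of-coefficients estimate of the mediated (indirect) effect
indirect = path_a.params["intervention"] * path_bc.params["determinant"]
print("Indirect effect via determinant:", round(indirect, 3))
print("Direct effect of intervention:  ", round(path_bc.params["intervention"], 3))
```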
Table 2
Integration of the Forms and Models of Evaluation

Proactive
Focus: Is there a need? What do we/others know about the problems to be addressed? What are the best practices?
Models of Evaluation: Consumer-oriented; identifying context in CIPP; setting goals in Tyler's Evaluation Approach

Clarificative
Focus: What is the program trying to achieve? How will the program, as designed, achieve the intended outcomes?
Models of Evaluation: Stake's Responsive Evaluation; objectives-oriented

Interactive
Focus: What is the program trying to achieve? Is delivery working and consistent with the plan? How could the program or organization be changed to be more effective?
Models of Evaluation: CIPP

Monitoring
Focus: Is the program reaching the target population? Is implementation meeting benchmarks? Are there differences across sites and over time? How and what can be changed to be more efficient and effective?
Models of Evaluation: CIPP

Impact
Focus: Is the program implemented as planned? Are stated goals achieved? Are needs served? Can goal achievement be attributed to the program? Are there unintended outcomes? Is the program cost effective?
Models of Evaluation: Objectives-oriented; program theory
Table 3
Implementing Procedures of the Different Models of Evaluation

Objectives-oriented
• Tylerian Evaluation Approach: 1. Establish broad goals; 2. Classify the goals; 3. Define objectives in behavioral terms; 4. Find situations in which achievement of objectives can be shown; 5. Develop measurement techniques; 6. Collect performance data; 7. Compare performance data with the behaviorally stated objectives.
• Metfessel and Michael's Evaluation Paradigm: 1. Involve stakeholders as facilitators in program evaluation; 2. Formulate goals; 3. Translate objectives into communicable forms; 4. Select instruments to furnish measures; 5. Carry out periodic observations; 6. Analyze data; 7. Interpret data using standards; 8. Develop recommendations for further implementation.
• Provus Discrepancy Evaluation Model: 1. Agree on standards; 2. Determine whether a discrepancy exists between performance and standards; 3. Use information on discrepancies to decide whether to improve, maintain, or terminate the program.
• Hammond's Evaluation Cube: 1. Needs of stakeholders; 2. Characteristics of the clients; 3. Source of service.
• Logic Model: 1. Inputs; 2. Services; 3. Outputs; 4. Immediate, intermediate, long-term, and ultimate outcomes.

Management-oriented
• CIPP (Context, Input, Process, and Product) by Stufflebeam: 1. Context evaluation; 2. Input evaluation; 3. Process evaluation; 4. Product evaluation.
• Alkin's UCLA Evaluation Model: 1. Systems assessment; 2. Program planning; 3. Program implementation; 4. Program improvement; 5. Program certification.
• Patton's Utilization-focused Evaluation: 1. Identify relevant decision makers and information users; 2. Determine what information is needed by various people; 3. Collect and provide the information.

Consumer-oriented
• Scriven's Key Evaluation Checklist: 1. Evidence of achievement; 2. Follow-up results; 3. Secondary and unintended effects; 4. Range of utility; 5. Moral considerations; 6. Costs.
• Morrisett and Stevens Curriculum Materials Analysis System (CMAS): 1. Describe the characteristics of the product; 2. Analyze its rationale and objectives; 3. Consider antecedent conditions; 4. Consider content; 5. Consider instructional theory; 6. Form an overall judgment.

Expertise-oriented
• Formal professional reviews: Accreditation
• Informal professional reviews: Peer reviews
• Ad hoc panel reviews: Funding agency reviews, blue ribbon panels
• Ad hoc individual reviews: Consultation
• Educational connoisseurship: Critics

Participant-oriented
• Stake's Responsive Evaluation: 1. Intents; 2. Observations; 3. Standards; 4. Judgments.
• Fetterman's Empowerment Evaluation: 1. Training; 2. Facilitation; 3. Advocacy; 4. Illumination; 5. Liberation.

Program Theory
• A determinant mediating the relationship between intervention and outcome; a relationship between program components conditioned by a third factor: 1. Establish a common understanding between stakeholders and evaluator; 2. Clarify the stakeholders' theory; 3. Construct the research design.
EMPIRICAL REPORTS
Examples of Evaluation Studies

Program Evaluation of the Civic Welfare Training Services
By Carlo Magno
The NSTPCW1 and NSTPCW2 of a college were evaluated using Stake's Responsive Evaluation. The NSTP offered by the college is the Civic Welfare Training Service (CWTS), which focuses on developing students' social concern, values, volunteerism, and service for the general welfare of the community. The main purpose of the evaluation was to determine the impact of the current NSTPCW1 and NSTPCW2 program offered by DLS-CSB by assessing (1) students' values, management strategies, and awareness of social issues, (2) students' performance during the immersion, (3) students' insights after the immersion, (4) teaching performance, and (5) the strengths and weaknesses of the program.

The evaluation of the outcome of the program shows that the impact on values is high, the impact of the components of NSTPCW2 is high, and the awareness of social issues is also high. The students' insights show that the acquisition of skills, values, and awareness concords with the impact gained. There is agreement that the students are consistently present, and they show high ratings on service, involvement, and attitude during the immersion activity. The more the teacher uses a learner-centered approach, the better the outcome on the students' part. The strengths of NSTPCW1 include internal and external aspects, and the weaknesses are on the teachers, class activities, and the social aspect. For NSTPCW2, the strengths are on student learning, activities, and formation, while the weaknesses are on the structure, activities, additional strategies, and the outreach area. When compared with the Principle on Social Development of the Lasallian Guiding Principles, the NSTP program is generally acceptable in terms of the standards on understanding social reality and social intervention and on developing solidarity and collaboration with the immersion centers.
An Evaluation of the Community Service Program of the De La Salle University-College of Saint Benilde
By Josefina Otarra-Sembrabo

The Community Service Program is an outreach program in line with the mission-vision of De La Salle-College of Saint Benilde (DLS-CSB). The Benildean core values are realized through direct service to marginalized sectors of society. The students are tasked to have an immersion with the marginalized, such as street children, the elderly, special people, and the like. After their service in the community, the students reflect on what they did, formulate insights, and relate them to Lasallian education. This service is a social transformation for the students and the community.

To evaluate the Community Service Program (CSP), Stufflebeam's Context-Input-Process-Product Evaluation was utilized. This type of evaluation focuses on the decision-management strategy. In the model, continuous feedback is important and is needed for better decisions and improvement of the program. The framework has four types of evaluation: context, input, process, and product. The context evaluation determines if the objectives of the program have been met. It aims to know if the objectives of the CSP have been achieved in relation to the mission and vision of DLS-CSB. The input evaluation describes the respondents and beneficiaries of the CSP. The process evaluation describes how the program was implemented in terms of procedures, policies, techniques, and strategies. This provides the evaluators the information needed to determine procedural issues and to interpret the outcome of the project. In the product evaluation, the outcome information
is related to the objectives and to the context, input, and process information. The information will be used to decide whether to terminate, modify, or refocus the program.

There were a total of 250 participants in the study, composed of students, beneficiaries, program staff members, and selected clients. The instruments used were three sets of evaluation questionnaires for the students, program implementers, and beneficiaries, and one interview guide used for the recipients of the CSP. Data analysis was both quantitative and qualitative in nature.

For the context evaluation, the evaluators looked into the objectives of the CSP, the mission-vision of CSB, the objectives of the Social Action Office (SAO), and their congruence. The DLS-CSB mission-vision is realized in the six core Benildean values, and to realize the mission-vision, the SAO created the CSP to enhance the social awareness of the students and instill social responsibility. Likewise, the objectives of the CSP are aligned with the CSB mission and vision. Seventy-five percent of the respondents said that the CSP objectives are in line with the CSB mission-vision. This was supported with actual experiences. The students and beneficiaries rated the extent to which the community service program has met its objectives as moderate.

For the input evaluation, the profiles of the students, program recipients, and implementers were reported. Most of the students were male, with an average age of 21, and from Manila. The recipients are mostly centers in the metropolis run by religious groups. The program implementers, on the other hand, are staff members responsible for the implementation of the program and have been with the college for 1-5 years.

The process evaluation of the program focused on the policies and procedures of the CSP, the role of the community service adviser, the strengths and weaknesses of the CSP, recommendations for improvement, and the insights of the program beneficiaries. In terms of policies, the CSP is a requirement for CSB students written in the Handbook. The program has 10 procedures, including application, general assembly, group meetings, leadership training, orientation seminar, initial area visit, immersion, group processing, and submission of documents. The students rated these as moderate as well; seven out of 10 of the procedures need improvement. On the role of the advisers, 68 of the students considered the role of the advisers helpful; however, the effectiveness of their performance was rated only moderately satisfactory. Three strong points given to the CSP are the provision of opportunities to gain social awareness, the actualization of social responsibility, and the personal growth of the students. The weaknesses include the difficulty of the program procedures, processes, and locations, and the negative attitude of some students. Some of the recommendations focus on program preparation, program staff, and community service locations. For the insights of the beneficiaries, some problems such as the attendance and seriousness of the students are taken into account and resolved through dialogue, feedback, and meetings. The beneficiaries also suggested to the CSP a more intensive orientation and preparation as well as closer coordination and program continuity.

Lastly, for the product evaluation, the internalization and personification of the core Benildean values and the benefits gained by the students and beneficiaries were taken into account. For internalization and personification, it appears that four out of the six core values are manifested by the students: deeply rooted faith, appreciation of individual uniqueness, professional competency, and creativity. The students also gained personal benefits such as increased social awareness, actualization of social responsibility, positive values, and realization of their blessings. On the other hand, the beneficiaries' benefits include long-term and short-term benefits. The short-term benefits are the socialization activities, the interaction between the students and clients, material help, manpower assistance, and tutorial classes, while the long-term benefits are the values inculcated in the children, interpersonal relationships, the knowledge imparted to them, and the contribution to physical growth. The program beneficiaries also identified strengths of
the CSP, such as the development of inner feelings of happiness, love, and concern as a result of their interaction with the students, the knowledge imparted to them, and the extension of material help through the program. The weaknesses, on the other hand, include the lack of preparation and interaction with the beneficiaries.

These findings are the basis of the conclusions. DLS-CSB has indeed a clear vision for its students, and it was actualized in the CSP. There is a need to strengthen the relation of the CSP objectives to the college vision-mission, as implied in the moderate ratings in the evaluation. There seems to be a need to expand the coverage of program recipients since it does not fully address the objectives set in the CSP. A review and update of the procedures is needed due to the problems encountered by the students and beneficiaries. The CSP advisers were also not able to perform their roles well from the point of view of the students and the representatives of the centers. The weaknesses pointed out in this program imply that there is a need for improvement, especially in the procedural stage. More intensive preparation should be done both in the implementation and in interacting with the marginalized sectors due to the need to better understand the sector they are to serve. Continuity of the program was highly recommended due to the short-term and repetitive activities, which will allow the program to successfully inculcate all of the core Benildean values. However, the integration of these core values does not vary among the students in terms of sex, year of entry, and course. All in all, the community service program proved to be beneficial for the students, beneficiaries, and recipients of the program.

In regard to the findings and conclusions, there are some recommendations for the CSP. The recommendations include continuity, changes, and improvement by taking into consideration the flaws and weaknesses of the previous program: intensive preparation for the service, review of the load of the students so they could give quality service to the sectors, improvement in the procedural stages, implementation of the CSP on a regular basis, student training, production of documentation and organized reports by the students, systematizing the community service, more volunteers, expanding the coverage of marginalized sectors, considering other locations of marginalized sectors, informing the students of their specific roles in the community service, involvement of the community service unit in seminars and conferences, periodic program evaluation, assessment of student involvement in the sectors, systematizing the needs assessment, and conducting longitudinal studies on the effects of the CSP in the lives of previous CSP volunteers.

Shorter Summary

This article deals with how the DLS-CSB Community Service Program (CSP) was evaluated through the use of Stufflebeam's Context-Input-Process-Product Evaluation Model. This type of evaluation focuses on the decision-management strategy. Here, continuous feedback is important and is needed for better decisions and improvement of the program. The framework has four types of evaluation: context, input, process, and product. The context evaluation determines if the objectives of the program have been met; it aims to know if the objectives of the CSP have been achieved in relation to the mission and vision of the college. The input evaluation describes the respondents and beneficiaries of the CSP. The process evaluation describes how the program was implemented in terms of procedures, policies, techniques, and strategies. This provides the evaluators the information needed to determine procedural issues and to interpret the outcome of the project. In the product evaluation, the outcome information is related to the objectives and to the context, input, and process information. The information will be used to decide whether to terminate, modify, or refocus the program. To do this, a total of 250 participants composed of students, beneficiaries, program staff members, and selected clients were included in the evaluation. The instruments used were three sets of evaluation questionnaires for the students, program implementers, and beneficiaries, and
one interview guide used for the recipients of the CSP. Data analysis was both quantitative and qualitative in nature. For the context evaluation, the evaluators looked into the objectives of the CSP, the mission-vision of CSB, the objectives of the SAO, and their congruence with each other. For the input evaluation, the profiles of the students, program recipients, and implementers were reported. The process evaluation of the program focused on the policies and procedures of the CSP, the role of the community service adviser, the strengths and weaknesses of the CSP, recommendations for improvement, and the insights of the program beneficiaries. Lastly, for the product evaluation, the internalization and personification of the core Benildean values and the benefits gained by the students and beneficiaries were taken into account.

The findings were used as the basis of the conclusions. DLS-CSB has indeed a clear vision for its students, and it was actualized in the CSP. There is a need to strengthen the relation of the CSP objectives to the college vision-mission, as implied in the moderate ratings in the evaluation. There seems to be a need to expand the coverage of program recipients since it does not fully address the objectives set in the CSP. A review and update of the procedures is needed due to the problems encountered by the students and beneficiaries. The CSP advisers were also not able to perform their roles well from the point of view of the students and the representatives of the centers. The weaknesses pointed out in this program imply that there is a need for improvement, especially in the procedural stage. More intensive preparation should be done both in the implementation and in interacting with the marginalized sectors due to the need to better understand the sector they are to serve. Continuity of the program was highly recommended due to the short-term and repetitive activities, which will allow the program to successfully inculcate all of the core Benildean values. However, the integration of these core values does not vary among the students in terms of sex, year of entry, and course. All in all, the community service program proved to be beneficial for the students, beneficiaries, and recipients of the program. In regard to the findings and conclusions, there are some recommendations for the CSP. The recommendations include continuity, changes, and improvement by taking into consideration the flaws and weaknesses of the previous program: intensive preparation for the service, review of the load of the students so they could give quality service to the sectors, improvement in the procedural stages, implementation of the CSP on a regular basis, student training, production of documentation and organized reports by the students, systematizing the community service, more volunteers, expanding the coverage of marginalized sectors, considering other locations of marginalized sectors, informing the students of their specific roles in the community service, involvement of the community service unit in seminars and conferences, periodic program evaluation, assessment of student involvement in the sectors, systematizing the needs assessment, and conducting longitudinal studies on the effects of the CSP in the lives of previous CSP volunteers.
World Bank Evaluation Studies on Educational Policy
By Carlo Magno

This report provides a panoramic view of different studies on education sponsored by the World Bank, focusing on the evaluation component. The report specifically presents completed studies on educational policy from 1990 to 2006. A panoramic view of the studies is presented, showing the area of investigation, the evaluation model, the method used, and the recommendations. A synthesis of these reports is shown in terms of the areas of investigation, content, methodology, and model used through vote counting. Vote counting is a modal categorization assumed to give the best estimate of selected criteria (Bushman, 1997). The World Bank provides support to education systems throughout the developing world. Such support is broadly aimed at helping countries attain the objectives of "Education for
All" and education for success in the knowledge economy. An important goal is to tailor Bank assistance to region- and country-specific factors such as demographics, culture, and the socioeconomic or geopolitical climate. Consequently, a top priority is to inform development assistance with the benefit of country-specific analysis examining (1) what factors drive education outcomes; (2) how they interact with each other; (3) which factors carry the most weight and which actions are likely to produce the greatest results; and (4) where the greatest risks and constraints lie. The World Bank divided the countries into different regions: Sub-Saharan Africa, East Asia and the Pacific, Europe and Central Asia, Latin America and the Caribbean, and the Middle East and North Africa.
Areas of Investigation

There are 28 studies on educational policy with a manifest evaluation component. Education studies with no evaluation aspect were not included. A synopsis of each study with the corresponding methodology and recommendations is found in Appendix A. The different areas of investigation were enumerated, and the number of studies conducted for each across the years was counted, as shown in Table 1. Most of the studies on educational policy target the basic needs of a country or a specified region of the world, such as the effectiveness of basic education, tertiary education, critical periods such as child development programs, and the promotion of adult literacy. In the earliest period (the 1990s), the trend of the studies was on information and communications technology (ICT) in basic education. The pattern of the 21st-century studies shows a concentration on evaluating the implementation of tertiary education across countries. This is critical since developing nations rely on the expertise produced by their manpower in the fields of science and technology. For the latest period, a new area of investigation, language learning, was explored due to the recognition of globalization in some countries like Vanuatu.
Table 1
Counts of Area of Investigation From 1990 to 2006
[Columns: Year, Country, Area of Investigation, No. of Studies, Total No. of Studies per Year. The 28 studies span 1992 to 2006; the per-area and per-country counts are summarized in Tables 2 and 3.]
It is shown in Table 1 that most studies on educational policy were conducted in the year 2000, since it is a turning point of the century. For the coming of a new century, much was being prepared, and this was operationalized by assessing, in a worldwide report, what had been accomplished in the recent 20th century. The studies typically cover a broad range of education topics such as school self-evaluation, early child development, basic education, adult literacy, and tertiary distance education. These areas of investigation cover most of the fields
studied in the 20th century, and an overall view of what had been accomplished was reported. It can also be noted that there is an increase in the number of studies conducted at the start of the 21st century. This can be explained by the growing trend of globalization, where communication across countries is more accessible. It can also be noted that no studies on educational policy with an evaluation component were completed for the years 1993, 1995, 1997, and 2005. The trend in the number of studies shows that studies completed in a later year tend to give more generalized findings, since they covered a larger and wider array of samples and took a longer period of time to finish. More results are expected before the end of 2005. The trend of studies across the years is significantly different from the expected number of studies, as revealed using a one-way chi-square test, where the computed value (χ² = 28.73, df = 14) exceeds the critical value of χ² = 23.58 at a 5% probability of error.

Table 2
Counts of Area of Investigation From 1990 to 2006

Area of Investigation (Number of Studies)
Language learning (1)
Undergraduate/Tertiary Education (4)
Adult literacy (2)
Early Child Development (5)
AIDS/HIV Prevention (1)
Textbook/Reading material (1)
Secondary education (3)
School Self-evaluation (1)
Basic education (4)
Test Evaluation (1)
Infant Care (1)
ICT (2)
Teacher Development (1)
Vocational Education (1)
Table 2 shows the number of studies conducted for every area in line with educational policy with evaluation. Most of the studies completed and funded are in the area of early child development, followed by tertiary education and basic education. This can be explained by the increasing number of early child care programs around the world, which are continuing and need to be evaluated in terms of their effectiveness at a certain period of time. Much of the concern is on early child development since it is a critical stage in life; an individual's development is evidently hampered if it is not attended to at an early age. This also shows the increasing number of children whose needs are undermined and for whom intervention has to take place. These programs sought the assistance of the World Bank because they need further funding to continue. Having an evaluation of a child program likely supports the approval of a further grant. There is also a large number of studies evaluating the effectiveness of basic and tertiary education. Almost all countries worldwide offer the same structure of education in terms of levels, from basic education to tertiary education. These levels deeply need attention since improving the quality of education is a basic key for developing nations, because the skills of their people determine the countries' overall labor force. When the observed counts of studies for each area of interest are tested for goodness of fit, the computed chi-square value (χ² = 13, df = 13) does not reach significance at the 5% level of significance. This means that the observed counts per area do not significantly differ from what is expected.

Table 3
Study Grants by Country

Country (No. of Studies)
Vanuatu (1)
Indonesia (1)
Thailand (1)
Senegal (1)
Different Regions (10)
Brazil (1)
China (1)
Pakistan (1)
Cuba (1)
Africa (2)
USA (1)
Chile (1)
Philippines (1)
The studies done for each country are almost equally distributed, except for Africa, with two studies from 1990 until the present period. There is a bulk of studies done worldwide, which covers a wider array of samples across different countries. The worldwide studies usually evaluate common programs across different countries, such as teacher effectiveness and child development programs, although there is great difficulty in coming up with an efficient judgment of the overall standards of each program separately. The advantage of having a worldwide study on educational programs for different regions is to have a simultaneous description of the common programs that are running, where the funding is most likely concentrated in one team of investigators rather than in separate studies with different fund allocations and different researchers setting different standards for each country. Another advantage is the efficiency of maintaining consistency of procedures across different settings. In the case of Africa, two studies were granted, concentrating on adult literacy and distance education, because these educational programs are critical in the region as compared to others. As shown in the demographics of the African region, these programs (adult literacy and distance education) are increasingly gaining benefits for their stakeholders. There is a report of remarkable improvement in adult education, and more tertiary students are benefiting from distance education. Since these programs are showing effectiveness, much funding is needed to continue them. When the numbers of studies are tested for significance across countries, the computed chi-square (χ² = 35.44, df = 12) reaches significance against a critical value of χ² = 21.03 at a 5% probability of error. This means that the number of studies for each country differs significantly from what is expected. This is also due to the large concentration of studies for different regions as compared to the minimal studies for each country, which made the difference.
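A one-way chi-square goodness-of-fit test like the ones reported above can be run as sketched below on the per-country counts in Table 3. The expected frequencies the authors used are not stated, so the equal-expected-counts default assumed here need not reproduce the reported value of 35.44.

```python
# Illustrative sketch: one-way chi-square goodness-of-fit test on the Table 3 counts.
from scipy.stats import chisquare

observed = [1, 1, 1, 1, 10, 1, 1, 1, 1, 2, 1, 1, 1]   # counts of studies per country

# By default, scipy compares the observed counts against equal expected frequencies.
chi2, p = chisquare(observed)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
# A significant result means the studies are not evenly distributed across countries.
```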
Method of Studies

Various methodologies are used to investigate the effectiveness of educational programs across different countries, although the reports do not concentrate on or elaborate the use and implementation of the procedures employed to evaluate the programs. Most only mention the questionnaires and assessment techniques used. Some mention a broad range of methodologies, such as quasi-experiments and case studies, but the specific designs are not indicated. It can also be noted that reports written by researchers or professors from universities are very clear in their method, which is academic in nature, whereas World Bank personnel writing a report tend to focus on the justification of the funding rather than on the clarity of the research procedure undertaken. Most reports do not include a methodology section at all: they present the introduction and some justification of the program and, toward the end, the recommendations. The methodologies are merely mentioned, not elaborated, and appear only in some parts of the justification of the program.

Table 4
Counts of Methods Used

Method                                              Counts
Questionnaires/Inventories/Tests                    4
Quasi-experimental                                  5
True experimental                                   1
Archival data (analyzed available demographics)     6
Observations                                        1
Case studies                                        1
Surveys                                             1
Multimethod                                         9
It can be noted in Table 4 that most studies employ a multimethod approach, in which different methods are used within one study. The multimethod approach creates an efficient way of cross-validating results across the methodologies undertaken: a result from one method can be checked against a result from another, which makes the approach more powerful than relying on a single method. Since most studies are evaluating a program, it is indeed better to use a multimethod approach because it generates findings from which the researcher can arrive at a better judgment and description of the program. It can also be noted that many studies use archival data to justify the program. Most of these researchers draw inferences from enrollment percentages, dropout rates, achievement levels, and statistics on physical conditions such as weight and height. These inferences can be valid, but they do not directly assess the effectiveness of the program. The difficulty with these statistics is that they do not provide a post-measurement of the program being evaluated. This may be due to the difficulty of obtaining national surveys on achievement levels and enrollment profiles of different educational institutions, which are done annually but may not coincide with the timetable of the researchers. It is also commendable that a number of studies use quasi-experimental designs to assess the effectiveness of educational programs directly. When the counts of the methodologies used were tested for significance, the computed chi-square value (χ2 = 18.29, df = 7) exceeded the critical chi-square value of χ2 = 14.07 at a 5% probability of error. This shows that the methodologies used vary significantly from what is expected.

The Use of Evaluation Models

The evaluation models used by the studies were counted. There was difficulty in identifying the models used because the researchers did not specifically elaborate the evaluation model or framework they were using. It can also be noted that the researchers are not really after the model but after establishing the program or its continuity. There is a marked difference between university academicians and World Bank personnel doing the studies: the latter are misplaced in their assessment due to the lack of guidance from a model, while academicians specifically state the context but somehow fail to elaborate the process when adopting a CIPP model. Most studies are clear about their program objectives but fail to provide accurate, direct measures of the program. Worse, most studies are not guided by any model in evaluating the educational programs proposed.

Table 5
Counts of Models/Frameworks Used

Model/Framework                         Counts
Objectives-oriented evaluation          10
Management-oriented evaluation          9
Consumer-oriented evaluation            0
Expertise-oriented evaluation           7
Participant-oriented evaluation         1
No model specified                      3
As shown in Table 5, the majority of the evaluations used the objectives-oriented approach, in which the program objectives are specified and the program is evaluated accordingly. A large number also used the management-oriented approach, specifically the CIPP model by Stufflebeam (1968). A number of studies also used experts as external evaluators of the program implementation. Most of the studies did not actually mention the model used; the models were identified from the procedures described in conducting the evaluation. Most studies used the objectives-oriented approach because the thrust is on educational policy and most educational programs begin by stating objectives. These objectives are also treated as ends, with the evaluation used as the basis for judging whether they were attained. The studies that used the management-oriented evaluation are those that typically describe the context of the educational setting using the available archival data provided by national and countrywide surveys. The inputs and outputs are also described, but most are weak in elaborating the process undertaken. The counts on the use of evaluation models (χ2 = 18, df = 5) reached significance at the 5% level of error against a critical value of χ2 = 11.07. This means that the counts differ significantly from what is expected, and it shows a need to use other models of evaluation as appropriate to the study being conducted.
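The goodness-of-fit results above follow the standard chi-square computation against equal expected counts. As a minimal illustration, the short Python sketch below recomputes the test for the counts in Table 5; the variable names are illustrative only, and the same steps apply to the counts in Tables 3 and 4.

    # Minimal sketch: chi-square goodness-of-fit for the model counts in Table 5,
    # assuming equal expected counts across the six categories.
    observed = [10, 9, 0, 7, 1, 3]            # counts per evaluation model
    expected = sum(observed) / len(observed)  # 30 studies / 6 models = 5 per model

    chi_square = sum((o - expected) ** 2 / expected for o in observed)
    df = len(observed) - 1

    print(f"chi-square = {chi_square:.2f}, df = {df}")
    # Prints chi-square = 18.00, df = 5, which exceeds the critical value of 11.07
    # at the 5% level, matching the significant result reported above.

Running the same computation on the counts in Table 4 reproduces the reported value of 18.29 with df = 7.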
Recommendations

1. It is recommended to increase the distribution of study grants across countries. There is a concentration of studies conducted regionally, which may neglect cultural and ethical considerations in testing and other forms of assessment. As a consequence, there is no cross-cultural perspective on how the programs are implemented in each country because the focus is on the consistency of the programs. Conducting individual country studies will give a more in-depth perspective of a program and how it is situated within a specific context.

2. It is recommended to include a specific section on the methodology undertaken by the researcher. This helps future researchers judge the validity of the procedures undertaken by the study. Clearly specifying the method used allows the study to be replicated as a best practice and makes it easier to identify procedures that need to be improved.

3. It is recommended to have separate studies concentrating exclusively on program evaluation after successive program implementations. This will provide a better picture of the worth of a program because the judgment concentrates on how the program is taking place rather than on other matters that undermine the results of the program. A good alternative is for the research grantee to allocate a separate budget for a follow-up program evaluation after establishing the program.

4. It is recommended that, when screening studies, a criterion on the use of an evaluation model be included. Researchers conducting an evaluation study can be better guided by the use of an evaluation model.
References
Bray, M. (1996). Decentralization of Education Community Financing. World Bank Reports.
Brazil Early Child Development: A Focus on the Impact of Preschools. (2001). World Bank Reports.
Bregman, J. & Stallmeister, S. (2002). Secondary Education in Africa: Strategies for Renewal. World Bank Reports.
Bushman, B. J. (1997). Vote-counting procedures in meta-analysis. In H. Cooper & L. V. Hedges (Eds.), The Handbook of Research Synthesis. New York: Russell Sage Publications.
Craig, H. J., Kraft, R. J., & du Plessis, J. (1998). Teacher Development: Making an Impact. World Bank Reports.
Education and HIV/AIDS: A Sourcebook of HIV/AIDS Prevention Programs. (2003). World Bank Reports.
Fretwell, D. I. & Colombano, J. E. (2000). Adult Continuing Education: An Integral Part of Lifelong Learning. Emerging Policies and Programs for the 21st Century in Upper and Middle Income Countries. World Bank Reports.
Gasperini, L. (2000). The Cuban Education System: Lessons and Dilemmas. World Bank Reports.
Getting an Early Start on Early Child Development. (2004). World Bank Reports.
Grigorenko, E. L. & Sternberg, R. J. (1999). Assessing Cognitive Development in Early Childhood. World Bank Reports.
Indonesia - Quality of Undergraduate Education Project. (2004). World Bank Reports.
Liang, X. (2001). China: Challenges of Secondary Education. World Bank Reports.
Nordtveit, B. J. (2004). Managing Public-Private Partnership: Lessons from Literacy Education in Senegal. World Bank Reports.
O'Gara, C., Lusk, D., Canahuati, J., Yablick, G., & Huffman, S. L. (1999). Good Practices in Infant and Toddler Group Care. World Bank Reports.
Operational Guidelines for Textbooks and Reading Materials. (2002). World Bank Reports.
Orazem, P. F. (2000). The Urban and Rural Fellowship School Experiments in Pakistan: Design, Evaluation, and Sustainability. World Bank Reports.
Osin, L. (1998). Computers in Education in Developing Countries: Why and How? World Bank Reports.
Philippines - Vocational Training Project. (1994). World Bank Reports.
Potashnik, M. (1996). Chile's Learning Network. World Bank Reports.
Riley, K. & MacBeath, J. (2000). Putting School Self-Evaluation in Place. World Bank Reports.
Saint, W. (2000). Tertiary Distance Education and Technology in Sub-Saharan Africa. World Bank Reports.
Saunders, L. (2000). Effective Schooling in Rural Africa Report 2: Key Issues Concerning School Effectiveness and Improvement. World Bank Reports.
Stufflebeam, D. L. (1968). Evaluation as enlightenment for decision making. Columbus: Ohio State University Evaluation Center.
Tertiary Education in Colombia: Paving the Way for Reform. (2003). World Bank Reports.
Thailand - Universities Science and Engineering Education Project. (2004). World Bank Reports.
Vanuatu: Learning and Innovation Credit for a Second Education Project. (2006). World Bank Reports.
Ware, S. A. (1992). Secondary School Science in Developing Countries: Status and Issues. World Bank Reports.
Xie, O., & Young, M. E. (1999). Integrated Child Development in Rural China. World Bank Reports.
Young, E. M. (2000). From Early Child Development to Human Development: Investing in Our Children's Future. World Bank Reports.
Activity # 2
1. Look for an evaluation study published on the Asian Development Bank webpage.
2. Summarize the study report by answering the following:
- What features of the study made it an evaluation?
- What form and model of evaluation was used?
- How was the form or model implemented in the study?
- What aspects of the evaluation study were measured?
Lesson 3
The Process of Assessment

The previous lesson clarified the distinction between measurement and evaluation. After learning about the process of assessment in this lesson, you should know how measurement and evaluation are used in assessment. Assessment goes beyond measurement, and evaluation can be involved in the process of assessment. Some definitions from assessment references show the overlap between assessment and evaluation, but Popham (1998), Gronlund (1993), and Huba and Freed (2000) defined assessment without overlap with evaluation. Take note of the following definitions:

1. Classroom assessment can be defined as the collection, evaluation, and use of information to help teachers make better decisions (McMillan, 2001).
2. Assessment is a process used by teachers and students during instruction that provides feedback to adjust ongoing teaching and learning to improve students' achievement of intended instructional outcomes (Popham, 1998).
3. Assessment is the systematic process of determining educational objectives and gathering, using, and analyzing information about student learning outcomes to make decisions about programs, individual student progress, or accountability (Gronlund, 1993).
4. Assessment is the process of gathering and discussing information from multiple and diverse sources in order to develop a deep understanding of what students know, understand, and can do with their knowledge as a result of their educational experiences; the process culminates when assessment results are used to improve subsequent learning (Huba & Freed, 2000).

Cronbach (1960) identified three important features of assessment that make it distinct from evaluation: (1) the use of a variety of techniques, (2) reliance on observation in structured and unstructured situations, and (3) the integration of information. These features emphasize that assessment is not based on a single measure but on a variety of measures. In the classroom, a student's grade is composed of quizzes, assignments, recitations, long tests, projects, and final exams. These sources are assessed through formal and informal structures and integrated to come up with an overall assessment, represented by the student's final grade.

In Lesson 1, assessment was defined as "the process of collecting various information needed to come up with an overall information that reflects the attainment of goals and purposes." There are three critical characteristics of this definition:

1. Process of collecting various information. A teacher arrives at an assessment after having conducted several measures of students' performance, such as recitations, long tests, final exams, and projects. Likewise, a student is proclaimed gifted only after having been tested with a battery (several) of intelligence and ability tests. A student to be designated as having Attention Deficit Disorder (ADD) needs to be diagnosed with several attention span and cognitive tests together with a series of clinical interviews by a skilled clinical psychologist. A variety of information is needed in order to arrive at accurate information in a valid way.

2. Integration of overall information. Coming up with an integrated assessment from various sources requires considering many aspects. The results of the individual measures should be consistent with each other to contribute meaningfully to the overall assessment. In such cases, a
battery of intelligence tests should yield the same results in order to determine the overall ability of a case. In cases where some results are inconsistent, there should be a synthesis of the overall assessment indicating that the results of some measures do not support the overall assessment.

3. Attainment of goals and purposes. Assessment is conducted based on specified goals. Assessment processes are framed around specified objectives to determine whether they are met. Assessment results are the best way to determine the extent to which a student has attained the intended objectives.

The Process of Assessment

The process of assessment was summarized by Bloom (1970), who indicated that there are two processes involved in assessment:

1. Assessment begins with an analysis of the criterion. The identification of the criterion includes the expectations and demands and other forms of learning targets (goals, objectives, expectations, etc.).
2. It proceeds to the determination of the kind of evidence that is appropriate about the individuals who are placed in the learning environment, such as their relevant strengths and weaknesses, skills, and abilities.

In the classroom context, it was explained in Lesson 1 that assessment takes place before, during, and after instruction. This process emphasizes that assessment is embedded in the teaching and learning process. Assessment generally starts in the planning of learning processes, when learning objectives are stated. A learning objective is defined in measurable terms so that there is an empirical way of testing it, and specific behaviors are stated in the objectives so that each corresponds with some form of assessment. Assessment can also occur during the implementation of the lesson: a teacher may provide feedback based on recitation exercises, short quizzes, and classroom activities that allow students to demonstrate the skill intended in the objectives. The assessment done during instruction should be consistent with the skills required in the objectives of the lesson. The final assessment is then conducted after enough assessment can demonstrate students' mastery of the lesson and their skills. The final assessment can then be the basis for the objectives of the next lesson. The figure below illustrates the process of assessment.
Figure 1. The Process of Assessment in the Teaching and Learning Context: a cycle of learning objectives, learning experience, and assessment, with assessment occurring at each stage.
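To make the idea of integrating several sources into one overall assessment concrete, here is a minimal Python sketch of a weighted final grade. The components, scores, and weights are hypothetical examples only and are not prescribed by the text.

    # Minimal sketch: integrating multiple assessment sources into one final grade.
    # The component names, scores, and weights below are hypothetical examples.
    components = {
        # source: (score obtained, highest possible score, weight)
        "quizzes":    (42, 50, 0.20),
        "recitation": (18, 20, 0.10),
        "long_tests": (85, 100, 0.30),
        "project":    (27, 30, 0.15),
        "final_exam": (70, 100, 0.25),
    }

    # The weights are expected to sum to 1.0 so the grade stays on a 0-100 scale.
    assert abs(sum(w for _, _, w in components.values()) - 1.0) < 1e-9

    final_grade = sum((score / total) * 100 * weight
                      for score, total, weight in components.values())
    print(f"Final grade: {final_grade:.1f}")  # 82.3 for the sample figures above

In practice, a teacher could extend a sketch like this to flag sources whose results are inconsistent with one another, echoing the point that individual measures should be examined before they are integrated.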
Forms of Assessment

Assessment comes in different forms. It can be classified as qualitative or quantitative, structured or unstructured, and objective or subjective.

Quantitative and Qualitative

Assessment is not limited to quantitative values; it can also be qualitative. Examples of qualitative assessments are anecdotal records, written reports, and written observations in narrative form. Qualitative assessments provide a narrative description of students' attributes, such as their strengths and weaknesses, areas that need to be improved, and specific incidents that support the identified areas of strength and weakness. Quantitative assessments use numbers to represent attributes. The advantages of quantification were described in Lesson 2; quantitative results facilitate accurate interpretation. Assessment can also be a combination of qualitative and quantitative results.

Structured vs. Unstructured

Assessment can come in a structured or unstructured way of gathering data. Structured forms of assessment are controlled, formal, and involve careful planning and organized implementation. An example of a structured (formal) assessment is a final exam that is announced, for which students are given enough time to study, whose coverage is provided, and whose test items are reviewed. A graded recitation can also be a structured form of assessment when it is announced, the questions are prepared, and students are informed of how their answers will be graded. Unstructured assessment is informal in its processes. Examples are a short unannounced quiz just to check whether students remember the past lesson, informal recitations during discussion, and assignments arising from the discussion.

Objective vs. Subjective

Assessment can be objective or subjective. Objective assessment has less variation in results, such as objective tests, seatwork, and performance assessments scored with rubrics with right and wrong answers. Subjective assessment, on the other hand, results in larger variation, such as essays and reaction papers. Careful procedures should be undertaken, as much as possible, to ensure objectivity in assessing essays and reaction papers.

Components of Classroom Assessment

Tests

Tests are basically tools that measure a sample of behavior. A variety of tests are generally given inside the classroom. They can be in the form of a quiz, a long test (usually covering smaller units or chapters of a lesson), or a final exam. The majority of tests for students are teacher-made tests. These tests are tailored for the students depending on the lessons covered by the syllabus. The tests are usually checked by colleagues to ensure that items are properly constructed.
Teacher-made tests vary in form: a unit test, a chapter test, or a long test. These generally assess how much a student has learned within a unit or chapter. They are summative in the sense that they are given after instruction, and the coverage is only what has been taught in a given chapter or tackled within a given unit.

Tests also come in the form of a quiz, which is a short form of assessment. It usually measures how much the student acquired within a given period or class. The questions usually come from what has been taught within the lesson for the day or a topic tackled over a short period of time, say a week. A quiz can be summative or formative: summative if it aims to measure the learning from instruction, or formative if it aims to test how much the students already know prior to instruction. The results of a quiz can be used by the teacher to decide where to start the lesson (for example, if the students already know how to add single digits, the teacher can proceed to adding double digits). The results can also determine whether the objectives for the day were met.

Recitation

A recitation is a verbal way of assessing students' expression of their answers to some stimuli provided in the instruction or by the teacher. It is a kind of assessment in which oral participation of the student is expected. It serves many functions: before instruction, to probe students' prior knowledge about the topic; during instruction, when the teacher solicits ideas from the class regarding the topic; and after instruction, to assess how much the students learned from the lesson for the day. Recitations are facilitated by questions provided by the teacher and are meant to make students think in order to answer.

There are many purposes of recitation. A recitation is given if teachers want to assess whether students can recall facts and events from the previous lesson. A recitation can also be done to check whether a student understands the lesson or can go further into higher cognitive skills. Measuring higher-order cognitive skills during recitation depends on the kind of questions that the teacher provides.

Appraising a recitation can be structured or unstructured. Some teachers announce the recitation and its coverage beforehand to allow students to prepare; the questions are prepared, and a system for scoring the answers is provided as well. Informal recitations are simply noted by the teacher. Effective recitations inside the classroom are marked by all students having an equal chance of being called. Some concerns of teachers regarding the recitation process are as follows: Should the teacher call more often on students who are silent most of the time in class? Should the teacher more often ask students who cannot easily comprehend the lesson? Should recitation be a surprise? Are difficult questions addressed to disruptive students? Are easy questions reserved for students who are not performing well in class?

Projects

Projects can come in a variety of forms depending on the objectives of the lesson; a reaction paper, a drawing, or a class demonstration can all be considered projects depending on the purpose. The features of a project should include (1) tasks that are relevant to real-life settings, (2) higher-order cognitive skills, (3) opportunities to assess and demonstrate affective
and psychomotor skills, which supplement instruction, and (4) application of the theories taught in class.

Performance Assessment

Performance assessment is a form of assessment that requires students to perform a task rather than select an answer from a ready-made list. Examples are students demonstrating their communication skill through a presentation, building a diorama, or performing a dance number showing different stunts in a physical education class. Performance assessment can take the form of extended-response exercises, extended tasks, and portfolios. Extended-response exercises are usually open-ended: students are asked to report their insights on an issue, their reactions to a film, or their opinions on an event. Extended tasks are more focused and require specific skills and time, such as writing an essay, composing a poem, planning and creating a script for a play, or painting a vase. These tasks are usually extended as an assignment if the time in school is not sufficient. Portfolios are collections of students' work: for an art class, the students compile all the paintings they made; for a music class, all compositions are collected; for a drafting class, all drawings are compiled. Table 4 shows the different outcomes that require performance assessment.

Table 4
Outcomes Requiring Performance Assessment

Skills: Speaking, writing, listening, oral reading, performing experiments, drawing, playing a musical instrument, gymnastics, work skills, study skills, and social skills
Work habits: Effectiveness in planning, use of time, use of equipment and resources, and the demonstration of such traits as initiative, creativity, persistence, and dependability
Social attitudes: Concern for the welfare of others, respect for laws, respect for the property of others, sensitivity to social issues, concern for social institutions, desire to work toward social improvement
Scientific attitudes: Open-mindedness, willingness to suspend judgment, sensitivity to cause-effect relations, an inquiring mind
Interests: Expressed feelings toward various educational, mechanical, aesthetic, scientific, social, recreational, and vocational activities
Appreciations: Feelings of satisfaction and enjoyment expressed toward music, art, literature, physical skill, and outstanding social contributions
Adjustments: Relationship to peers, reaction to praise, criticism, and authority, emotional stability, social adaptability

Assignments

An assignment is a kind of assessment that extends classroom work. It is usually a take-home task that the student completes. It may vary from reading a material, problem solving, and research to other tasks that are accomplishable in a given time. Assignments are used to supplement a learning task or as preparation for the next lesson.
Assignments are meant to reinforce what is taught inside the classroom. Tasks in the assignment are specified during instruction, and students carry out these tasks outside of school. When the students come back, the assignment should have helped them learn the lesson better.

Paradigm Shifts in the Practice of Assessment

Over the years, the practice of assessment has changed due to improvements in teaching and learning principles. These principles are the result of research that called for more information on how learning takes place. The shift is shown below, from old practices to what should be ideal in the classroom.

From                      To
Testing                   Alternative assessment
Paper and pencil          Performance assessment
Multiple choice           Supply
Single correct answer     Many correct answers
Summative                 Formative
Outcome only              Process and outcome
Skill focused             Task-based
Isolated facts            Application of knowledge
Decontextualized task     Contextualized task
External evaluator        Student self-evaluation
Outcome oriented          Process and outcome
The old practice of assessment focuses on traditional forms such as paper-and-pencil tests with a single correct answer, usually conducted at the end of the lesson. In the contemporary perspective, assessment is not necessarily in the form of paper-and-pencil tests because there are skills that are better captured through performance assessment, such as presentations, psychomotor tasks, and demonstrations. Contemporary practice welcomes a variety of answers from students, who are allowed to make interpretations of their own learning. It is now accepted that assessment is conducted concurrently with instruction rather than serving only a summative function. There is also a shift toward assessment tasks that are contextualized and have more utility: rather than being asked for the definitions of verbs, nouns, and pronouns, students are required to produce an oral or written communication about their favorite book. It is also important that students assess their own performance to facilitate self-monitoring and self-evaluation.
Activity: Conduct a simple survey and administer the following questionnaire to teachers.

Gender: ___ Male ___ Female
Years of teaching experience: ________
Subject currently handled: ____________________

Rate each item as Always, Often, Sometimes, Rarely, or Never:

1. My students collect their works in a portfolio.
2. I look at both the process and the final work in assessing students' tasks.
3. I welcome varied answers among my students during recitation.
4. I announce to my students the criteria on how they are graded in their work.
5. I provide feedback on my students' performance often.
6. I use performance assessment when paper-and-pencil tests are not appropriate.
7. I use other forms of informal assessment.
8. The students' final grade in my course is based on multiple assessments.
9. The students grade their group members during a group activity aside from the grade I give.
10. I believe that my students' grades are not conclusive.
Uses of Assessment

Assessment results have a variety of applications, from selection to appraisal to aiding the decision-making process. These functions of assessment vary within the educational setting, whether it is conducted for human resources, counseling, instruction, research, or learning.

1. Appraising

Assessment is used for appraisal. Forms of appraisal are grades, scores, ratings, and feedback. Appraisals are used to provide feedback on an individual's performance to determine how much improvement could be made. A low appraisal or negative feedback indicates that
performance still has room for improvement, while a high appraisal or positive feedback means that performance needs to be maintained.

2. Clarifying Instructional Objectives

Assessment results are used to improve succeeding lessons. Assessment results point out whether the objectives of a specific lesson were met, and teachers use this outcome in planning the next lesson. If a teacher finds out that the majority of students failed a test or quiz, then the teacher assesses whether the objectives were too high or not appropriate for the students' cognitive development. Objectives are then reformulated to approximate students' ability and performance within their developmental stage. Assessment results also have implications for the objectives of succeeding lessons. Since the teacher is able to determine the students' performance and difficulties, the teacher designs the necessary intervention to address them. A teacher being able to address students' deficiencies based on assessment results is reflective of effective teaching performance.

3. Determining and Reporting Pupil Achievement of Educational Objectives

The basic function of assessment is to determine students' grades and report their scores after major tests. The reported grade communicates students' performance to many stakeholders, such as teachers, parents, guidance counselors, administrators, and other concerned personnel. The reported standing of students in their learning shows how much they have attained the instructional objectives set for them. The grade is a reflection of how much they have accomplished the learning goals.

4. Planning, Directing, and Improving Learning Experiences

Assessment results are a basis for improving the implementation of instruction. Assessment results from students serve as feedback on the effectiveness of the instruction or the learning experience provided by the teacher. If the majority of students have not mastered the lesson, the teacher needs to come up with more effective instruction to target mastery for all students.

5. Accountability and Program Evaluation

Assessment results are used for evaluation and accountability. Multiple assessment information is used in making judgments about individuals or educational programs. The results of evaluations make the administrators or those who implemented the program accountable to the stakeholders and other recipients of the program. This accountability ensures that program implementation is improved depending on the recommendations from the evaluations conducted. Improvement takes place if assessment coincides with accountability.

6. Counseling

Counseling also uses a variety of assessment results. Variables such as study habits, attention, personality, and dispositions are assessed in order to help students improve them.
Students who are assessed to be easily distracted inside the classroom can be helped by the school counselor by focusing the counseling session on devising ways to improve the student's attention. A student who is assessed to have difficulties with classroom tasks is taught to self-regulate during the counseling session. Students' personalities and vocational interests are also assessed to guide them toward future courses suitable for them to take.

7. Selecting

Assessment is conducted in order to select students for placement in the honor roll or in pilot sections. Assessment is also conducted to select, from among student enrollees, those who will be accepted into a school, college, or university. Recipients of scholarships and other grants are also chosen based on assessment results.
Guide Questions:
1. What are the other uses of assessment?
2. What major decisions in the educational setting need to be backed up by assessment results?
3. What things are assessed in your school aside from the selection of students and the reporting of grades?

References

Bloom, B. (1970). Toward a theory of testing which includes measurement-assessment-evaluation. In M. C. Wittrock & D. E. Wiley (Eds.), The evaluation of instruction: Issues and problems (pp. 25-69). New York: Holt, Rinehart, & Winston.
Chen, H. (2005). Practical program evaluation. Beverly Hills, CA: Sage.
Fitzpatrick, J. L., Sanders, J. R., & Worthen, B. R. (2004). Program evaluation: Alternative approaches and practical guidelines (3rd ed.). New York: Pearson.
Gronlund, N. E. (1993). How to write achievement tests and assessment (5th ed.). Needham Heights: Allyn & Bacon.
Huba, M. E. & Freed, J. E. (2000). Learner-centered assessment on college campuses: Shifting the focus from teaching to learning. Boston: Allyn & Bacon.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards (2nd ed.). Thousand Oaks, CA: Sage.
Magno, C. (2007). Program evaluation of the civic welfare training services (Tech. Rep. No. 3). Manila, Philippines: De La Salle-College of Saint Benilde, Center for Learning and Performance Assessment.
McMillan, J. H. (2001). Classroom assessment: Principles and practice for effective instruction. Boston: Allyn & Bacon.
Nunnally, J. C. (1970). Introduction to psychological measurement. New York: McGraw Hill.
Popham, W. J. (1998). Classroom assessment: What teachers need to know (2nd ed.). Needham Heights, MA: Allyn & Bacon.
Scriven, M. (1967). The methodology of evaluation: Perspectives of curriculum evaluation. Chicago: Rand McNally.
Chapter 2
The Learning Intents

Chapter Objectives
1. Describe frameworks of the various taxonomic tools.
2. Compare and contrast the various taxonomic tools for setting the learning intents.
3. Justify the use of taxonomic tools in assessment planning.
4. Formulate appropriate learning intents.
5. Use the taxonomic tools in formulating the learning intents.
6. Evaluate the learning intents on the basis of the taxonomic framework in use.

Lessons
1. The Conventional Taxonomic Tools: Bloom's Taxonomy; The Revised Taxonomy
2. The Alternative Taxonomic Tools: Gagne's taxonomic guide; Stiggins & Conklin's taxonomic categories; The New Taxonomy; The Thinking Hats
3. Specificity of the Learning Intents
Lesson 1: The Taxonomic Tools

Having learned about measurement, assessment, and evaluation, this chapter brings you to the discussion of the learning intents, which refer to the objectives or targets the teacher sets as the competency to build in the students. This is the target skill or capacity that you want students to develop as they engage in the learning episodes. The same competency is what you will soon assess using relevant tools to generate quantitative and qualitative information about your students' learning behavior. Prior to designing your learning activities and assessment tasks, you first have to formulate your learning intents. These intents exemplify the competency you wish students to develop in themselves. At this point, a deep understanding of how learning intents should be formulated is very useful. As you go through this chapter, your knowledge of the guidelines for formulating these learning intents will help you understand how assessment tasks should be defined.

In formulating learning intents, it is helpful to be aware that appropriate targets of learning come in different forms because learning environments differ in many ways. What is crucial is identifying which intents are more important than others so that they are given appropriate priority. When you formulate statements of learning intents, it is important that you have a strong grasp of some theories of learning, as these will aid you in determining what competencies could possibly be developed in the students. If you are familiar with Bloom's taxonomy, dust off your understanding of it so that you can make good use of it.
Figure 1. Bloom's Taxonomy: a hierarchy of six cognitive levels, from Knowledge and Comprehension at the base, through Application and Analysis, to Synthesis and Evaluation at the top.
Figure 1 shows a guide for teachers in stating learning intents based on six dimensions of cognitive process. Knowledge, the level with the lowest degree of complexity, includes simple cognitive activity such as the recall or recognition of information. The cognitive activity in comprehension includes understanding information and concepts, translating them into other forms of communication without altering the original sense, interpreting them, and drawing conclusions from them. For application, the emphasis is on students' ability to use previously acquired information, understanding, and other prior knowledge in new settings and applied contexts different from those in which it was learned. For learning intents stated at the analysis level, tasks require the identification and connection of logic and the differentiation of concepts based on logical sequence and contradictions; learning intents written at this level indicate behaviors that show the ability to differentiate among information, opinions, and inferences. Learning intents at the synthesis level are stated in ways that indicate students' ability to produce a meaningful and original whole out of the available information, understanding, contexts, and logical connections. Evaluation includes students' ability to make judgments and sound decisions based on defensible criteria; judgments include the worth, relevance, and value of some information, ideas, concepts, theories, rules, methods, opinions, or products.

Comprehension requires knowledge, since information is required in understanding it. A good understanding of information can facilitate its application. Analysis requires the first three cognitive activities. Both synthesis and evaluation require knowledge, comprehension, application, and analysis; however, evaluation does not require synthesis, and synthesis does not require evaluation.

Forty-five years after the birth of Bloom's original taxonomy, a revised version, developed by Anderson and Krathwohl, has come into teaching practice. Statements that describe intended learning outcomes as a result of instruction are framed in terms of some subject matter content and the action required with that content. To eliminate the anomaly of unidimensionality in statements of learning intents that use noun phrases and verbs together, the revised taxonomy, shown in Figure 3, separates two dimensions of learning: the knowledge dimension and the cognitive process dimension.

The knowledge dimension has four categories, three of which include the subcategories of knowledge in the original taxonomy. The fourth, however, is a new one, something that was not yet gaining massive popularity at the time the original taxonomy was conceived. It is new and, at the same time, important in that it includes strategic knowledge, knowledge about cognitive tasks, and self-knowledge.

Factual knowledge. This includes knowledge of specific information, its details, and other elements therein. Students make use of this knowledge to become familiar with the subject matter or to propose solutions to problems within the discipline.

Conceptual knowledge. This includes knowledge about the connectedness of information and other elements to a larger structure of thought so that a holistic view of the subject matter or discipline is formed. Students classify, categorize, or generalize ideas into meaningful structures and models.
Procedural knowledge. This category includes knowledge of how to do procedural tasks that require specific skills and methods. Students also know the criteria for using the procedures at appropriate levels.

Metacognitive knowledge. This involves cognition in general as well as the awareness and knowledge of one's own cognition. Students know how they are thinking and become aware of the contexts and conditions within which they are learning.
Figure 3. Sample Objectives Using the Revised Taxonomy. The figure is a matrix crossing the knowledge dimension (Factual, Conceptual, Procedural, Metacognitive) with the cognitive process dimension (Remember, Understand, Apply, Analyze, Evaluate, Create); sample objectives #1 to #4 below are plotted in the cells where they belong.
#1: Remember the characters of the story, "Family Adventure."
#2: Compare the roles of at least three characters of the story.
#3: Evaluate the story according to specific criteria.
#4: Recall personal strategies used in understanding the story.

The cognitive process dimension is where specific behaviors are pegged, using active verbs. For consistency in the description of specific learning behaviors, the categories of the original taxonomy, which were labeled in noun form, are now replaced with their verb counterparts, and Synthesis (now Create) changed places with Evaluation.

Remember. This includes recalling and recognizing relevant knowledge from long-term memory.
Understand. This is the determination of the meanings of messages from oral, written, or graphic sources.
Apply. This involves carrying out procedural tasks, executing or implementing them in particular realistic contexts.
Analyze. This includes breaking down concepts into clusters or chunks of ideas and meaningfully relating them together with other dimensions.
Evaluate. This is making judgments relative to clear standards or defensible criteria to critically check for depth, consistency, relevance, acceptability, and other areas.
Create. This includes putting together ideas, concepts, information, and other elements to produce a complex, original, and meaningful whole as an outcome.

The use of the revised taxonomy in different programs has benefited both teachers and students in many ways (Ferguson, 2002; Byrd, 2002). The benefits generally come from the fact that the revised taxonomy provides clear dimensions of knowledge and cognitive processes on which to focus in the instructional plan. It also allows teachers to set targets for metacognition concurrently with other knowledge dimensions, which is difficult to do with the old taxonomy.
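To make the two-dimensional structure concrete, the short Python sketch below stores the sample objectives from Figure 3 as pairs of knowledge dimension and cognitive process. The cell assignments for objectives #2 and #3 are illustrative assumptions rather than placements fixed by the figure.

    # Minimal sketch: learning intents as cells of the revised taxonomy matrix.
    # The placements of objectives #2 and #3 are illustrative assumptions.
    KNOWLEDGE = ["Factual", "Conceptual", "Procedural", "Metacognitive"]
    PROCESSES = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]

    objectives = [
        # (statement, knowledge dimension, cognitive process)
        ("Remember the characters of the story 'Family Adventure'.", "Factual", "Remember"),
        ("Compare the roles of at least three characters of the story.", "Conceptual", "Understand"),
        ("Evaluate the story according to specific criteria.", "Conceptual", "Evaluate"),
        ("Recall personal strategies used in understanding the story.", "Metacognitive", "Remember"),
    ]

    for statement, knowledge, process in objectives:
        assert knowledge in KNOWLEDGE and process in PROCESSES  # keep each cell valid
        print(f"[{knowledge} x {process}] {statement}")

Laying intents out this way makes it easy to see at a glance whether a unit plan stays in the lower cells or spreads across the higher cognitive processes as well.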
Lesson 2: The Alternative Taxonomic Tools

Bloom's taxonomy and the revised taxonomy are not the only existing taxonomic tools for setting our instructional targets; there are other equally useful taxonomies. One of these was developed by Robert M. Gagne. In his theory of instruction, Gagne sought to help teachers make sound educational decisions so that the probability of achieving the desired results in learning is high. These decisions necessitate the setting of intentional goals that assure learning. In stating learning intents using Gagne's taxonomy, we can focus on three domains. The cognitive domain includes declarative (verbal information), procedural (intellectual skills), and conditional (cognitive strategies) knowledge. The psychological domain includes affective knowledge (attitudes). The psychomotor domain involves the use of physical movement (motor skills).

Verbal Information includes a vast body of organized knowledge that students acquire through formal instructional processes and other media, such as television. Students understand the meaning of concepts rather than just memorizing them. This condition of learning lumps together the first two cognitive categories of Bloom's taxonomy. Learning intents must focus on differentiating content in texts and other modes of communication, chunking information into meaningful subsets, and remembering and organizing information.

Intellectual Skills include procedural knowledge that ranges from discrimination, to concrete concepts, to defined concepts, to rules, and to higher order rules. Discrimination involves the ability to distinguish objects, features, or symbols; detecting a difference does not require naming or explanation. Concrete concepts involve the identification of classes of objects, features, or events, such as differentiating objects according to concrete features like shape. Defined concepts include classifying new and contextual examples of ideas, concepts, or events by their definitions; here, students make use of labels or terms denoting defined concepts for certain events or conditions. Rules apply a single relationship to solve a group of problems; the problem to be solved is simple, requiring conformance to only one simple rule.
Higher order rules include the application of a combination of rules to solve a complex problem. The problem to be solved requires the use of complex formulas or rules so that meaningful answers are arrived at. Learning intents stated at this level of the cognitive domain must give attention to the abilities to spot distinctive features, use information from memory to respond to intellectual tasks in various contexts, and make connections between concepts and relate them to appropriate situations.

Cognitive Strategies consist of a number of ways of making students develop skills in guiding and directing their own thinking, actions, feelings, and learning process as a whole. Students create and hone their metacognitive strategies. These processes help them regulate and oversee their own learning and consist of planning and monitoring their cognitive activities as well as checking the outcomes of those activities. Learning intents should emphasize the ability to describe and demonstrate original and creative strategies that students have tried out in various conditions.

Attitudes are internal states of being that are acquired through earlier experiences of task engagement. These states influence the choice of personal response to things, events, persons, opinions, concepts, and theories. Statements of learning intents must establish a degree of success associated with the desired attitude, call for the demonstration of personal choice of actions and resources, and allow observation in real-world and human contexts.

Motor Skills are well-defined, precise, smooth, and accurately timed executions of performances involving the use of body parts. Some cognitive skills are required for the proper execution of motor activities. Learning intents drawn at this domain should focus on the execution of fine and well-coordinated movements and actions relative to the use of known information, with an acceptable degree of mastery and accuracy of performance.

Another taxonomic tool was developed by Stiggins and Conklin (1992), which involves categories of learning as bases for stating learning intents.

Knowledge. This includes simple understanding and mastery of a great deal of subject matter, processes, and procedures. Very fundamental to the succeeding stages of learning is knowledge and simple understanding of the subject matter. This learning may take the form of remembering facts, figures, events, and other pertinent information, or describing, explaining, and summarizing concepts and citing examples. Learning intents must endeavor to develop mastery of facts and information as well as simple understanding and comprehension of them.

Reasoning. This indicates the ability to use deep knowledge of subject matter and procedures to reason defensibly and solve problems with efficiency. Tasks under this category include critical and creative thinking, problem solving, making judgments and decisions, and other higher order thinking skills. Learning intents must, therefore, focus on the use of knowledge and simple understanding of information and concepts to reason and solve problems in context.

Skills. This highlights the ability to demonstrate skills and perform tasks with an acceptable degree of mastery and adeptness. Skills involve overt behaviors that show knowledge and deep understanding. For this category, learning intents have to take particular interest in the demonstration of overt behaviors or skills in actual performances that require procedural knowledge and reasoning.

Products. In this area, the ability to create and produce outputs for submission or oral presentation is given importance. Because outputs generally represent mastery of knowledge, deep understanding, and skills, they must be considered products that demonstrate the ability to use that knowledge and deep understanding and to employ skills in a strategic manner so that tangible products are created. For the statement of learning intents, teachers must state the expected outcomes, whether process- or product-oriented.

Affect. The focus is on the development of values, interests, motivation, attitudes, self-regulation, and other affective states. In stating learning intents in this category, it is important that clear indicators of affective behavior can easily be drawn from the expected learning tasks. Although many teachers find it difficult to determine indicators of affective learning, it is inspiring to realize that it is not impossible to assess.
These categories of learning by Stiggins and Conklin are helpful especially if your intents focus on complex intellectual skills and the use of these skills in producing outcomes to increase self-efficacy among students. In attempting to formulate statements of learning outcomes in any category, you can be clear about what performance you want to see at the end of the instruction. In terms of assessment, you would know exactly what to do and what tools to use in assessing learning behaviors based on the expected performance. Although stating learning outcomes in the affective category is not as easy as in the knowledge and skill categories, trying it can help you approximate the degree of engagement and motivation required to perform what is expected. If you would like to give prominence to this category without stating another learning intent that particularly focuses on affective states, you might simply look for some indicators in the cognitive intents. This is possible because knowledge, skills, and attitudes are embedded in every single statement of learning intent.

Another alternative guide for setting learning targets is one introduced by Robert J. Marzano in his Dimensions of Learning (DOL). As a taxonomic tool, the DOL provides a framework for assessing various types of knowledge as well as different aspects of processing, comprising six levels of learning in a taxonomic model called the new taxonomy (Marzano & Kendall, 2007). These levels of learning are categorized into different systems.

The Cognitive System

The cognitive system includes those cognitive processes that effectively use or manipulate information, mental procedures, and psychomotor procedures in order to successfully complete a task. It covers the first four levels of learning:

Level 1: Retrieval. At this level of the cognitive system, students engage in mental operations for the recognition and retrieval of information, mental procedures, or psychomotor procedures. Students engage in recognizing, where they identify the characteristics, attributes, qualities, aspects, or elements of information, a mental procedure, or a psychomotor procedure;
recalling, where they remember relevant features of information, a mental procedure, or a psychomotor procedure; or executing, where they carry out a specific mental or psychomotor procedure. Neither an understanding of the structure and value of the information nor the how's and why's of the mental or psychomotor procedure is necessary.

Level 2: Comprehension. As the second level of the cognitive system, comprehension includes students' ability to represent and organize information, mental procedures, or psychomotor procedures. It involves symbolizing, where students create a symbolic representation of the information, concept, or procedure with a clear differentiation of its critical and noncritical aspects; or integrating, where they put together pieces of information into a meaningful structure of knowledge or procedure and identify its critical and noncritical aspects.

Level 3: Analysis. This level of the cognitive system includes more manipulation of information, mental procedures, or psychomotor procedures. Here students engage in analyzing errors, where they spot errors in the information, mental procedure, or psychomotor procedure and in its use; classifying the information or procedures into general categories and their subcategories; generalizing, by formulating new principles or generalizations based on the information, concept, mental procedure, or psychomotor procedure; matching components of knowledge, by identifying important similarities and differences between the components; and specifying applications or logical consequences of the knowledge in terms of what predictions can be made and proven about the information, mental procedure, or psychomotor procedure.

Level 4: Knowledge Utilization. The highest level of the cognitive system involves the appropriate use of knowledge. At this level, students put information, mental procedures, or psychomotor procedures to appropriate use in various contexts. It allows for investigating a phenomenon using certain information or procedures, or investigating the information or procedure itself; experimenting, where information or procedures are used to test hypotheses, or hypotheses are generated from the information or procedures; problem solving, where students use the knowledge to solve a problem, or solve a problem about the knowledge itself; and decision making, where the use of information or procedures helps arrive at a decision, or a decision is made about the knowledge itself.

The Metacognitive System

The metacognitive system involves students' personal agency in setting appropriate goals for their learning and monitoring how they go through the learning process. Being the fifth level of the new taxonomy, the metacognitive system includes such learning targets as specifying goals, where students set goals in learning the information or procedures and make a plan of action for achieving those goals; process monitoring, where students monitor how they go about the action they decided to take and find out whether the action taken effectively serves their plan for learning the information or procedures; clarity monitoring, where students determine how much clarity has been achieved about the knowledge in focus; and accuracy monitoring, where students see how accurately they have learned the information or procedures.

The Self System

Placed at the highest level of the new taxonomy, the self system is the level of learning that sustains students' engagement by activating motivational resources such as their self-
beliefs in terms of personal competence and the value of the task, their emotions, and their achievement-related goals. At this level, students reason about their motivational experiences. They reason about the value of the knowledge by examining the importance of the information or procedures in their personal lives; about their perceived competence by examining their efficacy in learning the information or procedures; about their affective experience in learning by examining their emotional response to the knowledge under study; and about their overall engagement by examining their motivation in learning the information or procedures.

In each system, three dimensions of knowledge are involved: information, mental procedures, and psychomotor procedures.

Information

The domain of informational knowledge involves various types of declarative knowledge that are ordered according to levels of complexity. From its most basic to its more complex levels, it includes vocabulary knowledge, in which the meanings of words are understood; factual knowledge, in which information constituting the characteristics of specific facts is understood; knowledge of time sequences, where an understanding of important events between certain time points is obtained; knowledge of generalizations of information, where pieces of information are understood in terms of their warranted abstractions; and knowledge of principles, in which causal or correlational relationships among pieces of information are understood. The first three types of informational knowledge focus on knowledge of informational details, while the next two focus on informational organization.

Mental Procedures

The domain of mental procedures involves those types of procedural knowledge that make use of the cognitive processes in a special way. In its hierarchic structure, mental procedures can be as simple as the use of a single rule, in which production is guided by a small set of rules requiring a single action. If single rules are combined into general rules and used to carry out an action, the mental procedures are of a tactical type, or an algorithm, especially if specific steps are set for specific outcomes. Macroprocedures are at the top of the hierarchy of mental procedures and involve the execution of multiple interrelated processes and procedures.

Psychomotor Procedures

The domain of psychomotor procedures involves the physical procedures for completing a task. In the new taxonomy, psychomotor procedures are considered a dimension of knowledge because, very much like mental procedures, they are regulated by the memory system and develop in a sequence from information to practice, then to automaticity (Marzano & Kendall, 2007).

In summary, the new taxonomy of Marzano and Kendall (2007) provides a multidimensional taxonomy in which each system of thinking comprises three dimensions of knowledge, guiding us in setting learning targets for our classrooms. Table 2a shows the matrix of the thinking systems and dimensions of knowledge.
Systems of Thinking                                    Dimensions of Knowledge
                                                       Information   Mental Procedure   Psychomotor Procedure
Level 6 (Self System)
Level 5 (Metacognitive System)
Level 4: Knowledge Utilization (Cognitive System)
Level 3: Analysis (Cognitive System)
Level 2: Comprehension (Cognitive System)
Level 1: Retrieval (Cognitive System)
Now, if you wish to explore other alternative tools for setting your learning objectives, here is another aid for targeting the more complex learning outcomes, this one from Edward de Bono (1985). There are six thinking hats, each of which is named for a color that represents a specific perspective. When these hats are "worn" by the student, information, issues, concepts, theories, and principles are viewed in ways that are descriptive of the mnemonically associated perspectives of the different hats. Say your learning intent requires students to mentally put on the white hat, whose descriptive mental processes include gathering information and thinking about how it can be obtained and whose emotional state is neutral; then the learning behaviors may include classifying facts and opinions, among others. It is essential to be conscious that each hat representing a particular perspective involves a frame of mind as well as an emotional state. Therefore, the perspective held by students when a hat is mentally worn would be a composite of mental and emotional states. Below is an attempt to summarize these six thinking hats.
THE HATS

WHITE
  Perspective: Observer
  Representation: White paper, neutral
  Descriptive Behavior: Looking for needed objective facts and information, including how these can be obtained

RED
  Perspective: Self & others
  Representation: Fire, warmth
  Descriptive Behavior: Presenting views, feelings, emotions, and intuition without explanation or justification

BLACK
  Perspective: Self & others
  Representation: Stern judge wearing a black robe
  Descriptive Behavior: Judging with a logical negative view, looking for wrongs and playing the devil's advocate

YELLOW
  Perspective: Self & others
  Representation: Sunshine, optimism
  Descriptive Behavior: Looking for benefits and productivity with a logical positive view, seeing what is good in anything

GREEN
  Perspective: Self & others
  Representation: Vegetation
  Descriptive Behavior: Exploring possibilities and making hypotheses, composing new ideas with creative thinking

BLUE
  Perspective: Observer
  Representation: Sky, cool
  Descriptive Behavior: Establishing control of the process of thinking and engagement, using metacognition
Figure 5 Summative map of the Six Thinking Hats
These six thinking hats are beneficial not only in our teaching episodes but also in the learning intents that we set for our students. If qualities of thinking such as creative thinking, communication, decision-making, and metacognition are among those that you want to develop in your students, these six thinking hats could help you formulate statements of learning intents that clearly set the direction of learning. An added benefit is that when your intents are stated along the perspectives of these hats, the learning episodes can be defined easily. Consequently, assessment is made more meaningful.
A. Formulate statements of learning intent using the Revised Taxonomy, focusing on any category of the knowledge dimension but on the higher categories of the cognitive dimension.
B. Bring those statements of learning intents to Robert Gagne's taxonomy and see where they will fit. You may customize the statements a bit so that they fit well into any of Gagne's categories of learning.
C. Do the same process of fitting with Stiggins' categories of learning, then The New Taxonomy. Remember to customize the statements when necessary.
D. Draw insights from the process and share them in class.
Lesson 3: Specificity of the learning intent Learning intents usually come as relatively specific statements of the desired learning behavior or performance we would like to see in our students at the end of the instructional process. To make these intents facilitate relevant assessment, it is important that they are stated with very active verbs, those that represent clear actions or behaviors, so that indicators of performance are easily identified. These active verbs are an essential part of the statement of learning intents because they specify what the students actually do within and at the end of a specified period of time. In this case, assessment becomes convenient to do because it can specifically focus on the indicated behaviors or actions.
Gronlund (in McMillan, 2005) uses the term instructional objectives to mean intended learning outcomes. He emphasizes that instructional objectives should be stated in terms of specific, observable, and measurable student responses.
In writing statements of learning intents for the courses we teach, we aim to state the behavioral outcomes to which our teaching efforts are devoted, so that, from these statements, we can design specific tasks in the learning episodes for our students to engage in. However, we need to make sure that these statements are set at the proper level of generality so that they neither oversimplify nor overcomplicate the outcome. A statement of intent could have a rather wide range of generality, so that many suboutcomes may be indicated. Learning intents that are stated in general terms will need to be defined further by a sample of the specific types of student performance that characterize the intent. In doing this, assessment becomes easy because the performance is clearly defined. Unlike general statements of intent, which may permit the use of not-so-active verbs such as know, comprehend, understand, and so on, the specific ones use active verbs in order to define the specific behaviors that will soon be assessed. The selection of these verbs is vital in the preparation of a good statement of learning intent. Three points might help in selecting active verbs.
1. See that the verb clearly represents the desired learning intent.
2. Note that the verb precisely specifies acceptable performance by the student.
3. Make sure that the verb clearly describes the relevant assessment to be made within or at the end of the instruction.
The statement "students know the meaning of terms in science" is general. Although it gives us an idea of the general direction of the class towards the expected outcome, we might be confused as to what specific behaviors of knowing will be assessed. Therefore, it is necessary that we draw a representative sample of specific learning intents so that we will let students:
• write a definition of a particular scientific term
• identify the synonym of the word
• give the term that fits a given description
• present an example of the term
• represent the term with a picture
• describe the derivation of the term
• identify symbols that represent the term
• match the term with concepts
• use the term in a sentence
• describe the relationship of terms
• differentiate between terms
• use the term in
If these behaviors are stated completely as specific statements of learning intent, we can have a number of specific outcomes. To make specifically defined outcomes, the use of active verbs is helpful. If more specificity is desired, statements of condition and criterion level can be added to the learning intents. If you think that the statement "student can differentiate between facts and opinions" needs more specificity, then you might want to add a condition so that it will now sound like this:
Given a short selection, the student can identify statements of facts and of opinions. If more specificity is still desired, you might want to add a statement of criterion level. This time, the statement may sound like this: Given a short selection, the student can correctly identify at least 5 statements of facts and 5 statements of opinion in no more than five minutes without the aid of any resource materials.
The lesson plan may allow the use of moderately specific statements of learning intents, with condition and criterion level briefly stated. In doing assessment, however, these intents will have to be broken down to their substantial details, such that the condition and criterion level are specifically indicated. Note that it is not necessarily about choosing which one statement is better than the other. We can use them in planning for our teaching. Take a look at this:
Learning Intent:
Student will differentiate between facts and opinions from written texts.
Assessment:
Given a short selection, the student can correctly identify at least 5 statements of facts and 5 statements of opinion in no more than five minutes without the aid of any resource materials.
If you insert into the text the instructional activities or learning episodes in a well-described manner, as well as the materials needed (plus other entries specified in your context), you can now have a simple lesson plan.
Should the statement of learning intent be stated in terms of teacher performance or student performance that is to be demonstrated after the instruction? How do these two differ from each other? Should it be stated in terms of the learning process or learning outcome? How do these two differ from one another? Should it be subject-matter oriented or competency-oriented?
References:
Byrd, P. A. (2002). The revised taxonomy and prospective teachers. Theory into Practice, 41(4), 244.
Ferguson, C. (2002). Using the revised taxonomy to plan and deliver team-taught, integrated, thematic units. Theory into Practice, 41(4), 238.
Marzano, R. J., & Kendall, J. S. (2007). The new taxonomy of educational objectives (2nd ed.). CA: Sage Publications.
Stiggins & Conklin (1992).
Chapter 3 Characteristics of an Assessment Tool
Objectives
1. Determine the use of the different ways of establishing an assessment tool's validity and reliability.
2. Become familiar with the different methods of establishing an assessment tool's validity and reliability.
3. Assess how good an assessment tool is by determining its indices of validity, reliability, item discrimination, and item difficulty.
Lessons
1. Reliability: Test-retest, split-half, parallel forms, internal consistency, inter-rater reliability
2. Validity: Content, criterion-related, construct validity, divergent/convergent
3. Item Difficulty and Discrimination: Classical test theory approach: item analysis of difficulty and discrimination
4. Using a computer software in analyzing test items
Lesson 1 Reliability What makes a good assessment tool? How does one know that a test is good enough to be used? Educational assessment tools are judged by their ability to provide results that meet the needs of users. For example, a good test provides accurate findings about a student's achievement if users intend to determine achievement levels. The achievement results should also remain stable across different conditions so that they can be used over longer periods of time.
Figure: Characteristics of a good assessment tool: it should be reliable, valid, and able to discriminate traits.
A good assessment tool should be reliable, valid, and able to discriminate traits. You have probably encountered several tests available on the internet and in magazines that tell you what kind of personality you have, your interests, and your dispositions. In order to determine these characteristics accurately, the tests offered on the internet and in magazines should show you evidence that they are indeed valid and reliable. You need to be critical in selecting what test to use and consider well whether these tests are indeed valid and reliable. There are several ways of determining how reliable and valid an assessment tool is, depending on the nature of the variable and the purpose of the test. These techniques involve different statistical analyses, and this chapter also provides the procedures for their computation and interpretation. Reliability is the consistency of scores across conditions of time, forms, items, and raters. The consistency of results in an assessment tool is determined statistically using the correlation coefficient. You can refer to the corresponding section of this chapter to see how a correlation coefficient is estimated. The types of reliability will be explained in two ways: conceptually and analytically. Test-retest Reliability Test-retest reliability is the consistency of scores when the same test is retested on another occasion. For example, in order to determine whether a spelling test is reliable, the same spelling test is administered again to the same students at a different time. If the scores on the spelling test across the two occasions are the same, then the test is reliable. Test-retest is a measure of temporal stability, since the test score is tested for consistency across a time gap. The time gap between the two testing conditions can be within a week or a month; generally it does not exceed six months. Test-retest is more appropriate for variables that are stable, like psychomotor skills (typing test, block manipulation tests, grip strength), aptitude (spatial, discrimination,
visual rotation, syllogism, abstract reasoning, topology, figure-ground perception, surface assembly, object assembly), and temperament (extraversion/introversion, thinking/feeling, sensing/intuiting, judging/perceiving). To analyze the test-retest reliability of an assessment tool, the first and second sets of scores of a sample of test takers are correlated. The higher the correlation, the more reliable the test is.
Procedure for Correlating Scores for the Test-Retest Correlating two variables involves producing a linear relationship between the sets of scores. For example, a 50-item aptitude test was administered to 10 students at one time. Then it was administered again after two weeks to the same 10 students. The following scores were produced:

Student   Aptitude Test (Time 1)   Aptitude Retest (Time 2)
A                 45                        47
B                 30                        33
C                 20                        25
D                 15                        19
E                 26                        28
F                 20                        23
G                 35                        38
H                 26                        29
I                 10                        15
J                 27                        29
In the data presented, 'student A' got a score of 45 during the first administration of the aptitude test and, after two weeks, got a score of 47 on the same test. For 'student B,' a score of 30 was obtained on the first occasion and 33 after two weeks. The same goes for students C, D, E, F, G, H, I, and J. The scores on the test at Time 1 and on the retest at Time 2 are plotted in the graph called a scatterplot below. The projected straight line is called a regression line. The closer the plots are to the regression line, the stronger the relationship between the test and retest scores. If their relationship is strong, then the test scores are consistent and can be interpreted as reliable. To estimate the strength of the relationship, a correlation coefficient needs to be obtained. The correlation coefficient gives information about the magnitude, strength, significance, and variance of the relationship between two variables.
Figure: Scatterplot of Aptitude Retest (Time 2) against Aptitude Test (Time 1), with the fitted regression line Aptitude Retest (Time 2) = 5.2727 + 0.9184 × Aptitude Test (Time 1).
Different types of correlation coefficients are used depending on the level of measurement of a variable. Levels of measurement can be nominal, ordinal, interval, and ratio. More information about the levels of measurement is explained in the beginning chapters of any statistics book. Most commonly, assessment data are on interval scales. For interval and ratio or continuous variables, the statistic that estimates the correlation coefficient is the Pearson Product Moment correlation, or r. The r is computed using the formula:

r = [NΣXY − (ΣX)(ΣY)] / √{[NΣX² − (ΣX)²][NΣY² − (ΣY)²]}

where
r = correlation coefficient
N = number of cases (respondents, examinees)
ΣXY = summation of the products of X and Y
ΣX = summation of the first set of scores, designated as X
ΣY = summation of the second set of scores, designated as Y
ΣX² = sum of squares of the first set of scores
ΣY² = sum of squares of the second set of scores
To obtain the parameters ΣX, ΣY, ΣXY, ΣX², and ΣY², a table is set up.

Student   X (Time 1)   Y (Time 2)     XY        X²        Y²
A             45           47         2115      2025      2209
B             30           33          990       900      1089
C             20           25          500       400       625
D             15           19          285       225       361
E             26           28          728       676       784
F             20           23          460       400       529
G             35           38         1330      1225      1444
H             26           29          754       676       841
I             10           15          150       100       225
J             27           29          783       729       841
Sum       ΣX = 254     ΣY = 286    ΣXY = 8095  ΣX² = 7356  ΣY² = 8948
To obtain the value of 2115 in the fourth column (XY), simply multiply 45 and 47; 2025 in the fifth column is obtained by squaring 45 (45² or 45 × 45); and 2209 in the last column is obtained by squaring 47 (47² or 47 × 47). The same is done for each pair of scores in each row. The values of ΣX, ΣY, ΣXY, ΣX², and ΣY² are obtained by adding up, or summating, the entries from student A to student J. The values are then substituted in the equation for Pearson r:

r = [10(8095) − (254)(286)] / √{[10(7356) − (254)²][10(8948) − (286)²]}

r = .996

An obtained r value of .996 can be interpreted in four ways: magnitude, strength, significance, and variance. In terms of magnitude, the scatterplot shows that the scores project a regression line in which, as the aptitude test scores increase, the retest scores also increase. This magnitude is said to be positive. A positive magnitude indicates that as the X scores increase, the Y scores also increase. In cases where a correlation coefficient of -.996 is obtained, this indicates a negative relationship, in which the Y scores decrease as the X scores increase, or vice versa. For strength, the closer the correlation coefficient gets to 1.00 or -1.00, the stronger the relationship; the closer it is to 0, the weaker the relationship. A strong relationship indicates that the plots are very close to the projected linear regression line. In the case of the .996 correlation coefficient, it can be said that there is a very strong relationship between the scores of
the aptitude test and the retest scores. The following cut-offs can be used as a guide in determining the strength of the relationship:

Correlation Coefficient Value    Interpretation
0.80 – 1.00                      Very high relationship
0.60 – 0.79                      High relationship
0.40 – 0.59                      Substantial/marked relationship
0.20 – 0.39                      Low relationship
0.00 – 0.19                      Negligible relationship
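If you prefer to verify such computations by machine, here is a minimal sketch in Python with NumPy (our own choice of tools; the book's companion software is not reproduced here) that computes the test-retest correlation for the ten aptitude scores above:

```python
import numpy as np

# Time 1 and Time 2 aptitude scores for students A through J (from the table above)
time1 = np.array([45, 30, 20, 15, 26, 20, 35, 26, 10, 27])
time2 = np.array([47, 33, 25, 19, 28, 23, 38, 29, 15, 29])

# Pearson product-moment correlation between the two administrations
r = np.corrcoef(time1, time2)[0, 1]

# Shared variance (coefficient of determination)
r_squared = r ** 2

print(f"test-retest r = {r:.3f}")               # about .996
print(f"variance explained = {r_squared:.3f}")  # about .992
```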
For significance, the test determines whether the odds favor the demonstrated relationship between X and Y being real as opposed to being due to chance. If the odds favor the relationship being real, then the relationship is said to be significant. Consult a statistics book for a detailed explanation of testing the significance of r. To test whether a correlation coefficient of .996 is significant, it is compared with a critical value of r. The critical values for r are found in Appendix A of this book. Assuming that the probability of error is set at the .05 alpha level (meaning that the probability [p] is less than [<] 5 out of 100 [.05] that the demonstrated relationship is due to chance) (DiLeonardi & Curtis, 1992), and the degrees of freedom are 8 (df = N − 2, 8 = 10 − 2), a critical value of .632 is obtained. The value .632 is the intersecting value in Appendix A for df = 8 and an alpha level of .05. Significance is established when the obtained value is greater than the critical value. In this case, since .996 is greater than .632, there is a significant relationship between the aptitude test and the retest scores. The variance is interpreted as the amount of overlap between X and Y. It is read as the "percentage of the time that the variability in X accounts for or explains the variability in Y." Variance is determined by squaring the correlation coefficient (r²). For the given data set, the variance would be r² = .996² = .992, or 99.2 percent (.992 × 100). To interpret this value: "99.2 percent of the time, the scores during the first aptitude test account for or explain the scores during the retest." Generally, a correlation coefficient of .996 indicates that the aptitude scores at test and retest are highly reliable or consistent, since the value is very strong and significant. Software is provided with this book to help you compute test-retest correlation coefficients and the other techniques for establishing reliability and validity. A detailed demonstration of using the software is found at the end of this chapter. Parallel Form or Alternate Form Reliability In this technique, two tests are used that are equivalent in difficulty, format, number of items, and the specific skills measured. One of the equivalent forms is administered to the same examinees on one occasion and the other on a different occasion. Parallel form reliability is a measure of both temporal stability and consistency of responses. Since the two tests are administered separately across time, it is a measure of temporal stability like the test-retest; but on the second occasion, what is administered is not the exact same test but an equivalent form of it. Assuming that the two forms really measure the same characteristics, there should be consistency in the scores. Parallel forms can be used for affective and cognitive measures in general, as long as equivalent forms of the test are available.
Reliability is determined by correlating the scores from the first form and the second form. In most cases, Form A of the test is correlated with Form B. A strong and significant relationship indicates equivalence and consistency of the two forms. Split-half Reliability In split-half, the test is split into two parts, and the scores for each part should show consistency. The logic behind splitting the test into two parts is to determine whether the scores within the same test are internally consistent or homogeneous. There are many ways of splitting a test into two halves. One is to randomly distribute the items equally into two halves. Another is to separate the odd-numbered items from the even-numbered items. In doing split-half reliability, one ensures that the test contains a large number of items so that several items remain in each half. The assumption is that more items make a test more reliable; it follows that the more items in a test, the more reliable it becomes. Split-half reliability is analyzed by first summating the total scores for each half of the test for each participant. The pairs of total scores are then correlated. A high correlation coefficient indicates internal consistency of the responses in the test. Since only half of the test is correlated with the other half, a correction formula called the Spearman-Brown (r_tt), which estimates the reliability of the test at its full length, is used. The formula is:

r_tt = 2r / (1 + r)

where
r_tt = Spearman-Brown coefficient
r = correlation coefficient between the two halves

Suppose that a test measuring aggression with 60 items was split into two halves of 30 items each, and the computed r is .93. The Spearman-Brown coefficient would be .96. Observe that the correlation coefficient of .93 increased to .96 when converted into the Spearman-Brown coefficient. Computation of the Spearman-Brown coefficient from the correlation coefficient:

r_tt = 2(.93) / (1 + .93)

r_tt = .96
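The split-half procedure and the Spearman-Brown correction can be scripted in the same way; a minimal sketch, assuming the item responses are arranged in a respondents-by-items array (the function and variable names here are ours, not the book's):

```python
import numpy as np

def split_half_reliability(scores: np.ndarray) -> float:
    """Split a respondents-by-items score matrix into odd- and even-numbered
    items, correlate the two half-test totals, and apply the Spearman-Brown
    correction r_tt = 2r / (1 + r)."""
    odd_total = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even_total = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    r = np.corrcoef(odd_total, even_total)[0, 1]
    return 2 * r / (1 + r)

# The Spearman-Brown correction alone, for the aggression-test example above:
r_half = 0.93
print(round(2 * r_half / (1 + r_half), 2))  # 0.96
```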
Internal Consistency Reliabilities Several techniques can be used to test whether the responses to the items of a test are internally consistent. The Kuder-Richardson, Cronbach's alpha, interitem correlation, and item-total correlation can be used. The Kuder-Richardson (KR #20) is used if the responses in the data are binary. Usually it is used for tests with right or wrong answers, where correct responses are coded as "1" and incorrect responses are coded as "0." The KR #20 formula is:

KR20 = [k / (k − 1)] × [1 − (Σpq / σ²)]

To determine σ² (variance):

σ² = Σx² / (N − 1)

where
k = number of items
p = proportion of students with correct answers
q = proportion of students with incorrect answers
σ² = variance
Σx² = sum of squares (of the deviations of the total scores from their mean)

Suppose that the following data were obtained on a 10-item math test ("1" = correct answer, "0" = incorrect answer) among 10 students:

Student   Item1  Item2  Item3  Item4  Item5  Item6  Item7  Item8  Item9  Item10
A           1      1      1      1      1      1      1      1      1      1
B           1      1      1      1      1      1      1      0      1      1
C           1      1      1      1      1      1      1      0      0      1
D           1      1      1      1      1      1      1      1      0      0
E           1      1      1      1      1      1      1      0      0      0
F           1      1      1      1      1      1      0      0      0      1
G           1      1      1      1      1      1      0      1      0      0
H           1      1      1      0      0      0      1      1      1      0
I           1      1      1      1      0      1      0      0      0      0
J           1      1      1      1      0      0      0      0      0      1
total      10     10     10      9      7      8      6      4      3      5
p           1      1      1     0.9    0.7    0.8    0.6    0.4    0.3    0.5
q           0      0      0     0.1    0.3    0.2    0.4    0.6    0.7    0.5
pq          0      0      0     0.09   0.21   0.16   0.24   0.24   0.21   0.25
Variance Computation:

Total (X)    X − Mean     (X − Mean)²
10             2.8           7.84
9              1.8           3.24
8              0.8           0.64
8              0.8           0.64
7             -0.2           0.04
7             -0.2           0.04
7             -0.2           0.04
6             -1.2           1.44
5             -2.2           4.84
5             -2.2           4.84
Mean = 7.2                  Σ(X − Mean)² = 23.6
Σpq = 1.4                   σ² = 2.62
Get the total score of each examinee (X), then compute the average of the scores of the ten examinees (Mean = 7.2). Subtract the mean from each individual total score (X − Mean), then square each of these differences (X − Mean)². Get the summation of these squared differences; this value is Σ(X − Mean)². In the data given, the value of Σ(X − Mean)² is 23.6 and N = 10. Substitute these values to obtain the variance.

σ² = 23.6 / (10 − 1)

σ² = 2.62

KR20 computation: The variance is now computed (σ² = 2.62); the next step is to obtain the value of Σpq. This is obtained by summating the total correct responses for each item (total). This total is converted into a proportion (p) by dividing it by the total number of cases (N = 10). A total of 10, when divided by 10 (N), gives a proportion of 1. Then, to determine q, which is the proportion incorrect, subtract the proportion correct from 1. If the proportion correct is 0.9, the proportion incorrect is 0.1. Then pq is determined by multiplying p and q. Get the summation of the pq values; this yields Σpq, which has a value of 1.4. Substitute all the values in the KR 20 formula.

KR20 = [10 / (10 − 1)] × [1 − (1.4 / 2.62)]

KR20 = 0.52
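The same computation can be scripted; a minimal sketch in Python, assuming each row of the array is one student and each entry is coded 1 (correct) or 0 (incorrect):

```python
import numpy as np

def kr20(binary_scores: np.ndarray) -> float:
    """Kuder-Richardson formula 20 for a students-by-items matrix of 0/1 scores."""
    k = binary_scores.shape[1]             # number of items
    p = binary_scores.mean(axis=0)         # proportion correct per item
    q = 1 - p                              # proportion incorrect per item
    totals = binary_scores.sum(axis=1)     # total score per student
    variance = totals.var(ddof=1)          # sample variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / variance)

# The 10-item math test from the worked example (students A through J)
math_test = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # A
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],  # B
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 1],  # C
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],  # D
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # E
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 1],  # F
    [1, 1, 1, 1, 1, 1, 0, 1, 0, 0],  # G
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 0],  # H
    [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],  # I
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],  # J
])
print(round(kr20(math_test), 2))  # about 0.52
```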
The internal consistency of the 10-item math test is 0.52, indicating that the responses are not highly consistent with each other. The Cronbach's alpha also determines the internal consistency of the responses to items in the same test. Cronbach's alpha can be used for responses that are not limited to the binary type, such as a five-point scale and other response formats that are expressed numerically. Usually, tests beyond the binary type are affective measures and inventories where there are no right or wrong answers. Suppose that a five-item test measuring attitude towards school assignments was administered to five high school students. Each item in the questionnaire is answered using a 5-point Likert scale (5 = strongly agree, 4 = agree, 3 = not sure, 2 = disagree, 1 = strongly disagree). Below are the five items that measure attitude towards school assignments. Each student selects, on a Likert scale of 1 to 5, how they respond to each of the items. Then their responses are encoded.

Items                                                                                      Strongly agree  Agree  Not sure  Disagree  Strongly disagree
1. I enjoy doing my assignments.                                                                 5            4       3         2             1
2. I believe that assignments help me learn the lesson better.                                   5            4       3         2             1
3. I know that assignments are meant to enhance our skills in school.                            5            4       3         2             1
4. I make it a point to check my notes and books everyday to see if I have assignments.          5            4       3         2             1
5. I make sure that I complete all my assignments everyday.                                      5            4       3         2             1
The next table shows how the Cronbach's alpha is determined given the responses of the five students. In the table, student A answered '5' for item 1, '5' for item 2, '4' for item 3, '4' for item 4, and '1' for item 5. The same goes for students B, C, D, and E. In computing Cronbach's alpha, the variance of the students' total scores (σt²) and the variance computed from the item totals, Σ(σt²), are used instead of Σpq. Obtaining the variance for the respondents' total scores is done as in the Kuder-Richardson procedure: the mean of the total scores is subtracted from each total score, the difference is squared, and the sum of squares (22.8) is divided by n − 1 (5 − 1 = 4). Dividing the sum of squares (22.8) by n − 1 (4) gives the variance (σt² = 5.7). The same procedure is done for obtaining the item variance Σ(σt²): get the sum of all scores per item (summate going down each item column in the table below), then obtain the mean of these item totals (16.2). The mean is subtracted from each item total (Score − Mean), the difference is squared (Score − Mean)², and the sum of squares is obtained, Σ(Score − Mean)² = 38.8. This value is divided by n − 1, giving Σ(σt²) = 9.7. The values obtained can now be substituted in the formula for Cronbach's alpha:

Cronbach's α = [n / (n − 1)] × [(σt² − Σ(σt²)) / σt²]

Cronbach's α = [5 / (5 − 1)] × [(5.7 − 9.7) / 5.7]

Cronbach's α = .88

The table below shows the values obtained in the procedure.

Student            item1   item2   item3   item4   item5   Total per case (X)   Score − Mean   (Score − Mean)²
A                    5       5       4       4       1            19                 2.8             7.84
B                    3       4       3       3       2            15                -1.2             1.44
C                    2       5       3       3       3            16                -0.2             0.04
D                    1       4       2       3       3            13                -3.2            10.24
E                    3       3       4       4       4            18                 1.8             3.24
Total per item      14      21      16      17      13       Mean of case totals = 16.2    Σ(Score − Mean)² = 22.8
Score − Mean       -2.2     4.8    -0.2     0.8    -3.2
(Score − Mean)²     4.84   23.04    0.04    0.64   10.24      Σ(Score − Mean)² = 38.8
Mean of item totals = 16.2
σt² = 22.8 / (5 − 1) = 5.7
Σ(σt²) = 38.8 / (5 − 1) = 9.7

The internal consistency of the responses in the attitude towards school assignments scale is .88, indicating high internal consistency.
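Coefficient alpha can also be computed directly from the response matrix. The sketch below uses the conventional item-variance form, α = [k/(k − 1)] × (1 − Σ of item variances / variance of total scores). Because it works from the variances of the individual item scores rather than the item-total deviations used in the hand-worked illustration above, its result is not meant to reproduce the .88 obtained there:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a respondents-by-items matrix of numeric scores,
    using alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Attitude-towards-assignments responses of students A through E (items 1-5)
attitude = np.array([
    [5, 5, 4, 4, 1],  # A
    [3, 4, 3, 3, 2],  # B
    [2, 5, 3, 3, 3],  # C
    [1, 4, 2, 3, 3],  # D
    [3, 3, 4, 4, 4],  # E
])
print(cronbach_alpha(attitude))
```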
Internal consistency is also determined by correlating every combination of items in a test, which is known as interitem correlation. The responses to the items are internally consistent if they yield high correlation coefficients. To demonstrate the interitem correlation among the responses of the five students in their attitude towards assignments, each set of item scores is correlated with every other using the Pearson r. This means that item 1 is correlated with items 2, 3, 4, and 5; then item 2 is correlated with items 3, 4, and 5; then item 3 with items 4 and 5; and then item 4 with item 5. These combinations produce a correlation matrix:
          Item 1    Item 2    Item 3    Item 4    Item 5
Item 1     1.00      0.24      0.85      0.74     -0.65
Item 2     0.24      1.00     -0.07     -0.22     -0.68
Item 3     0.85     -0.07      1.00      0.87     -0.16
Item 4     0.74     -0.22      0.87      1.00     -0.08
Item 5    -0.65     -0.68     -0.16     -0.08      1.00
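The full interitem matrix can be generated in one step from the same response matrix; a minimal sketch (again assuming NumPy):

```python
import numpy as np

# Attitude-towards-assignments responses (rows = students A-E, columns = items 1-5)
attitude = np.array([
    [5, 5, 4, 4, 1],
    [3, 4, 3, 3, 2],
    [2, 5, 3, 3, 3],
    [1, 4, 2, 3, 3],
    [3, 3, 4, 4, 4],
])

# np.corrcoef treats rows as variables, so transpose to correlate items (columns)
interitem = np.corrcoef(attitude.T)

print(np.round(interitem, 2))  # reproduces the 5 x 5 matrix above to two decimals
```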
Notice that a perfect correlation coefficient is obtained when an item is correlated with itself (1.00). It can also be noted that strong correlation coefficients were obtained between items 1 and 3 and between items 1 and 4, indicating internal consistency. Some pairs had negative correlations, such as items 1 and 5, and items 2 and 5. Interrater Reliability When rating scales are used by judges, the responses can also be tested for consistency. The concordance or consistency of the ratings is estimated by computing Kendall's ω coefficient of concordance. Suppose that the following thesis presentation ratings were obtained from three judges for 5 groups who presented their thesis. The rating scale is from 1 to 4, where 4 is the highest and 1 is the lowest.

Thesis presentation   Rater 1   Rater 2   Rater 3   Sum of Ratings     D        D²
1                        4         4         3           11            2.6      6.76
2                        3         2         3            8           -0.4      0.16
3                        3         4         4           11            2.6      6.76
4                        3         3         2            8           -0.4      0.16
5                        1         1         2            4           -4.4     19.36
                                            Mean of Sum of Ratings = 8.4    ΣD² = 33.2
The concordance among the three raters using Kendall's ω is computed by summating the total ratings for each case (thesis presentation). The mean of the sums of ratings is obtained (8.4). The mean is then subtracted from each Sum of Ratings (D). Each difference is squared (D²), and the sum of squares is computed (ΣD² = 33.2). These values can now be substituted in the Kendall's ω formula. In the formula, m is the number of raters.

W = 12ΣD² / [m²(N)(N² − 1)]

W = 12(33.2) / [3²(5)(5² − 1)]

W = 0.37

A Kendall's ω coefficient of .37 estimates the agreement of the three raters on the 5 thesis presentations. Given this value, there is a moderate concordance among the three raters because the value is not very high.
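A minimal sketch of the same computation, following the formula above, with the ratings arranged as a presentations-by-raters array:

```python
import numpy as np

def kendalls_w(ratings: np.ndarray) -> float:
    """Kendall's coefficient of concordance W = 12*sum(D^2) / (m^2 * N * (N^2 - 1)),
    where m is the number of raters and N the number of cases being rated."""
    n_cases, m_raters = ratings.shape
    sums = ratings.sum(axis=1)      # sum of ratings per case
    d = sums - sums.mean()          # deviation of each sum from the mean sum
    return 12 * (d ** 2).sum() / (m_raters ** 2 * n_cases * (n_cases ** 2 - 1))

# Thesis-presentation ratings from the three judges (rows = presentations 1-5)
ratings = np.array([
    [4, 4, 3],
    [3, 2, 3],
    [3, 4, 4],
    [3, 3, 2],
    [1, 1, 2],
])
print(round(kendalls_w(ratings), 2))  # about 0.37
```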
Summary on Reliability

Test-retest
  Nature: Repeating the identical test on a second occasion
  Measure of: Temporal stability
  Use: When variables are stable, e.g., motor coordination, finger dexterity, aptitude, capacity to learn
  Statistical procedure: Correlate the scores from the first test and the second test; the higher the correlation, the more reliable

Alternate Form / Parallel Form
  Nature: The same person is tested with one form on the first occasion and with another, equivalent form on the second
  Measure of: Equivalence; temporal stability and consistency of response
  Use: Used for personality and mental ability tests
  Statistical procedure: Correlate scores on the first form with scores on the second form

Split-half
  Nature: Two scores are obtained for each person by dividing the test into equivalent halves
  Measure of: Internal consistency; homogeneity of items
  Use: Used for personality and mental ability tests; the test should have many items
  Statistical procedure: Correlate scores of the odd- and even-numbered items; convert the obtained correlation coefficient into a reliability estimate using the Spearman-Brown formula

Kuder-Richardson and Coefficient Alpha
  Nature: The Kuder-Richardson is computed for binary (e.g., true/false) items; both are used to estimate the internal consistency of items
  Measure of: Consistency of responses to all items; homogeneity of items
  Use: Kuder-Richardson is used if there is a correct answer (right or wrong); coefficient alpha is used for tests with multiple scored items
  Statistical procedure: Use the KR #20 or KR #21 formula; use the Cronbach's alpha formula

Inter-item reliability
  Nature: Correlation of all item combinations
  Measure of: Consistency of responses to all items; homogeneity of items
  Use: Used for personality tests with multiple scored items
  Statistical procedure: Each item is correlated with every other item in the test

Scorer Reliability
  Nature: Having a sample of cases independently scored by two raters
  Measure of: Decreasing examiner or scorer variance
  Use: Performance assessments; clinical instruments employed in intensive individual testing, e.g., projective tests
  Statistical procedure: The two scores obtained from the two raters are correlated with each other; Kendall's ω is used to estimate the concordance of raters
Activity 1:

Using test-retest reliability, test whether the typing test is reliable. The following are the scores of 15 participants on a typing test.

First Test    Retest
47            30
45            44
43            40
24            28
35            40
45            46
46            46
34            37
34            35
36            35
43            40
21            25
22            23
23            24
24            20
Activity 2
Administer the "Academic Self-regulation Scale" to at least 30 students, then obtain its internal consistency using split-half, Cronbach's alpha, and interitem correlation.

Self-regulation Scale
Instruction: The following items assess your learning and study strategy use. Read each item carefully and RESPOND USING THE SCALE PROVIDED. Encircle the number that corresponds to your answer. 4: Always 3: Often 2: Rarely 1: Never. Before answering the items, please recall some typical situations of studying which you have experienced. Kindly encircle the number showing how you practice the following items. (Each item is rated 4, 3, 2, or 1.)

MS 1. I make and use flashcards for short answer questions or concepts.
MS 2. I make lists of related information by categories.
MS 3. I rewrite class notes by rearranging the information in my own words.
MS 4. I use graphic organizers to put abstract information into a concrete form.
MS 5. I represent concepts with symbols such as drawings so I can easily remember them.
MS 6. I make a summary of my readings.
MS 7. I make outlines as a guide while I am studying.
MS 8. I summarize every topic we had in class.
MS 9. I visualize words in my mind to recall terms.
MS 10. I recite the answers to questions on the topic that I made up.
MS 11. I record into a tape the lessons/notes.
MS 12. I make sample questions from a topic and answer it.
MS 13. I recite my notes while studying for an exam.
MS 14. I use post-its to remind me of my homework.
MS 15. I make a detailed schedule of my daily activities.
GS 16. I make a timetable of all the activities I have to complete.
GS 17. I plan the things I have to do in a week.
GS 18. I use a planner to keep track on what I am supposed to accomplish.
GS 19. I keep track of everything I have to do in a notebook or on a calendar.
SE 20. If I am having a difficulty I inquire assistance from an expert.
SE 21. I like peer evaluations for every output.
SE 22. I evaluate my accomplishments at the end of each study session.
SE 23. I ask others how my work is before passing it to my professors.
SE 24. I take note of my improvements on what I do.
SE 25. I monitor my improvement in doing certain task.
SE 26. I ask feedback of my performance from someone who is more capable.
SE 27. I listen attentively to people who comment on my work.
SE 28. I am open to feedbacks to improve my work.
SE 29. I browse through my past outputs to see my progress.
SE 30. I ask others what changes should be done with my homework, papers, etc.
SE 31. I am open to changes based from the feedbacks I received.
SA 32. I use internet in making my research papers.
SA 33. I surf the net to find the information that I need.
SA 34. I take my own notes in class.
SA 35. I enjoy group works because we help one another.
SA 36. I call or text a classmate about the home works that I missed.
SA 37. I look for a friend whom I can have an exchange of questions.
SA 38. I study with a partner to compare notes.
SA 39. I explain to my peers what I have learned.
ES 40. I avoid watching the television if I have pending homework.
ES 41. I isolate myself from unnecessary noisy places.
ES 42. I don't want to hear a single sound while I'm studying.
ES 43. I can't study nor do my homework if the room is dark.
ES 44. I switch off my TV for me to concentrate on my studies.
RS 45. I recheck my homework if I have done it correctly before passing.
RS 46. I do things as soon as the teacher gives the task.
RS 47. I am concerned with the deadlines set by the teachers.
RS 48. I picture in my mind how the test will look like based on previous tests.
RS 49. I finish all my homework first before doing unnecessary things.
OR 50. I highlight important concepts and information I found in my readings.
OR 51. I make use of highlighters to highlight the important concepts in my reading.
OR 52. I put my past notebooks, handouts, and the like in a certain shelf.
OR 53. I study at my own pace.
OR 54. I fix my things first before I start to study.
OR 55. I make sure my study area is clean before studying.

MS: Memory strategy  GS: Goal setting  SE: Self-evaluation  SA: Seeking assistance  ES: Environmental structuring  RS: Responsibility  OR: Organizing

Further Analysis
1. Show the Cronbach's alpha for each factor and indicate whether the responses are internally consistent.
2. Split the test into two, then indicate whether the responses are internally consistent.
3. Intercorrelate each item.
Lesson 2 Validity
Validity indicates whether an assessment tool measures what it intends to measure. Validity estimates indicate whether the latent variable shared by the items in a test is in fact the target variable of the test developer. Validity is reflected in a scale or test's ability to predict events, in its relationship with other measures, and in the representativeness of its item content. Content Validity Content validity is the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured. For affective measures, it concerns whether the items are enough to manifest the behavior measured. For cognitive tests, it concerns whether the items cover all the content specified in the instruction. Content validity is more appropriate for cognitive tests like achievement tests and teacher-made tests. In these types of tests, there is a specified domain to be covered by the test. The content covered is found in the instructional objectives in the lesson plan, the syllabus, the table of specifications, and textbooks. Content validation is conducted through consultation with experts. In the process, the objectives of the instruction, the table of specifications, and the items of the test are shown to the consulting experts. The experts check whether the items are enough to cover the content of the instruction provided, whether the items measure the objectives set, and whether the items are appropriate for the cognitive skill intended. The process also involves correcting items that are not appropriately phrased for the level of the examinees who will take the test and checking whether the items are relevant to the subject area tested. Details on constructing a Table of Specifications are explained in the next chapters. Criterion-Prediction Validity Criterion-prediction involves prediction from the test to a criterion situation over a time interval. For example, to assess the predictive validity of an entrance exam, it will later be correlated with the students' grades after a trimester or semester. The criterion in this case would be the students' grades, which will come in the future. Criterion-prediction is used in hiring job applicants, selecting students for admission to college, and assigning military personnel to occupational training programs. For selecting job applicants, pre-employment tests are correlated with supervisor ratings obtained in the future. In assigning military personnel to training, the aptitude test administered before training is correlated with the post-assessment in the training. Positive and high correlation coefficients should be obtained in these cases to adequately say that the test has predictive validity. Generally, the analysis involves correlating the test score with another criterion measure, for example, mechanical aptitude with job performance as a machinist.
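Criterion-prediction validity relies on the same correlation machinery described for reliability; a brief sketch with hypothetical entrance-exam scores and later semester grades (the data and variable names are illustrative only):

```python
import numpy as np

# Hypothetical data: entrance exam scores and the same students' grades one semester later
entrance_exam = np.array([78, 85, 62, 90, 70, 88, 75, 95, 68, 80])
semester_grade = np.array([2.5, 3.0, 2.0, 3.5, 2.25, 3.25, 2.75, 3.75, 2.0, 3.0])

# A high positive correlation between predictor and criterion is evidence of predictive validity
validity_coefficient = np.corrcoef(entrance_exam, semester_grade)[0, 1]
print(f"predictive validity coefficient = {validity_coefficient:.2f}")
```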
Construct Validity Construct validity is the extent to which the test may be said to measure a theoretical construct or trait. This is usually conducted for measures that are multidimensional or contain several factors. The goal of construct validation is to explain and prove that the factors of the measure hold true for the theory used. There are several methods of analyzing the constructs of a measure. One way is to correlate a new test with a similar earlier test that measures approximately the same general behavior. For example, a newly constructed measure of temperament is correlated with an existing measure of temperament. If high correlations are obtained between the two measures, it means that the two tests are measuring the same constructs or traits. Another widely used technique for studying the factor structure of a test is factor analysis, which can be exploratory or confirmatory. Factor analysis is a mathematical technique that identifies sources of variation among the constructs involved. These sources of variation are usually called factors or components (as explained in Chapter 1). Factor analysis reduces the number of variables, detects the structure in the relationships between variables, and classifies variables. A factor is a set of highly intercorrelated variables. In using Principal Components Analysis as a method of factor analysis, the process involves extracting the possible groups that can be formed through the eigenvalues, which measure how much variance each successive factor extracts. The first factor is generally more highly correlated with the variables than the second factor. This is to be expected because the factors are extracted successively and account for less and less of the overall variance. Factor extraction stops when factors begin to yield low eigenvalues. An example of an extraction showing the eigenvalues is illustrated below, from the study by Magno (2008), in which a scale measuring parental closeness was developed with 49 items and four hypothesized factors (bonding, support, communication, interaction).

Figure: Scree plot of the eigenvalues for the 49 items (eigenvalue on the vertical axis, number of eigenvalues on the horizontal axis).
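The eigenvalues displayed in such a plot come from the correlation matrix of the items. A minimal sketch, using a small hypothetical response matrix, of extracting them and counting how many exceed 1.00 (the rule discussed below):

```python
import numpy as np

# Hypothetical respondents-by-items response matrix (random data for illustration only)
rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(200, 10))  # 200 respondents, 10 five-point items

# Eigenvalues of the item correlation matrix, sorted from largest to smallest
corr = np.corrcoef(responses.T)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

print(np.round(eigenvalues, 2))
print("factors with eigenvalue > 1.00:", int((eigenvalues > 1.0).sum()))
```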
The scree plot shows that 13 factors can be used to classify the 49 items. The number of factors is determined by counting the eigenvalues that are greater than 1.00. But having 13 factors is not good because it does not further reduce the variables. One technique in the scree test is to assess the place where the smooth decrease of eigenvalues appears to level off to the right of the plot. To the right of this point, presumably, one finds only "factorial scree" - "scree" is the geological term referring to the debris which collects on the lower part of a rocky slope. Applying this technique, the fourth eigenvalue is where the smooth decrease in the graph begins. Therefore, four factors can be considered for the test. The items that belong under each factor are determined by assessing the factor loadings of each item. In the process, each item loads on each factor extracted. An item that loads highly on a factor will technically belong to that factor because it is highly correlated with the other items in that factor or group. A factor loading of .30 means that the item contributes meaningfully to the factor; a factor loading of .40 means the item contributes highly to the factor. An example of a table with factor loadings is illustrated below.
          Factor 1   Factor 2   Factor 3   Factor 4
item1      0.032      0.196      0.172      0.696
item2      0.130      0.094      0.315      0.375
item3      0.129      0.789      0.175      0.068
item4      0.373      0.352      0.350      0.042
item5      0.621     -0.042      0.251      0.249
item6      0.216     -0.059      0.067      0.782
item7      0.093      0.288      0.307      0.477
item8      0.111      0.764      0.113      0.085
item9      0.228      0.315      0.144      0.321
item10     0.543      0.113      0.306     -0.010
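The classification rule described above (assign each item to the factor on which it loads most highly, treating loadings of .40 and above as meaningful) can be expressed compactly; a sketch using the loadings in the table:

```python
import numpy as np

# Factor loadings of items 1-10 on the four extracted factors (from the table above)
loadings = np.array([
    [0.032, 0.196, 0.172, 0.696],   # item 1
    [0.130, 0.094, 0.315, 0.375],   # item 2
    [0.129, 0.789, 0.175, 0.068],   # item 3
    [0.373, 0.352, 0.350, 0.042],   # item 4
    [0.621, -0.042, 0.251, 0.249],  # item 5
    [0.216, -0.059, 0.067, 0.782],  # item 6
    [0.093, 0.288, 0.307, 0.477],   # item 7
    [0.111, 0.764, 0.113, 0.085],   # item 8
    [0.228, 0.315, 0.144, 0.321],   # item 9
    [0.543, 0.113, 0.306, -0.010],  # item 10
])

for i, item_loadings in enumerate(loadings, start=1):
    best_factor = int(np.argmax(item_loadings)) + 1
    highest = item_loadings.max()
    if highest >= 0.40:
        print(f"item {i}: factor {best_factor} (loading {highest:.3f})")
    else:
        print(f"item {i}: no loading reaches .40")
```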
In the table above, items that load highly on a factor have a loading of .40 and above. For example, item 1 loaded highly on factor 4, with a factor loading of .696, as compared with its loadings of .032, .196, and .172 on factors 1, 2, and 3, respectively. This means that item 1 will be classified under factor 4, together with items 6 and 7, because they all load highly on the fourth factor. Factor loadings are best assessed when the items are rotated (consult scaling theory references for details on factor rotation). Another way of proving the factor structure of a construct is through Confirmatory Factor Analysis (CFA). In this technique, there is a developed and specific hypothesis about the factorial structure of a battery of attributes. The hypothesis concerns the number of common factors, their pattern of intercorrelation, and the pattern of common factor weights. It is used to indicate how well a set of data fits the hypothesized structure. The CFA is done as a follow-up to a standard factor analysis. In the analysis, the parameters of the model are estimated, and the goodness of fit of the solution to the data is evaluated. For example, the study of Magno (2008) confirmed the factor structure of parental closeness (bonding, support, communication, succorance) after a series of principal components analyses. The parameter estimates and the goodness of fit of the measurement model were then analyzed.
Figure 1 Measurement Model of Parental Closeness using Confirmatory Factor Analysis
The model estimates in the CFA show that all the factors of parental closeness have significant parameters (8.69*, 5.08*, 5.04*, 1.04*). The delta errors (28.83*, 18.02*, 18.08*, 2.58*) are likewise significant, and each factor has a significant estimate as well. Having a good fit indicates that all factor structures are significant for the construct parental closeness. The goodness of fit using chi-square is a rather good fit (χ² = 50.11, df = 2). The goodness of fit based on the root mean square standardized residual (RMS = 0.072) shows that there is little error, the value being close to .01. Using noncentrality fit indices, the values show that the four-factor solution has a good fit for parental closeness (McDonald Noncentrality Index = 0.910; Population Gamma Index = 0.914). Confirmatory Factor Analysis can also be used to assess the best factor structure of a construct. For example, in the study of Magno, Tangco, and Sy (2007), the factor structure of metacognition (awareness of one's learning) was assessed in terms of its effect on critical thinking (measured by the Watson-Glaser Critical Thinking Appraisal). Two factor structures of metacognition were assessed. The first model of metacognition includes two factors, regulation of cognition and knowledge of cognition (see Schraw & Dennison). The second model tested metacognition with eight factors: declarative knowledge, procedural knowledge, conditional knowledge, planning, information management, monitoring, debugging strategy, and evaluation of learning.
Model 1. Two Factors of Metacognition
Model 2: Eight Factors of Metacognition
The results of the analysis using CFA showed that Model 1 has a better fit than Model 2. This indicates that metacognition is better viewed with two factors (knowledge of cognition and regulation of cognition) than with eight factors. The Principal Components Analysis and Confirmatory Factor Analysis can be conducted using available statistical software such as Statistica and SPSS. Convergent and Divergent Validity According to Anastasi and Urbina (2002), the method of convergent and divergent validity is used to show that a test correlates with variables with which it should theoretically correlate (convergent) and does not correlate with variables from which it should differ (divergent). In convergent validity, constructs that are intercorrelated should be high and positive, as explained in the theory. For example, in the study of Magno (2008) on parental closeness, when the factors of parental closeness were intercorrelated (bonding, support, communication, and succorance), positive magnitudes were obtained, indicating convergence of these constructs.

Factors of Parental Closeness     (1)       (2)       (3)       (4)
(1) Bonding                       1.00
(2) Communication                 0.70**    1.00
(3) Support                       0.62**    0.57**    1.00
(4) Succorance                    0.44**    0.28**    0.59**    1.00
**p<.05

For divergent validity, a construct should inversely correlate with its opposite factors. For example, the study by Magno, Lynn, Lee, and Kho (in press) constructed a scale that measures mothers' involvement with their grade school and high school children. The factors of mothers' involvement in school-related activities were intercorrelated. Observe that these factors belong to the same test, but controlling was negatively related with permissive, and loving was negatively related with autonomy. This indicates divergence of the factors within the same measure.

Factors of Mother's Involvement   Controlling   Permissive   Loving    Autonomy
Controlling                           --
Permissive                          -0.05           --
Loving                               0.05          0.17*        --
Autonomy                             0.14*         0.41*      -0.36*       --
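Convergent and divergent evidence of this kind reduces to intercorrelating subscale totals; a minimal sketch with hypothetical subscale scores (the subscale names are placeholders, not the actual scales discussed above):

```python
import numpy as np

# Hypothetical subscale totals for 8 respondents: two theoretically related
# subscales and one theoretically opposite subscale
subscale_a = np.array([20, 25, 18, 30, 22, 27, 19, 24])
subscale_b = np.array([21, 24, 17, 29, 23, 26, 18, 25])
subscale_c = np.array([15, 10, 18,  8, 14,  9, 17, 11])

# Intercorrelate the subscales (rows of the stacked array are treated as variables)
matrix = np.corrcoef(np.vstack([subscale_a, subscale_b, subscale_c]))
print(np.round(matrix, 2))

# Convergent evidence: a high positive r between the two related subscales.
# Divergent evidence: a negative (or near-zero) r with the opposite subscale.
```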
Summary on Validity

Content Validity
  Nature: Systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured
  Use: More appropriate for achievement tests and teacher-made tests
  (Statistical) Procedure: Items are based on instructional objectives, course syllabi, and textbooks; consultation with experts; making test specifications

Criterion-Prediction Validity
  Nature: Prediction from the test to any criterion situation over a time interval
  Use: Hiring job applicants, selecting students for admission to college, assigning military personnel to occupational training programs
  (Statistical) Procedure: Test scores are correlated with other criterion measures, e.g., mechanical aptitude and job performance as a machinist

Construct Validity
  Nature: The extent to which the test may be said to measure a theoretical construct or trait
  Use: Used for personality tests; measures that are multidimensional
  (Statistical) Procedure: Correlate a new test with a similar earlier test that measures approximately the same general behavior; factor analysis; comparison of the upper and lower groups; point-biserial correlation (pass/fail with total test score); correlate each subtest with the entire test

Convergent Validity
  Nature: The test should correlate significantly with variables to which it is related
  Use: Commonly for personality measures
  (Statistical) Procedure: Multitrait-multimethod matrix

Divergent Validity
  Nature: The test should not correlate significantly with variables from which it should differ
  Use: Commonly for personality measures
  (Statistical) Procedure: Multitrait-multimethod matrix
EMPIRICAL REPORT The Development of the Self-disclosure Scale Carlo Magno Sherwin Cuason Christine Figueroa De La Salle University-Manila Abstract The purpose of the present study is to develop a measure for self-disclosure. The items were based on a survey administered to 83 college students. From the survey 114 items were constructed under 9 hypothesized factors. The items were reviewed by experts. The main try out form of the test was composed of 112 items administered to 100 high school and college students. The data analysis showed that the test has a Cronbach’s alpha of .91. The factor loadings retained 60 items with high summated correlations under five factors. The new factors are beliefs, relationships, personal matters, interests, and intimate feelings.
Each person has a complex personality system. Individuals are oftentimes very much interested in knowing their personality type, attitudes, interests, aptitude, achievement, and intelligence. This is the reason for developing a psychological test that would help assess their standing. The test we have developed aims to measure the self-disclosing frequency of individuals in different areas. This will help them know what areas in their life they are willing to let other people know about. This would be a good instrument for counselors to use in the assessment of their clients. The results of the client's test would help the counselor adjust his or her skills in eliciting disclosure in more or other areas or topics. Self-disclosure is a very important aspect of the counseling process, because self-disclosure is one of the instruments the counselor can use. The consequence of clients not disclosing themselves is their inability to respond to their problem and to the counselor. This is what the researchers took into consideration in developing the test. It could also be used outside the counseling process. An individual may want to take it to find out what
areas in his or her life have been easy to share and what areas need more disclosure. It has always been psychologists' concern to explain what is going on inside a particular individual in relation to his or her entire personality system. One important component of looking into the intrinsic phenomena of human behavior is self-disclosure. Self-disclosure, as defined by Sidney Jourard (1958), is the process of making the self known to another person; "target persons" are persons to whom information about the self is communicated. In the process of self-disclosure we make ourselves manifest in thinking and feeling through our actions - actions expressed verbally (Chelune, Skiffington, & Williams, 1981). In addition, Hartley (1993) stressed the importance of interpersonal communication in disclosing the self. Hartley (1993) defined self-disclosure as the means of opening up about oneself with other people. Moreover, Norrel (1989) defined self-disclosure as the process by which persons make themselves known to each other, which occurs when an individual communicates genuine thoughts and feelings. Generally, self-disclosure is the process in which a person is willing to share or open oneself to another person or group whom the individual can trust, and the process is done verbally. The factors identified in self-disclosure, which are potent areas of content in communicating superficial or intimate topics, are (1) personal matters, (2) thoughts and ideas, (3) religion, (4) work, study, and accomplishments, (5) sex, (6) interpersonal relationships, (7) emotional state, (8) tastes, and (9) problems. The process of self-disclosure occurs during interaction with others (Chelune, Skiffington, & Williams, 1981). In the studies that Jourard (1961; 1969) conducted, he stated that a person will permit himself to be known when "he believes his audience is a man of goodwill." There should be a guarantee of privacy that the
information disclosed will not escape the circle. Jourard (1971) noted that persons need to self-disclose to get in touch with their real selves, to have intimate relationships with people, to bond with others, to pursue the truth of one's being, and to direct their destiny on the basis of knowledge. Jourard agrees with Buber (1965) that in a humanistic sense of self-disclosure "we see the index of man functioning at his highest and truly human level rather than at the level of a thing or an animal." The consequences that follow self-disclosure are manifested in its outcomes (Jourard, 1971). The outcomes are: (1) We learn the extent to which we are similar to one another and the extent to which we differ from one another in thoughts, feelings, hopes, and reactions to the past. (2) We learn of the other man's needs, enabling us to help him or to ensure that his needs will not be met. (3) We learn the extent to which a man accords with or deviates from moral and ethical standards. In a survey that the researchers conducted, a person after disclosing feels better (42.2%), happy (8.26%), free (5.51%), fine (4.6%), relaxed (3.67%), peaceful (3.67%), okay (3.67%), lighter (2.75%), calm (2.75%), great (1.83%), satisfied (1.83%), nothing (6.42%), and others (12.88%). Furthermore, it was reported that on being transparent or open, individuals feel relieved that a burden has been taken off their shoulders; they experience peace of mind and consequently happiness, contact with their real selves, and a better ability to direct their destiny on the basis of knowledge (Jourard, 1971; Maningas, 1993). Cozby (1973) noted that self-disclosure as an ongoing behavioral process includes five basic parameters: the amount of personal information disclosed; the intimacy of the information disclosed; the rate or duration of disclosure; the affective manner of presentation; and disclosing flexibility, the appropriate cross-situational modulation of disclosure. Cozby (1973) further stated that the interrelatedness of these parameters
means that they are used interchangeably.

Areas of Self-disclosure
In terms of the information disclosed, the researchers arrived at nine hypothesized factors based on a survey study conducted. These factors are: interpersonal relationships, thoughts and ideas, work/study/accomplishments, sex, religion, personal characteristics, emotional state, tastes, and problems. The factors reflect the subjects' disposition as students, in which there are influences of the social situation of schooling and social life.

Interpersonal Relationship. Interpersonal relationship is operationally defined as the range of relationships or bonding formed within and outside the family, which includes peers, friends, and casual acquaintances. Jourard (1971) proposed that the disclosure of relatively intimate information indicates movement towards greater intimacy in interpersonal relationships. In support, it has been indicated that self-disclosure illuminates the process of developing relationships (Hill & Stull, 1981; Altman & Taylor, 1973). In terms of gender, it has consistently been shown that women disclose themselves to their same gender to a greater extent than men do; females have generally been reported to be more disclosing than males (Jourard, 1971; Chelune et al., 1981; Taylor et al., 1981). Some studies indicate that individuals who are more willing to disclose personal information about themselves also disclose more to high-disclosing rather than low-disclosing others (Jourard, 1959; Jourard & Landsman, 1960; Richman, 1963; Altman & Taylor, 1973). It has been reported that self-disclosure is significantly and positively related to friendship, and this relationship is greatest with respect to intimate rather than superficial topics (Rubin & Levy, 1975; Newcomb, 1961; Priest & Sawyer, 1967). Rubin and Shenker (1975) adapted a self-disclosure questionnaire of Jourard and Taylor (1971), from which they came up with four
new clusters: interpersonal relationships, attitudes, sex, and tastes. These clusters contain items on sensitive information one withholds. The self-disclosure reports are moderately reliable (.62 to .72 for men and .51 to .78 for women). In marital relationships, it was found that marital satisfaction is related to the discrepancy in the partners' affective self-disclosure (Levinger & Senn, 1967; Jorgensen, 1980). In the parent-child relationship, it was reported that there are no differences in the content of the self-disclosure of Filipino adolescents with their mother and their father (Cruz, Custodio, & Del Fierro, 1996). The study also indicated that birth order is highly relevant in analyzing the content of self-disclosure. The results of the study also show that children are more disclosing toward the mother because she empathizes.

Sex. One of the most intimate topics as content in self-disclosure is sex. It is usually embarrassing and hard to open up about to others because some people have acquired the faulty learning that it is evil, lustful, and dirty (Coleman, Butcher, & Carson, 1980). But mature individuals view human sexuality as a way of being in the world of men and women whose moments of life and every aspect of living are spent experiencing being with the entire world in a distinctly male or female way (Maningas, 1995). Furthermore, sexuality is part of our natural power or capacity to relate to others; it gives the necessary qualities of sensitivity, warmth, mutual respect, and openness in our interpersonal relationships (Maningas, 1995). Sexuality, as part of our relationships, needs to be opened up or expressed, as Freud noted in the desires of the instincts or the id. Maningas (1995) stressed that sex is an integral part of our personal self-expression and our mission of self-communication to others. Some findings by Jourard (1964) on subject matter differences noted that details about one's sex life are not as readily disclosed as other factors. Jourard (1964) also noted that anyone who is reluctant to be known by another
person, and to know another person, sexually and cognitively, will find the prospect terrifying. Sex is included as a factor in self-disclosure because most closely knit adolescents give a focal view on sex. The survey study that was conducted shows that 5.26% of males and 3.44% of females disclose themselves regarding sexual matters.

Personal matters about the self. Personal matters consist of private truths about oneself; they may be favorable or unfavorable evaluative reactions toward something or someone, exhibited in one's beliefs, feelings, or intended behavior. In an experiment conducted by Taylor, Gould, and Brounstein (1981), it was found that the level of intimacy of the disclosure was determined by (1) dispositional characteristics, (2) characteristics of the subjects, and (3) the situation. Their personalistic hypothesis was confirmed: the level of disclosure affects the level of intimacy. Some studies also show that some individuals are more willing to disclose personal information about themselves to high-disclosing rather than low-disclosing others (Jourard, 1959; Jourard & Landsman, 1960; Jourard & Richman, 1963; Altman & Taylor, 1973). Furthermore, Jones and Archer (1976) showed directly that a recipient's attraction toward a discloser is mediated by the personalistic attribution the recipient makes for the discloser's level of intimacy. Kelly and McKillop (1996), in their article, stated that "choosing to reveal personal secrets is a complex decision that could have distorting consequences, such as being rejected and alienated from the listener." But Jourard (1971) noted that healthy behavior feels "right" and should produce growth and integrity. Thus, disclosing personal matters about oneself is a means of being honest and seeking others to understand you better.

Emotional State. One of the factors of self-disclosure, defined as one's revelation of
emotions or feelings to other people. A retrospective study was conducted to determine what students did to make their developing romantic relationships known to social network members and what they did to keep their relationships from becoming known. The study showed that the most frequent reasons for revelation were a felt obligation to reveal based on the relationship with the target, the desire for emotional expression, and the desire for psychological support from the target. The most frequent reason to withhold information was the anticipation of a negative reaction from the target (Baxter, 1993). The researchers felt that the determination of the probability of self-disclosure would be much better if emotional state is considered as a factor. The study of emotions, disclosure, and health addresses some of the basic issues of psychology and psychotherapy: how people respond to emotional upheavals, why they respond the way they do, and why translating emotional events into language increases physical and mental health (Pennebaker, 1995).

Taste. Taste is defined as the likes and dislikes of a person opened up to other people. In a study by Rubin and Shenker (1975), the friendship, proximity, and self-disclosure of college students were studied in the context of being roommates or hallmates. The items were categorized into four clusters, in what was thought to be ascending order of intimacy: tastes, attitudes, interpersonal relationships, and self-concept and sex. This would help determine whether people are willing to share superficial information right away as well as intimate information.

Thoughts. Thoughts are defined as the things in mind that one is willing to share with other people. "A friend," Emerson wrote, "is a person with whom I may be sincere. Before him I may think aloud." A large number of studies have documented the link between friendship and the disclosure of personal thoughts and feelings that Emerson's statement implies (Rubin & Shenker, 1975). Another study presents a self-
psychological rationale for the selected use of therapist self-disclosure, the conscious sharing of thoughts, feelings, attitudes, or experiences with a patient (Goldstein, 1994).

Religion. Religion in self-disclosure is operationally defined as the ability of an individual to share his experiences, thoughts, and emotions concerning his beliefs about God. Healey (1990) offers an overview of the role of self-disclosure in the Judeo-Christian religious experience, with emphasis on the process of spiritual direction. In his study, Kroger (1994) shows the Catholic confession as the embodiment of common sense regarding the social management of personal secrets, of the sins committed, and considers confession as a model for understanding the problem of the social transmission of personal secrets in everyday life. Religion is very important and considered a factor in self-disclosure because Filipino people are very religious, and studies show that religious people disclose more (Kroger, 1994).

Problems. When a person is depressed, he tends to find others who will listen and with whom he can share the problem. To release the tension a person feels, he usually discloses it. Clarity about a problem is attained when a person starts to verbalize it, and in the process a solution can be reached. In the study of Rimé (1995), it was revealed that after major negative life events and traumatic emotional episodes, ordinary emotions, too, are commonly accompanied by intrusive memories and the need to talk about the episode. The study also considered the hypothesis that such mental rumination and social sharing represent spontaneously initiated ways of processing emotional information.

Work/Study. Work or study is defined as the person's present duty or responsibility which is expected of him and needs to be fulfilled in a given time. It is considered a factor in self-disclosure because it gives a glimpse of how openly a person can share his joy and burden
in his current responsibility. In the study of Starr (1975), it was hypothesized that self-disclosure is causally related to psychological and physical well-being, with low disclosure related to maladjustment and high disclosure associated with mental health.

Table 1
Hypothesized Factors of Self-disclosure

Factor | Definition
Emotional state | One's revelation of emotions or feelings to other people; feelings and attitudes toward a situation being revealed to others.
Interpersonal relationship | Indicates movement towards greater intimacy in interpersonal relationships; the range of relationships or bonding formed within and outside the family.
Personal matters | Private truths about oneself, favorable or unfavorable, toward something or someone, exhibited in one's beliefs, feelings, or intended behavior; being honest and seeking others to know you better by disclosing.
Problems | A depressing event or situation that can be lightened through disclosing; a conflict or disagreement experienced by an individual.
Religion | The ability of an individual to share his experiences, thoughts, and emotions concerning his feelings about God; one's concept, perception, and view of religion which one is able to share or tackle in front of others.
Sex | A way of being in the world of men and women whose moments of life are spent experiencing being with the entire world in a distinctly male or female way; the willingness of a person to discuss his sexual experiences, needs, and views.
Taste | The likes and dislikes of a person opened up to other people; views, feelings, and appreciation of a person, place, or thing.
Thoughts | Information in mind that one is willing to share with other people; perceptions regarding a thing or situation which are shared with others.
Work/study/accomplishment | A person's present duty which is expected of him; a responsibility expected by others to be fulfilled in a particular time.
Method

Search for the Content Domain
In the search for content domains, a survey was constructed and answered by 55 female students aged 16 to 22 years old. The respondents were students from the CLA, COE, COS, and CBE of DLSU. The survey questionnaire aimed to gather data about the self-disclosing activities of the students. The survey questionnaire asked about the person to whom one usually discloses, the topics disclosed, the situations in which one discloses, how one discloses, one's characteristics while disclosing, and a rating of one's own self-disclosing habits. The self-disclosure questionnaire by Sidney Jourard and Rubin and Shenker's intimacy of self-disclosure scale were reviewed with respect to how their items and factors were developed.

Item Writing and Review
Based on the survey, 114 items under nine factors were constructed, and the verbal frequency scale was used. The items were reviewed by two psychology professors and one psychometrician from De La Salle University. Some items were deleted, some were revised, and some were added. After the review, the pre-tryout form was constructed.

Development of the Pretest Form
The pretest form consists of 114 items with nine factors. The factors were sex (5 items), problems (21 items), interpersonal relationship (17 items), accomplishments/work/study (14 items), religion (6 items), tastes (8 items), thoughts (9 items), and personal matters (20 items). The scaling used is the verbal frequency scale (Always, Often, Sometimes, Rarely, Never).

Pretryout Form
For the pretryout form, 10 forms were prepared to be answered by 10 conveniently selected respondents, and feedback was then given on vague and inapplicable items, along with other comments. There were 10 psychology majors who answered the pre-tryout form (6 females and 4 males).
The pretryout form consists of 110 items, still with nine factors. There were six negative items (items 7, 30, 97, 106, 107, and 109), and the rest were positive items. The verbal frequency scale was used because the test is a measure of a habit. The items were arranged in random order, and responses are made by checking the corresponding scale. The purpose of the pretryout form was to try out the items on 10 subjects and to ask for comments for further revision.
Development of the Main Tryout Form
The comments made on the pretryout form were considered and the main tryout form was developed. The main tryout form consists of 112 items. The test was intended for adolescents because the items were empirically based on adolescent subjects and reflect their usual activities. There were six negative items. The scaling used was the verbal frequency scale. The items were arranged in random order, and the task of the respondent is to check the corresponding scale beside each item. There were 100 respondents who answered the test. Some of the respondents were fourth year high school students of St. Augustine School in Cavite, with ages ranging from 15 to 16 (48 males and females); the rest of the participants were college students from De La Salle University. The sampling design was purposive, with the selection criterion that respondents should belong to the fourth year level in high school or be in college in private schools. During the administration of the test, the researchers explained the purpose of the test to the students and they all agreed to answer. It took the respondents 20 minutes to answer the test. The researchers then reviewed the data after collection; each test was scored and encoded in the computer.
Plan for Developing the Norms
A norm will be used to interpret the scores. The test is scored based on the corresponding answer to each item, and a score is obtained for each factor. The raw score will have an equivalent percentile based on a norm, and a corresponding percentile will have a remark.
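A minimal sketch of this norming plan is given below; the norm-group scores, the percentile computation, and the remark cut-offs are all hypothetical placeholders used only to illustrate the raw score to percentile to remark sequence, not the test's actual norms.

import numpy as np

# Hypothetical norm-group raw scores for one factor (placeholder values).
norm_scores = np.array([28, 31, 35, 36, 40, 42, 45, 47, 50, 52])

def percentile_rank(raw, norms):
    # Percent of norm-group scores at or below the obtained raw score.
    return 100.0 * np.mean(norms <= raw)

def remark(pct):
    # Hypothetical bands for low, average, and high frequency of self-disclosure.
    if pct <= 25:
        return "low frequency"
    if pct <= 75:
        return "average frequency"
    return "high frequency"

raw = 41
pct = percentile_rank(raw, norm_scores)
print(raw, pct, remark(pct))   # 41 -> 50.0 -> average frequency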
Item Analysis and Factor Analysis
The 112 items were intercorrelated and the factors were extracted using the SPSS computer software. A matrix was made between the factors, and the reliability was obtained using Cronbach's alpha. The items were grouped using Principal Components Analysis.
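The analyses named above can be illustrated with a short script. This is only a sketch using randomly generated placeholder responses, not the authors' code; it shows Cronbach's alpha, corrected item-total correlations, and unrotated principal-component loadings, together with the .30 item-total and .40 loading cut-offs applied later in the report.

import numpy as np

rng = np.random.default_rng(0)
# Placeholder response matrix: 100 respondents x 10 items on a 1-5 verbal frequency scale.
X = rng.integers(1, 6, size=(100, 10)).astype(float)

def cronbach_alpha(data):
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    k = data.shape[1]
    item_var = data.var(axis=0, ddof=1).sum()
    total_var = data.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

def corrected_item_total(data):
    # Correlate each item with the total score computed without that item.
    total = data.sum(axis=1)
    return np.array([np.corrcoef(data[:, j], total - data[:, j])[0, 1]
                     for j in range(data.shape[1])])

def pca_loadings(data, n_components=2):
    # Unrotated principal-component loadings from the inter-item correlation matrix.
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(eigvals[order])

print("alpha =", round(cronbach_alpha(X), 3))
r_it = corrected_item_total(X)
loads = pca_loadings(X)
print("items meeting the .30 item-total rule:", np.where(r_it >= .30)[0])
print("largest absolute loading per item:", np.abs(loads).max(axis=1).round(2))  # compare to .40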
Development of the Final Form
In the final form, 60 items were accepted and 62 items were deleted in the item analysis due to low factor loadings (below .40). There were five factors extracted in the Principal Components Analysis: beliefs, relationships, personal matters, interests, and intimate feelings.
Test Plan
In administering the test, there is no time limit for answering. The respondents, or persons taking the test, are instructed to shade their corresponding answers on the answer sheet. There are no right or wrong answers in the test, so respondents should answer as honestly as possible. In scoring, the answer Always is equivalent to 5 points, Often to 4, Sometimes to 3, Rarely to 2, and Never to 1. All the items are positive because all the negative items were removed during the item analysis due to low factor loadings. The scores on the items are summed, and there is an equivalent percentile for a particular score. In the interpretation, the obtained percentile will have a remark of high frequency, average frequency, or low frequency. A low-disclosing individual would mean that the particular person never or rarely opens up himself or herself toward others in the particular area. An average self-disclosing individual would mean that the particular person has opened up in general terms about a particular matter
only when necessary and to selected others in a particular area. A high self-disclosing individual would mean that the person has opened up and shared himself or herself fully and in complete detail to others in the particular area; the individual will have the tendency to let himself or herself be known in all dimensions of his or her being.

Results
The corrected item-total correlations of the accepted items are above .30. The item-total correlations of the accepted items range from .3009 to .4866, while the item-total correlations of the deleted items range from -.0123 to .2980. The coefficient alpha reliability is .9134, and the standardized item alpha is .9166. A correlation matrix was made for the 112 items; the mean inter-item correlation is .339, the variance is 1821.3782, and the standard deviation is 42.6776. The highest intercorrelation between items is .6543, which occurred between item 51 and item 74. In the factor analysis, the nine hypothesized factors were extracted into 18 factors with eigenvalues of 1.07878 and above. The researchers considered factors accounting for at least 4% of the variance, which yields 5 factors. Table 2 shows the accepted items with their factor loadings.
Table 2
Accepted Items with Their Factor Loadings

Item numbers: 33, 70, 8, 3, 20, 98, 77, 52, 59, 101, 18, 88, 93, 95, 65, 94, 53, 75, 68, 66, 76, 96, 99, 11, 111, 82, 83, 17, 10, 104, 100, 56, 60, 62, 69, 27, 34, 39, 78, 43, 1, 28, 26, 35, 32, 72, 6, 73, 10, 100, 104, 17, 56, 60, 62, 69, 82, 83

Factor 1 loadings: .68766, .64815, .61846, .59245, .55228, .53677, .45061, .45001, .40504, .38157, .32574
Factor 2 loadings: .64293, .72024, .6780, .59372, .54047, .51697, .50285, .44658, .41453, .41102, .38957, .36690, .36482
Factor 3 loadings: .29875, .77164, .69717, .59079, .54027, .45128, .44697, .42587, .39554, .39486, .32917, .63290, .61822, .58744
Factor 4 loadings: .54582, .49976, .49312, .43613, .43205, .42141, .41807, .41475, .35834, .32098
Factor 5 loadings: .54207, .446979, .54207, .59079, .42587, .39554, .38486, .32917, .77164, .69717
Table 3
Factor Transformation Matrix

         | Factor 1 | Factor 2 | Factor 3 | Factor 4 | Factor 5
Factor 1 |   .48    |  -.56    |  -.25    |  -.001   |  -.61
Factor 2 |   .45    |   .43    |  -.56    |  -.49    |   .19
Factor 3 |   .45    |   .55    |   .57    |   .09    |  -.39
Factor 4 |   .4     |  -.42    |   .49    |  -.34    |   .52
Factor 5 |   .41    |   .02    |  -.21    |   .79    |   .39
The five new factors were given new names because their contents were different. Factor 1 was labeled Beliefs with 11 items, Factor 2 was labeled Relationships with 13 items, Factor 3 was labeled Personal Matters with 13 items, Factor 4 was labeled Intimate Feelings with 13 items, and Factor 5 was labeled Interests with 10 items.

Table 4
New Table of Specification

Factors | Number of items | Item numbers | Reliability
Factor 1: Beliefs | 11 | 8, 101, 18, 20, 33, 52, 59, 70, 77, 98, 3 | .8031
Factor 2: Relationships | 13 | 105, 15, 21, 24, 31, 41, 48, 55, 61, 63, 79, 84, 88 | .7696
Factor 3: Personal Matters | 13 | 11, 111, 53, 65, 66, 68, 75, 76, 93, 94, 95, 96, 99 | .7962
Factor 4: Intimate Feelings | 13 | 1, 6, 26, 27, 28, 32, 34, 35, 39, 43, 72, 73, 78 | .7922
Factor 5: Interests | 10 | 10, 100, 104, 17, 56, 60, 62, 69, 82, 83 | .7979
Total | 60
Discussion
At first there were nine hypothesized factors based on a survey; 18 factors were then extracted with eigenvalues greater than 1.00, and finally five factors with acceptable factor loadings were retained. The five factors were given new labels because the items were rotated differently based on the data from the main tryout. Factor 1 contains items about beliefs on religion and ideas on particular topics, and it was labeled as such. Factor 2 contains items reflecting relationships with friends, and it was labeled "relationships." Factor 3 contains items about a person's secrets and attitudes; most of its items concern personal matters, and it was labeled as such. Factor 4 is a cluster of tastes and perceptions, so it was labeled "interests." Factor 5 contains feelings about oneself, problems, love, success, and frustrations, so it was labeled "intimate feelings." The factors were reliable, with alpha coefficients of .8031, .7696, .7962, .7922, and .7979, which shows that each factor is consistent with the researchers' intended purpose. In the factor analysis results, the number of items was not equal across factors: factor 1 has 11 items, factor 2 has 13 items, factor 3 has 13 items, factor 4 has 10 items, and factor 5 has 13 items. The five factors account for the areas in which a particular individual self-discloses. The nine hypothesized factors were not supported; new factors emerged after the factor analysis, the items were reclassified into new factors, and each factor was given a new name. Only five factors were accepted following the criterion of at least four percent of explained variance. These factors are beliefs, relationships, interests, personal matters, and intimate feelings. The test developed was intended to measure the degree of self-disclosure of individuals, but it was refocused to measure the self-disclosure each person makes in each of the different areas or factors. In terms of the test's psychometric properties, it has gone through the stages of item review by experts and factor analysis, and it has an internal
consistency of .9134, which is high. Considering that the test has just undergone its initial stages, a further validation study is recommended to provide more detailed properties of the test. Norms and interpretation for the test are not yet fully established; the test needs to be administered to a larger sample, and an intensive study should be made with a considerable and appropriate number of respondents. In terms of sampling, a probabilistic technique is suggested to allow further generalization, because the current study used only purposive, nonprobabilistic sampling.
References

Altman, I., & Taylor, D. A. (1973). Social penetration: The development of interpersonal relationships. New York: Holt, Rinehart & Winston.

Baxter, D. E. (1993). Empathy: Its role in nursing burnout. Dissertation Abstracts International, 53, 4026.

Chelune, G. J., Skiffington, S., & Williams, C. (1981). Multidimensional analysis of observers' perceptions of self-disclosing behavior. Journal of Personality and Social Psychology, 41(3), 599-606.

Coleman, C., Butcher, A., & Carson, C. (1980). Abnormal psychology and modern life (6th ed.). New York: JMC.

Cozby, P. C. (1973). Self-disclosure: A literature review. Psychological Bulletin, 79(2), 73-91.

Goldstein, J. H. (1994). Toys, play, and child development. New York: Cambridge University Press.

Hartley, P. (1993). Interpersonal communication. Florence, KY: Taylor & Francis/Routledge.

Healey, B. J. (1990). Self-disclosure in religious spiritual direction: Antecedents and parallels to self-disclosure in psychotherapy. In G. Stricker & M. Fisher (Eds.), Self-disclosure in the therapeutic relationship (pp. 17-27). New York: Plenum Press.

Hill, C. T., & Stull, D. E. (1981). Sex differences in effects of social and value similarity in same-sex friendship. Journal of Personality and Social Psychology, 41(3), 488-502.

Jones, E. E., & Archer, R. L. (1976). Are there special effects of personalistic self-disclosure? Journal of Experimental Social Psychology, 12(2), 180-193.

Jorgensen, S. R. (1980). Contraceptive attitude-behavior consistency in adolescence. Population & Environment: Behavioral & Social Issues, 3(2), 174-194.

Jourard, S. M. (1959). Healthy personality and self-disclosure. Mental Hygiene, 43, 499-507.

Jourard, S. M. (1961). Religious denomination and self-disclosure. Psychological Reports, 8, 446.

Jourard, S. M. (1961). Self-disclosure patterns in British and American college females. Journal of Social Psychology, 54, 315-320.

Jourard, S. M. (1961). Self-disclosure scores and grades in nursing college. Journal of Applied Psychology, 45(4), 244-247.

Jourard, S. M. (1964). The transparent self. Princeton: Van Nostrand.

Jourard, S. M. (1968). You are being watched. PsycCRITIQUES, 14(3), 174-176.

Jourard, S. M. (1970). The beginnings of self-disclosure. Voices: The Art & Science of Psychotherapy, 6(1), 42-51.

Jourard, S. M. (1970). Experimenter-subject "distance" and self-disclosure. Journal of Personality and Social Psychology, 15(3), 278-282.

Jourard, S. M. (1971). Self-disclosure: An experimental analysis of the transparent self. Oxford, England: John Wiley.

Jourard, S. M., & Jaffe, P. E. (1970). Influence of an interviewer's disclosure on the self-disclosing behavior of interviewees. Journal of Counseling Psychology, 17(3), 252-257.

Jourard, S. M., & Landsman, M. J. (1960). Cognition, cathexis, and the "dyadic effect" in men's self-disclosing behavior. Merrill-Palmer Quarterly, 6, 178-186.

Jourard, S. M., & Resnick, J. L. (1970). Some effects of self-disclosure among college women. Journal of Humanistic Psychology, 10(1), 84-93.

Jourard, S. M., & Richman, P. (1963). Disclosure output and input in college students. Merrill-Palmer Quarterly, 9, 141-148.

Jourard, S. M., & Rubin, J. E. (1968). Self-disclosure and touching: A study of two modes of interpersonal encounter and their inter-relation. Journal of Humanistic Psychology, 8(1), 39-48.

Kelly, A. E., & McKillop, K. J. (1996). Consequences of revealing personal secrets. Psychological Bulletin, 120(3), 450-465.

Kroger, R. O. (1994). The Catholic confession and everyday self-disclosure. In J. Siegfried (Ed.), The status of common sense in psychology (pp. 98-120). Westport, CT: Ablex Publishing.

Levinger, G., & Senn, D. J. (1967). Disclosure of feelings in marriage. Merrill-Palmer Quarterly, 13(3), 237-249.

Maningas, I. (1995). Moral theology. Manila: DLSU Press.

Newcomb, T. M. (1981). The acquaintance process. Oxford, England: Holt, Rinehart & Winston.

Pennebaker, J. W. (1995). Emotion, disclosure, and health: An overview. Emotion, Disclosure, & Health, 14, 3-10.

Priest, R. F., & Sawyer, J. (1967). Proximity and peership: Bases of balance in interpersonal attraction. American Journal of Sociology, 72(6), 633-649.

Richman, S. (1963). Because experience can't be taught. New York State Education, 50(6), 18-20.

Rimé, B. (1995). The social sharing of emotion as a source for the social knowledge of emotion. In J. A. Russell, J. Fernández-Dols, A. Manstead, & J. C. Wellenkamp (Eds.), Everyday conceptions of emotion: An introduction to the psychology, anthropology and linguistics of emotion (pp. 475-489). NATO ASI Series D: Behavioural and Social Sciences, Vol. 81. New York: Kluwer Academic/Plenum Publishers.

Rubin, J. A., & Levy, P. (1975). Art-awareness: A method for working with groups. Group Psychotherapy & Psychodrama, 28, 8-117.

Rubin, Z. (1970). Measurement of romantic love. Journal of Personality and Social Psychology, 16, 265-273.

Starr, P. D. (1975). Self-disclosure and stress among Middle-Eastern university students. Journal of Social Psychology, 97(1), 141-142.

Taylor, D. A., Gould, R. J., & Brounstein, P. J. (1981). Effects of personalistic self-disclosure. Personality and Social Psychology Bulletin, 7(3), 487-492.
Exercise
Give the best type of reliability or validity to use in the following cases.

___________________1. A scale measuring motivation was correlated with a scale measuring laziness; a negative coefficient was expected.

___________________2. An achievement test on personality theories was administered to psychology majors, and the same test was administered to engineering students who have not taken the course. It is expected that there would be a significant difference in the mean scores of the two groups.

___________________3. The 16 PF, which measures 16 personality factors, was intercorrelated with the 12 factors of the Edwards Personal Preference Schedule (EPPS). Both instruments are measures of personality but contain different factors.

___________________4. The Multifactorial Metamemory Questionnaire (MMQ) arrived at three factors when factor analysis was conducted. It had a total of 57 items that originally belonged to 5 factors.

___________________5. The scores on a depression diagnostic scale were correlated with the Minnesota Multiphasic Personality Inventory (MMPI). It was found that clients who are diagnosed as depressive have high scores on the factors of the MMPI.

___________________6. The scores on Mike's mental ability test taken during fourth year high school were used to determine whether he is qualified to enter the college where he wants to study.

___________________7. Maria, who went for drug rehabilitation, was assessed using a self-concept test, and her records containing her previous security scale scores were requested from the company where she was working. The two tests were compared.

___________________8. Mrs. Ocampo, a math teacher, constructs a table of specifications before preparing her test, and after writing the items the test is checked by her subject area coordinator.

___________________9. In an experiment, the self-disclosure of participants was obtained by having three raters listen to the recordings of counseling sessions between a counselor and a client. The raters used an ad hoc self-disclosure inventory, and their ratings were later compared using the coefficient of concordance. The concordance indicates whether the three raters agree in their ratings.

___________________10. A test measuring "sensitivity" was constructed; in order to establish its reliability, the scores for each item were entered in a spreadsheet to determine whether the responses for each item were consistent.
___________________11. The items of a newly constructed personality test measuring Carl Jung's psychological functions used a Likert scale. The scores for each item were correlated in all possible combinations.

___________________12. A test on science was made by Ms. Asuncion, a science teacher. After scoring each test, she determined the internal consistency of the items.

___________________13. In a battery of tests, the Section A class received both the Strong Vocational Interest Blank (SVIB) and the Jackson Vocational Interest Survey (JVIS). Both are measures of vocational interest, and the scores are correlated to determine if one measures the same construct as the other.

___________________14. The Work Values Inventory (WVI) was separated into 2 forms and two sets of scores were generated. The two sets of scores were correlated to see if they measure the same construct.

___________________15. Children's moral judgment was studied to see if it would change over time. The test was administered during the first week of classes and then again at the end of the first quarter.

___________________16. The Study of Values was designed to measure 6 basic interests, motives, or evaluative attitudes: theoretical, economic, aesthetic, social, political, and religious. These six factors were derived after a validity analysis.

___________________17. When the EPPS items were presented in a free-choice format, the scores correlated quite highly with the scores obtained with the regular forced-choice form of the test.

___________________18. The two forms of the MMPI (the Form F and Form K scales) were correlated to detect faking or response sets.

___________________19. In a study by Miranda, Cantina, and Cagandahan (2004), the 15 factors of the Edwards Personal Preference Schedule were intercorrelated.
Lesson 3 Item Difficulty and Item Discrimination
Students are usually keen on judging whether an item is difficult or easy and whether a test is a good test or a bad test based on their own judgment. Whether a test item is easy or difficult is referred to as item difficulty, and whether a test item is a good or a bad item is referred to as item discrimination. Identifying a test item's difficulty and discrimination is referred to as item analysis. Two approaches to item analysis will be presented in this chapter: Classical Test Theory (CTT) and Item Response Theory (IRT). A detailed discussion of the difference between CTT and IRT is found at the end of Lesson 3.
Classical Test Theory
Classical Test Theory (CTT) is also regarded as the "true score theory." Responses of examinees are assumed to be due only to variation in the ability of interest. All other potential sources of variation existing in the testing materials, such as external conditions or the internal conditions of examinees, are assumed either to be held constant through rigorous standardization or to have an effect that is nonsystematic or random in nature. The focus of CTT is the frequency of correct responses (to indicate question difficulty), the frequency of responses (to examine distracters), and the reliability of the test and item-total correlation (to evaluate discrimination at the item level).

Item Response Theory
Item Response Theory (IRT) is synonymous with latent trait theory, strong true score theory, or modern mental test theory. It is more applicable to tests with right and wrong (dichotomous) responses. It is an approach to testing based on item analysis that considers the chance of getting particular items right or wrong. In IRT, each item on a test has its own item characteristic curve that describes the probability of getting the particular item right or wrong given the ability of the test takers (Kaplan & Saccuzzo, 1997).
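The item characteristic curve mentioned above is usually written as a logistic function of the test taker's ability. One common form, the two-parameter logistic model (given here only as an illustration, since the text does not supply a formula), is

P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}

where \theta is the examinee's ability, b_i is the difficulty of item i, and a_i is its discrimination.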
Item difficulty is the percentage of examinees responding correctly to each item in the test. Generally, an item is difficult if a large percentage of the test takers are not able to answer it correctly. On the other hand, an item is easy if a large percentage of the test takers are able to answer it correctly (Payne, 1992). Item discrimination refers to the relation of performance on each item to performance on the total score (Payne, 1992). An item discriminates if most of the high-scoring test takers are able to answer the item correctly, and an item has low discriminating power if the low-scoring test takers can answer the item correctly as often as the high-scoring test takers.

Procedure for Determining the Index of Item Difficulty and Discrimination
1. Arrange the test papers in order from highest to lowest score.
2. Identify the high and low scoring groups by getting the upper 27% and lower 27%. For example, if there are 20 test takers, 27% of the 20 test takers is 5.4; rounding it off gives 5 test takers. This means that the top 5 (high scoring) and the bottom 5 (low scoring) test takers will be included in the item analysis.
3. Tabulate the correct and incorrect responses of the high and low test takers for each item. For example, in the table below there are 5 test takers in the high group (test takers 1 to 5) and 5 test takers in the low group (test takers 6 to 10). Test takers 1 and 2 in the high group got a correct response for items 1 to 5. Test taker 3 was wrong in item 5, marked as "0."

Group | Test taker | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Total
High group | Test taker 1 | 1 | 1 | 1 | 1 | 1 | 5
High group | Test taker 2 | 1 | 1 | 1 | 1 | 1 | 5
High group | Test taker 3 | 1 | 1 | 1 | 1 | 0 | 4
High group | Test taker 4 | 1 | 0 | 1 | 1 | 0 | 3
High group | Test taker 5 | 1 | 1 | 1 | 0 | 0 | 3
High group | Total | 5 | 4 | 5 | 4 | 2 |
Low group | Test taker 6 | 1 | 1 | 0 | 0 | 0 | 2
Low group | Test taker 7 | 0 | 1 | 1 | 0 | 0 | 2
Low group | Test taker 8 | 1 | 1 | 0 | 0 | 0 | 2
Low group | Test taker 9 | 1 | 0 | 0 | 0 | 0 | 1
Low group | Test taker 10 | 0 | 0 | 0 | 1 | 0 | 1
Low group | Total | 3 | 3 | 1 | 1 | 0 |
4. Get the total correct responses for each item and convert them into a proportion. The proportion is obtained by dividing the total correct responses for each item by the total number of test takers in the group. For example, in item 2, the total correct response is 4; dividing it by 5, which is the total number of test takers in the high group, gives a proportion of .8. The procedure is done for both the high and low groups.

pH = total correct responses in the high group / N per group
pL = total correct responses in the low group / N per group

Group | | Item 1 | Item 2 | Item 3 | Item 4 | Item 5
High group | Total correct | 5 | 4 | 5 | 4 | 2
High group | Proportion of the high group (pH) | 1 | .8 | 1 | .8 | .4
Low group | Total correct | 3 | 3 | 1 | 1 | 0
Low group | Proportion of the low group (pL) | .6 | .6 | .2 | .2 | 0
5. Obtain the item difficulty by adding the proportion of the high group (pH) and the proportion of the low group (pL) and dividing by 2 for each item.

Item difficulty = (pH + pL) / 2

 | Item 1 | Item 2 | Item 3 | Item 4 | Item 5
Proportion of the high group (pH) | 1 | .8 | 1 | .8 | .4
Proportion of the low group (pL) | .6 | .6 | .2 | .2 | 0
Item difficulty | .8 | .7 | .6 | .5 | .2
Interpretation | Easy item | Easy item | Average item | Average item | Difficult item

The table below is used to interpret the index of difficulty. Given the table below, items 1 and 2 are easy items because they have high correct-response proportions for both the high and low groups. Items 3 and 4 are average items because the proportions are within the .25 to .75 middle bounds. Item 5 is a difficult item considering that there are low proportions correct for the high and low groups; only 40% of the high group were able to answer it and none got it correct in the low group (0). Generally, as the index of difficulty approaches "0," the more difficult the item is; as it approaches "1," the easier it becomes.

Difficulty Index | Remark
.76 or higher | Easy item
.25 to .75 | Average item
.24 or lower | Difficult item
6. Obtain the item discrimination by getting the difference between the proportion of the high group and the proportion of the low group for each item.

Item discrimination = pH - pL

 | Item 1 | Item 2 | Item 3 | Item 4 | Item 5
Proportion of the high group (pH) | 1 | .8 | 1 | .8 | .4
Proportion of the low group (pL) | .6 | .6 | .2 | .2 | 0
Item discrimination | .4 | .2 | .8 | .6 | .4
Interpretation | Very good item | Reasonably good item | Very good item | Very good item | Very good item
The table below is used to interpret the index of discrimination. Generally, the larger the difference between the proportions of the high and low groups, the better the item, because a large gap in correct responses between the high and low groups is shown, as in items 1, 3, 4, and 5. In the case of item 2, a large proportion of the low group (60%) got the item correct compared with the high group (80%), resulting in a small difference (20%) and making the item only reasonably good.
Index of discrimination | Remark
.40 and above | Very good item
.30 to .39 | Good item
.20 to .29 | Reasonably good item
.10 to .19 | Marginal item
Below .10 | Poor item
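The whole procedure above can be carried out in a few lines of code. The sketch below uses the five-item example data from the preceding tables (1 = correct, 0 = wrong for the upper and lower 27% groups) and the interpretation bands just given; it is an illustration of the computation, not part of the original lesson.

import numpy as np

# Responses of the upper and lower 27% groups to the five example items (1 = correct, 0 = wrong).
high = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 0],
                 [1, 0, 1, 1, 0],
                 [1, 1, 1, 0, 0]])
low = np.array([[1, 1, 0, 0, 0],
                [0, 1, 1, 0, 0],
                [1, 1, 0, 0, 0],
                [1, 0, 0, 0, 0],
                [0, 0, 0, 1, 0]])

pH = high.mean(axis=0)        # proportion correct in the high group
pL = low.mean(axis=0)         # proportion correct in the low group
difficulty = (pH + pL) / 2    # item difficulty index
discrimination = pH - pL      # item discrimination index

def difficulty_remark(d):
    return "Easy" if d >= .76 else ("Average" if d >= .25 else "Difficult")

def discrimination_remark(D):
    if D >= .40:
        return "Very good"
    if D >= .30:
        return "Good"
    if D >= .20:
        return "Reasonably good"
    if D >= .10:
        return "Marginal"
    return "Poor"

for i, (d, D) in enumerate(zip(difficulty, discrimination), start=1):
    print(f"Item {i}: difficulty {d:.2f} ({difficulty_remark(d)}), "
          f"discrimination {D:.2f} ({discrimination_remark(D)})")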
Analyzing Item Distracters
Analyzing item distracters involves determining whether the options in a multiple-response item type are effective. In multiple-response types such as multiple choice, the test taker chooses the correct answer from among the options; the incorrect options are the distracters. In creating distracters, the test developer ensures that they belong to the same category as the answer and are close to it. For example:

What cognitive skill is demonstrated in the objective "Students will compose a five paragraph essay about their reflection on modern day heroes"?

a. Understanding
b. Evaluating
c. Applying
d. Creating
Correct answer: d

The distracters for the given item are all cognitive skills in Bloom's revised taxonomy, where all can be a possible answer but there is one best answer. In analyzing whether the distracters are effective, the frequency of examinees selecting each option is reported.

Group | Group size | a | b | c | d
High | 15 | 1 | 3 | 1 | 10
Low | 15 | 1 | 6 | 1 | 7

Total no. of correct: 17          Difficulty index: .57          Discrimination index: .20
For the given item with the correct answer of letter d, the majority of the examinees in both the high and low groups chose option "d," which is the correct answer. Among the high group, distracters a, b, and c are not effective distracters because very few examinees selected them. For the low group, option "b" can be an effective distracter because 40% of the examinees (6 examinees) selected it as their answer, as opposed to 47% (7 examinees) who got the correct answer. In this case, distracters "a" and "c" need some revision to bring them closer to the answer and make them more attractive to test takers.
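A small sketch of this kind of distracter tally is given below, using the counts from the example item. The flagging rules (a distracter chosen by fewer than 10% of each group, or one that attracts more of the high group than the low group) are common rules of thumb assumed here for illustration rather than rules stated in the text.

# Option counts for the example item (correct answer: d) in the high and low groups.
high_counts = {"a": 1, "b": 3, "c": 1, "d": 10}
low_counts = {"a": 1, "b": 6, "c": 1, "d": 7}
key = "d"
n_high, n_low = 15, 15

for option in "abcd":
    if option == key:
        continue
    p_high = high_counts[option] / n_high
    p_low = low_counts[option] / n_low
    flags = []
    if p_high < 0.10 and p_low < 0.10:
        flags.append("rarely chosen")            # assumed rule of thumb: the distracter is not functioning
    if p_high > p_low:
        flags.append("attracts the high group")  # assumed rule: distracters should pull the low group more
    status = ", ".join(flags) if flags else "working distracter"
    print(f"Option {option}: high {p_high:.2f}, low {p_low:.2f} -> {status}")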
EMPIRICAL REPORT
Construction and Development of a Test Instrument
Carlo Magno

Abstract
This study investigated the psychometric properties and item analysis of a one-unit test in geography for grade three students. The skills and contents of the test were based on the contents covered for the first quarter as indicated in the syllabus. A table of specifications was constructed to frame the items into three cognitive skills: knowledge, comprehension, and application. The test has a total of 40 items distributed over 10 different test types. The items were reviewed by a social studies teacher and an academic coordinator. Split-half reliability was used and a correlation of .3 was obtained. The test types were also intercorrelated, resulting in both low and high coefficients. The item analysis showed that most of the items turned out to be easy and most are good items.
The purpose of this study is to construct and analyze the items of a one-unit geography test for grade three students. The test basically measures grade three students' achievement in Philippine Geography for the first quarter and served as a quarterly test. Once standardized through validation and reliability analysis, the test could be used as a future achievement test in Philippine Geography. There is a need to construct and standardize an achievement test in Philippine Geography since none is yet available locally. The test is in the Filipino language because of the nature of the subject. The subject covers topics on (1) Kapuluan ng Pilipinas; (2) Malalaki at Maliliit na Pulo ng Bansa; (3) Mapa at Uri ng
Mapa; (4) Mga Direksyon; (5) Anyong Lupa at Anyong Tubig; (6) Simbolong Ginagamit sa Mapa; (7) Panahon at Klima; (8) Mga Salik na may Kinalaman sa Klima; (9) Mga Pangunahing Hanapbuhay sa Bansa; and (10) Pag-aangkop sa Kapaligiran. The topics were based on the lessons provided in the Elementary Learning Competencies from the Department of Education. The test aims for the students to: (1) identify the important concepts and definitions; (2) comprehend and explain the reasons for given situations and phenomena; and (3) use and analyze different kinds of maps in identifying important symbols and familiarity with places.

Method

Search for Skills and Content Domain
The skills and contents of the test were identified based on the topics covered by grade three students in the first quarter. The test is intended to be administered as the first quarter exam. The skills intended for the first quarter's topics include identifying concepts and terms, comprehending explanations, applying principles to situations, using and analyzing maps, synthesizing different explanations for a particular event, and evaluating the truthfulness and validity of reasons and statements through inference. In constructing the test, a table of specifications was first constructed to plan out the distribution of items for each topic and the objectives to be attained by the students.
Table 1. Table of Specification for a Unit in Philippine Geography for Grade 3

Columns: Nilalaman; Natutukoy ang mahahalagang konsepto at kahulugan; Nauunawaan ang mga dahilan sa mahahalagang kapaliwanagan sa bawat sitwasyon; Nagagamit at nasusuri ang mapa sa pagtukoy ng mga mahahalagang pananda; Total Number of Items

Kapuluang Pilipinas 4 4
Malalaki at maliliit na pulo ng bansa 4 4
Mapa at Uri ng mapa 4 4
Mga direksyon
Anyong lupa at Anyong Tubig
Simbolong ginagamit sa mapa
Panahon at Klima
Mga salik na may kinalaman sa klima
Mga pangunahing hanapbuhay ng bansa
Pag-aangkop sa kapaligiran
Total Number of Items
Percentage
6
6 5
4
4
3
5 2
5
2 2 3
3
3
3
11
16
13
40
27.5%
40%
32.5%
100
Table of Specifications
The table of specification contains 10 topics comprising a unit on Philippine Geography. Of the items, 27.5% were placed at the knowledge level, 40% at the comprehension level, and 32.5% at the application level. Most of the items were concentrated on comprehension since the main purpose is for the students to understand and comprehend the unit on Philippine Geography, which is the foundational knowledge for the entire lesson for the school year; having mastered this base knowledge will help students explain and give reasons in the next lessons to be taken. Also, many items were distributed at the application level since the students need to learn practically how to use maps and how they could benefit from using the maps and figures of the unit. Few items were
placed at the knowledge level since there is little need for the students to recall and memorize concepts and terms. The main highlight of this unit is to gain the ability to explain geographical principles of Philippine geography and their relatedness to our culture.

Item Writing
There were 40 items constructed based on the Table of Specification (see Table 1). A 40-item test is just enough for grade three students since it is neither too many nor too few for their capacity. In determining the number of items to place on the test, the attention span of the students and the time frame for testing were also considered. In the quarterly test, a particular test on a subject is given a time limit of one hour. The items were based mostly on what the students gained from the discussions in the
classroom, reflection on the topic, work exercises, group work, activities in school, and the book. The items were divided into 10 parts in the test. Test I contains 4 items in a true or false type. Test II contains 5 items in a matching type of test. Test III contains 2 items in a multiple choice type in which the stem is based on a figure presented. Test IV contains 4 items within 2 situations. Test V contains 4 items in a multiple choice type, with a physical map as the basis for answering. Test VI is another multiple choice type and concentrates on the use of different types of maps. Test VII is a short-answer type of test containing 6 items in which the students supply the direction asked for in the question based on a map presented. Test VIII is a 5-item interpretive exercise in which a situation is given, inferences are listed for each situation, and the task of the students is to choose the best inference applicable to the given situation. Test IX is a three-item multiple choice type in which the students answer based on a figure of a Philippine map and the weather condition given. Test X is a three-point essay question evaluated according to (a) correctness of the answer (1.5 pts); (b) explanation (1 pt); and (c) following the instructions (0.5 pt). There were two raters who evaluated the answers for the essay type of test.

Content Validation
The test was content validated and reviewed by a teacher in Social Studies from Ateneo de Davao. The suggestions were considered and the test was revised accordingly. Also, before arriving at the final draft of the test for administration, it was checked by the academic coordinator of the school where the test was to be administered to determine whether the items are appropriate for the level of grade three students and to correct some typographical errors. In the process of content validation, the topics covered and the table of specifications were provided in order to determine whether the items generally covered the topics studied.
Test Administration
Respondents. There were 88 grade 3 students in three sections who took the test for the purpose of a quarter examination. Out of the 88 students, the top 40 students were included in the sample. There were 11 (27%) respondents each for the upper and lower groups, whose scores were subjected to item analysis for difficulty and discrimination.

Procedure. The teacher for grade 3 Sibika at Kultura directly instructed the two other teachers who administered the test to the two other sections. Consistency of administration and other factors that could affect the students' performance on the test were taken into consideration. The test was administered simultaneously to the three classes in the morning as the first test to be taken for that day. The students took the test for one hour; some students were able to finish the test ahead of time, and they were advised to review their work. When the bell rang, the teacher instructed the students to pass their papers forward. All the test papers were gathered and checked. After a week the students were informed about their results, and the top 40 students included in the sample were informed about the teachers' concern for their test. A letter of request was sent to the parents to inform them about the purpose of the research and the students' scores; the parents replied positively.

Data Analysis. The scores were tabulated and encoded so that the computation of the results would be easy. The split-half method was employed for obtaining the internal consistency among the scores. The odd and even items were separated and correlated using the Pearson r correlation coefficient. The upper and lower groups were chosen according to the highest and lowest 27% among the 40 respondents. The item analysis was carried out by computing each item's difficulty and discrimination. A remark for each item was then given according to the standards for difficulty and discrimination, indicating whether it is a good item or not. The coefficient of concordance was used to determine the inter-rater
reliability of the essay type of test. There were two judges who evaluated and used criteria to score the essay part of the test.
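The split-half and Spearman-Brown steps described in this data analysis can be sketched as follows; the response matrix here is randomly generated placeholder data, not the actual grade three scores.

import numpy as np

rng = np.random.default_rng(1)
# Placeholder data: 40 examinees x 36 dichotomously scored items.
scores = rng.integers(0, 2, size=(40, 36))

odd_half = scores[:, 0::2].sum(axis=1)            # score on the odd-numbered items
even_half = scores[:, 1::2].sum(axis=1)           # score on the even-numbered items
r_half = np.corrcoef(odd_half, even_half)[0, 1]   # Pearson r between the two halves
r_full = (2 * r_half) / (1 + r_half)              # Spearman-Brown corrected coefficient

print(round(r_half, 2), round(r_full, 2))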
Results and Discussion

Reliability
The test's reliability was generated through the split-half method by correlating the odd-numbered and even-numbered items. The obtained internal consistency was 0.3, which indicates a low but definite correlation among the items. The low correlation between the odd- and even-numbered items can be accounted for by the different topic contents within the 40-item test. It would have been more appropriate to construct a large pool of items for the 10 content topics or factors of the test, but 40 items is the school's usual standard for the quarterly test. The test was administered for the purpose of the quarterly test because the usability of the test was considered. With this type of measure, only the reliability of half of the test is accounted for, which explains the low value of the correlation coefficient. The split-half coefficient was therefore transformed into a Spearman-Brown coefficient since the correlation covers only half of the test. The resulting Spearman-Brown coefficient is 0.46, which means that the items have a moderate relationship. Also, as a rule of thumb there should be at least 30 pairs of scores to be correlated, but in this case only 18 pairs of scores were correlated. The last item was not included since it had no partner item to be correlated with, because the remaining items were of the essay type, which was subjected to a different analysis. The low coefficient of internal consistency can also be accounted for by the various types of tests used, and thus by the variation and differences in the performance of the respondents; in other words, the respondents may respond and perform differently on each type of test. The test cannot be assessed for general homogeneity since it contains several topics and several response formats, although the content domains included in the test are all part of a general topic on Philippine geography. To examine the internal consistency among the 9 different contents, a correlation matrix was computed.
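In equation form, the Spearman-Brown step applied to the obtained split-half coefficient of .30 is

r_{SB} = \frac{2 r_{half}}{1 + r_{half}} = \frac{2(.30)}{1 + .30} \approx .46

which is the standard correction for estimating the reliability of the full-length test from the correlation between its two halves.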
Table 2. Intercorrelation among the Nine Contents of the Test

     | I      | II     | III    | IV     | V      | VI     | VII    | VIII   | IX
I    | --     |        |        |        |        |        |        |        |
II   | 0.13   | --     |        |        |        |        |        |        |
III  | 0.98*  | -0.81* | --     |        |        |        |        |        |
IV   | 0.18   | -0.42  | 0.48*  | --     |        |        |        |        |
V    | -0.21  | 0.58*  | 0.47*  | 0.19   | --     |        |        |        |
VI   | 0.19   | 0.28   | 0.47*  | 0.6    | 0.65*  | --     |        |        |
VII  | -0.73  | -0.19  | 0.41*  | -0.56* | 0.73*  | 0.24   | --     |        |
VIII | 0.07   | -0.58* | -0.47* | 0.96*  | 0.08   | -0.8   | 0.25   | --     |
IX   | 0.85*  |        | 0.48*  | 0.15   | 0.97*  | -0.52* | -0.52* | 0.28   | --
There is a high relationship between Test I and Test IX: the higher the scores on the identification of concepts, the higher the scores on the comprehension of the weather map. A high relationship also exists between Test V and Test IX: the higher the scores on the interpretation of a physical map, the higher the scores on the interpretation of the weather map. There is also a high relationship between Test IV and Test VIII: the higher the scores on the inferences about the Philippine islands, the higher the scores on the comprehension of weather. Generally, the inter-correlations among the contents are fairly crude because the items are few and the number of items in each test type is not equal. The pairing in the computation was based on the minimum number of items for each test type.

Item Difficulty and Discrimination Index
To evaluate the quality of each item in the test, item analysis was done by determining each item's difficulty and discrimination index. The proportion of examinees getting each item correct was evaluated according to the scale below.

Difficulty Index | Remark
.76 or higher | Easy item
.25 to .75 | Average item
.24 or lower | Difficult item
Source: Lamberte, B. (1998). Determining the Scientific Usefulness of Classroom Achievement Test. Cutting Edge Seminar. De La Salle University.
Table 3 indicates each item's difficulty value and discrimination index value. The difficulty indices show a pattern in which 67.6% of the items are easy and 32.43% of the test is on the average scale. Considering that the test was constructed for grade three students, the teacher was pitching it at the level of the students' capacity and ability. It may also mean that the students gained such mastery of the subject matter that most of them were able to answer the items correctly. It should be noted that the easiness or difficulty of the
items is dictated by the proportion of the students who answered the item correctly. In this case, most of the respondents got the answers correct, which is why most of the items turned out to be easy. It can be said that, in general, the test was fairly easy since most of the items had difficulty indices of .76 and above. Table 3 also indicates the discrimination index of each item. There were 27% of the items that are considered poor. These items were rejected since the scores of the low group are in the high range and some scores of the low group are near the scores of the high group who answered the items correctly. Considering the poor items (items 2, 4, 9, 13, 15, 30, 31, 32, 33, and 34), the pattern is indicative. There are very few marginal items that are subject to improvement: only 8% (3 items) are remarked as marginal, since the scores of the low group and the high group are almost the same, which means that both the high and the low groups can answer these items fairly equally. Another 21.6% (8 items) are reasonably good items since there is enough of an interval between the high and low groups. There are also a few items remarked as good items and enough items considered very good items: 16.21% of the items are good items and 24.3% are very good items, and for these there is a pattern of a wide distance between the scores of the high group and the low group.

Interrater Reliability
The coefficient of concordance was used to determine the degree of agreement between the two raters who judged the essay type in the test. The essay type basically measures the students' knowledge of the adaptation of farmers in farming. The criteria used for rating the essay were: (a) at least 2 answers are correct (1.5 pts); (b) the answer was explained (1 pt); and (c) the instructions for answering were followed (0.5 pt). The results indicate the degree of agreement between the two raters: a high value of W, 0.74, was computed, indicating close
concordance between the raters. This means that the two raters showed only a small variation in rating the answers to the essay. The small error variance can be attributed to the difference in the dispositions of the two raters: the first rater was the actual teacher of the subject, while the second rater was also an Araling Panlipunan teacher but taught at a higher grade level. There was a difference in how they viewed the answers even though they discussed the rating procedure at the start.
Conclusion

A low internal consistency was generated because the test covers different subject contents and each part measures different skills; these two factors affected the internal consistency of the test. It is indeed difficult to make the test entirely uniform since the subject contents are required as minimum learning competencies by the Department of Education, and the listed subject contents are the planned focus of the school's subject matter budgeting for the first quarter. An inter-correlation analysis was performed to observe the relationships among the test types. It was found that the higher the scores on the interpretation of a physical map, the higher the scores on the interpretation of the weather map; and the higher the scores on the inferences about the Philippine Islands, the higher the scores on the comprehension of topics about weather. A high correlation coefficient was found between these test types, although the results may not be very accurate since the comparisons were not based on equal numbers of items and only the minimum number of items was subjected to the analysis. It is recommended that an equal number of items for each test type be used to obtain a more accurate result. Some disagreement remained between the two raters for the essay type since they had different perceptions in giving points to the answers. The item difficulty analysis showed that most of the items are easy, suggesting that the students have gained mastery of the subject matter. The discrimination indices showed how the items are distributed according to their discriminating power: poor (27%), marginal (8%), reasonably good (22%), good (16%), and very good (24%).
Table 3. Item Difficulty and Item Discrimination

Item No. | Total | High Group | Low Group | PH | PL | Difficulty Index | Remark | Item Discrimination | Remark
1 | 32 | 11 | 7 | 1 | 0.636 | 0.818 | Easy Item | 0.364 | Good item
2 | 26 | 7 | 6 | 0.636 | 0.545 | 0.591 | Average Item | 0.091 | Poor item
3 | 34 | 11 | 7 | 1 | 0.636 | 0.818 | Easy Item | 0.364 | Good item
4 | 38 | 11 | 10 | 1 | 0.909 | 0.955 | Easy Item | 0.909 | Poor item
5 | 36 | 11 | 8 | 1 | 0.727 | 0.864 | Easy Item | 0.273 | Reasonably Good item
6 | 34 | 11 | 5 | 1 | 0.455 | 0.727 | Average Item | 0.545 | Very Good item
7 | 33 | 10 | 8 | 0.909 | 0.727 | 0.818 | Easy Item | 0.182 | Marginal item
8 | 34 | 11 | 8 | 1 | 0.909 | 0.864 | Easy Item | 0.273 | Reasonably Good item
9 | 39 | 11 | 10 | 1 | 0.634 | 0.955 | Easy Item | 0.091 | Poor item
10 | 24 | 9 | 4 | 0.818 | 0.456 | 0.591 | Average Item | 0.455 | Very Good item
11 | 23 | 9 | 5 | 0.818 | 0.273 | 0.636 | Average Item | 0.364 | Good item
12 | 22 | 10 | 3 | 0.818 | 0.818 | 0.545 | Average Item | 0.545 | Very Good item
13 | 36 | 11 | 9 | 0.909 | 0.727 | 0.864 | Easy Item | 0.091 | Poor item
14 | 34 | 11 | 8 | 1 | 1 | 0.864 | Easy Item | 0.273 | Marginal item
15 | 39 | 10 | 11 | 1 | 1 | 1 | Easy Item | 0 | Poor item
16 | 28 | 10 | 5 | 0.909 | 0.455 | 0.682 | Average Item | 0.455 | Very Good item
17 | 28 | 11 | 5 | 0.909 | 0.455 | 0.682 | Average Item | 0.455 | Very Good item
18 | 34 | 10 | 7 | 1 | 0.636 | 0.818 | Easy Item | 0.364 | Good item
19 | 24 | 11 | 5 | 0.909 | 0.455 | 0.682 | Average Item | 0.455 | Very Good item
20 | 37 | 11 | 8 | 1 | 0.727 | 0.864 | Easy Item | 0.273 | Reasonably Good item
21 | 29 | 11 | 5 | 1 | 0.455 | 0.727 | Average Item | 0.545 | Very Good item
22 | 26 | 11 | 5 | 1 | 0.455 | 0.727 | Average Item | 0.545 | Very Good item
23 | 33 | 11 | 7 | 1 | 0.636 | 0.818 | Easy Item | 0.364 | Good item
24 | 37 | 11 | 8 | 1 | 0.727 | 0.864 | Easy Item | 0.273 | Reasonably Good item
25 | 37 | 7 | 8 | 1 | 0.818 | 0.864 | Easy Item | 0.273 | Reasonably Good item
26 | 24 | 11 | 9 | 1 | 0.364 | 0.909 | Easy Item | 0.182 | Marginal item
27 | 37 | 11 | 4 | 0.636 | 0.818 | 0.5 | Average Item | 0.273 | Reasonably Good item
28 | 35 | 11 | 9 | 1 | 0.636 | 0.909 | Easy Item | 0.182 | Marginal item
29 | 39 | 11 | 7 | 1 | 0.909 | 0.818 | Easy Item | 0.364 | Good item
30 | 40 | 11 | 10 | 1 | 1 | 0.955 | Easy Item | 0.091 | Poor item
31 | 40 | 11 | 11 | 1 | 1 | 1 | Easy Item | 0 | Poor item
32 | 40 | 11 | 11 | 1 | 1 | 1 | Easy Item | 0 | Poor item
33 | 40 | 11 | 11 | 1 | 1 | 1 | Easy Item | 0 | Poor item
34 | 40 | 11 | 11 | 1 | 1 | 1 | Easy Item | 0 | Poor item
35 | 27 | 11 | 3 | 1 | 0.273 | 0.636 | Easy Item | 0.727 | Marginal item
36 | 24 | 9 | 4 | 0.818 | 0.364 | 0.591 | Average Item | 0.455 | Very Good item
37 | 36 | 11 | 7 | 1 | 0.636 | 0.818 | Easy Item | 0.364 | Good item
Item Response Theory: Obtaining Item Difficulty Using the Rasch Model

IRT is an approach to testing based on item analysis that considers the chance of getting particular items right or wrong. In IRT, each item on a test has its own item characteristic curve that describes the probability of getting that item right or wrong given the ability of the test takers (Kaplan & Saccuzzo, 1997). This will be illustrated in a later section through the computational procedure. When the Rasch model is used as an approach for determining item difficulty, the calibration of test item difficulty is independent of the persons used for the calibration, unlike in the classical test theory approach where it is dependent on the group. In this method of test calibration, it does not matter whose responses to the items are used for comparison; it gives the same results regardless of who takes the test. The score a person obtains on the test can be used to remove the influence of the person's ability from the estimation of item difficulty; thus, the result is a sample-free item calibration. Rasch (1960), who derived the technique, intended to eliminate references to populations of examinees in the analysis of tests, unlike in classical test theory where norms are used to interpret test scores. According to him, test analysis would only be worthwhile if it were individual-centered, with separate parameters for the items and the examinees (van der Linden & Hambleton, 2004). The Rasch model is a probabilistic unidimensional model which asserts that: (1) the easier the question, the more likely the student will respond correctly to it; and (2) the more able the student, the more likely he/she will pass the question compared with a less able student. When the data fit the Rasch model, the relative difficulties of the questions are independent of the relative abilities of the students, and vice versa (Rasch, 1977). As shown in the graph below (Figure 1), a function of ability (θ), which is a latent trait, forms the boundary between the probability areas of answering an item incorrectly and answering the item correctly.
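For reference, the relationship described above is commonly written as the one-parameter logistic function shown below. This is a standard statement of the model using the θ and δ notation of this section, not a formula copied from the cited sources:

\[
P(X_{ni} = 1 \mid \theta_n, \delta_i) = \frac{e^{\theta_n - \delta_i}}{1 + e^{\theta_n - \delta_i}}
\]

When a person's ability equals the item's difficulty (θn = δi), the probability of a correct response is .50; when θn − δi = 1 logit, the probability rises to about .73.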
Figure 1. Item Characteristic Curves of an 18-Item Mathematical Problem Solving Test (curves toward the left of the ability scale correspond to easy items; those toward the right correspond to difficult items).
In the item characteristic curve, the expected score on the item reflects ability (θ), and the x-axis is the range of item difficulties in log units. It can be noticed that items 1, 7, 14, 2, 8, and 15 do not require high ability to be answered correctly, compared with items 5, 12, 18, and 11, which require high ability. The item characteristic curves are judged at the 50% probability level with a cutoff of "0" on item difficulty: the curves to the left of the "0" item difficulty at the 50% level are easy items, and the ones to the right are difficult items. The program WINSTEPS was used to produce the curves. The Rasch model basically identifies the location of a person's ability within a set of items for a given test. The test items have a predefined set of difficulties, and the person's position should reflect how his or her ability matches the difficulty of the items. The ability of the person is symbolized by θ and the difficulty of the items by δ. In the figure below, there are 10 items (δ1 to
δ10), and the location of the person's ability (θ) is between δ7 and δ8. In the continuum, the items are arranged from the easiest (at the left) to the most difficult (at the right). If the position of the person's ability is between δ7 and δ8, then it is expected that the person taking the test should be able to answer items δ1 to δ6 ("1" correct response, "0" incorrect response), since these items are answerable given the person's level of ability. This kind of calibration is said to fit the Rasch model, where the position of the person's ability falls along a defined line of item difficulties. Case 1
In Case 2, the person is able to answer four difficult items but unable to respond correctly to the easy items. There is now difficulty in locating the person on the continuum. If the items are valid measures of ability, then the easy items should be more answerable than the difficult ones. This means that the items are not suited to the person's ability. This case does not fit the Rasch model. Case 2
The Rasch model allows person ability (θ) to be estimated from the person's score on the test and item difficulty (δ) to be estimated from the number answering the item correctly, each separately, which is why it is considered to be test-free and sample-free. In some cases, the person's ability (θ) is higher than the specified item difficulty (δ), so the difference (θ–δ) is greater than zero. But when the ability (θ) is less than the specified item difficulty (δ), the difference (θ–δ) is less than 0, as in Case 2. When the ability of the person (θ) is equivalent to the item's difficulty (δ), the difference (θ–δ) is 0, as in Case 1. This variation in person responses and item difficulty is
represented in an Item Characteristic Curve (ICC), which shows the way the item elicits responses from persons of every ability (Wright & Stone, 1979).

Figure 1. ICC of a Given Ability and Item Difficulty

An estimate of response x is obtained when a person with ability (θ) acts on an item with difficulty (δ). The model specifies that, in the interaction between ability (θ) and item difficulty (δ), when ability is greater than the difficulty, the probability of getting the correct answer is more than .5 or 50%; when ability is less than the difficulty, the probability of getting the correct answer is less than .5 or 50%. The variation of these estimates of the probability of a correct response is illustrated in Figure 1. The mathematical units for θ and δ are defined as logits (natural log functions) to produce a linear scale and generality of measure. The next section guides you in estimating the calibration of item difficulty and the person ability measure.

Procedure for the Rasch Model

The Rasch model will be applied to the responses of 10 students on a 25-item problem solving test. In determining item difficulty with the Rasch model, all participants who took the test are included, unlike in classical test theory where only the upper and lower 27% are included in the analysis.
Item Responses of the 10 Examinees on the 25-Item Problem Solving Test (1 = correct, 0 = incorrect)

Examinee | Items 1-25 | Total
9 | 0 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 | 20
10 | 1 1 0 0 1 1 0 0 1 1 0 0 1 0 0 0 1 1 0 1 0 1 1 0 1 | 13
5 | 1 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 1 1 1 0 0 1 0 1 1 | 11
3 | 0 0 1 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 0 0 1 0 0 1 | 10
8 | 1 0 1 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 | 10
1 | 1 0 1 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 1 | 9
6 | 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 | 9
7 | 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 | 9
4 | 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 0 0 1 0 1 | 8
2 | 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 | 7
Item total (si) | 5 3 6 1 3 4 8 5 2 4 3 3 6 5 6 1 4 2 6 5 2 5 5 2 10 |
Grouped Distribution of Different Item Scores

1. Code each response to each item as "1" for a right answer and "0" for a wrong answer.
2. Arrange the persons' scores from highest to lowest.
3. Remove items that all respondents got correct.
4. Remove items that all respondents got wrong.
5. Rearrange the persons' scores from highest to lowest.
6. Group the items with the same total item score (si).
7. Indicate the frequency (fi) of items for each item score group.
8. Divide each total item score (si) by N to obtain the proportion correct: ρi = si / N.
9. Subtract ρi from 1 to obtain the proportion incorrect: (1 − ρi).
10. Divide the proportion incorrect by the proportion correct and take the natural log to obtain the logit incorrect: xi = ln[(1 − ρi) / ρi].
11. Multiply the frequency (fi) by the logit incorrect (xi): fixi.
12. Square xi and multiply by each fi: fi(xi)².
13. Compute the value of x•: x• = Σfixi / Σfi.
14. To get the initial item calibration (doi), subtract x• from the logit incorrect: doi = xi − x•.
15. Estimate the value of U, which will be used later in the final estimates: U = [Σfi(xi)² − (Σfi)(x•)²] / (Σfi − 1).
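As an illustration only (this code is not part of the original procedure), steps 6 to 15 can be sketched in Python using the item scores from the response matrix above; items answered correctly by everyone (item 25) are assumed to have already been removed in steps 3 and 4.

```python
# A minimal sketch of steps 6-15 (item score groups, logits, x-bar, and U).
import math
from collections import Counter

N = 10  # number of examinees

# Total item scores (si) for the 24 remaining items, keyed by item number.
item_scores = {7: 8, 3: 6, 13: 6, 15: 6, 19: 6, 1: 5, 8: 5, 14: 5, 20: 5, 22: 5,
               23: 5, 6: 4, 10: 4, 17: 4, 2: 3, 5: 3, 11: 3, 12: 3, 9: 2, 18: 2,
               21: 2, 24: 2, 4: 1, 16: 1}

groups = Counter(item_scores.values())              # si -> fi (frequency of each item score)
rows = []
for si, fi in sorted(groups.items(), reverse=True):
    p = si / N                                      # proportion correct (step 8)
    x = math.log((1 - p) / p)                       # logit incorrect (step 10)
    rows.append((si, fi, x))

sum_f = sum(fi for _, fi, _ in rows)                # Σfi = 24
sum_fx = sum(fi * x for _, fi, x in rows)           # Σfixi ≈ 11.5
sum_fx2 = sum(fi * x * x for _, fi, x in rows)      # Σfi(xi)² ≈ 23.3
x_bar = sum_fx / sum_f                              # x• ≈ 0.48 (step 13)
U = (sum_fx2 - sum_f * x_bar ** 2) / (sum_f - 1)    # U ≈ 0.77 (step 15)

for si, fi, x in rows:
    print(si, fi, round(x, 2), round(x - x_bar, 2)) # doi = xi - x• (step 14)
print(round(x_bar, 2), round(U, 2))
```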
Table 1. Grouped Distribution of the 7 Different Item Scores of 10 Examinees

Item Score Group Index (i) | Item Name | Item Score (si) | Item Frequency (fi) | Proportion Correct (ρi) | Proportion Incorrect (1 − ρi) | Logit Incorrect (xi) | Frequency × Logit (fixi) | Frequency × Logit² (fi(xi)²) | Initial Item Calibration (doi = xi − x•)
1 | 7 | 8 | 1 | 0.8 | 0.2 | -1.39 | -1.39 | 1.92 | -1.87
2 | 3, 13, 15, 19 | 6 | 4 | 0.6 | 0.4 | -0.41 | -1.62 | 0.66 | -0.89
3 | 1, 8, 14, 20, 22, 23 | 5 | 6 | 0.5 | 0.5 | 0.00 | 0.00 | 0.00 | -0.48
4 | 6, 10, 17 | 4 | 3 | 0.4 | 0.6 | 0.41 | 1.22 | 0.49 | -0.07
5 | 2, 5, 11, 12 | 3 | 4 | 0.3 | 0.7 | 0.85 | 3.39 | 2.87 | 0.37
6 | 9, 18, 21, 24 | 2 | 4 | 0.2 | 0.8 | 1.39 | 5.55 | 7.69 | 0.91
7 | 4, 16 | 1 | 2 | 0.1 | 0.9 | 2.20 | 4.39 | 9.66 | 1.72
Totals | | | Σfi = 24 | | | | Σfixi = 11.54 | Σfi(xi)² = 23.29 |

x• = Σfixi / Σfi = 11.54 / 24 = 0.48

U = [Σfi(xi)² − (Σfi)(x•)²] / (Σfi − 1) = [23.29 − (24)(0.48)²] / (24 − 1) = 0.77
Grouped Distribution of Observed Person Scores

16. List the possible scores (r) out of the total number of items (L).
17. Count the number of persons at each possible score to obtain the person frequency (nr).
18. Divide each possible score by the total number of items to obtain the proportion correct: ρr = r / L.
19. Obtain the proportion incorrect by subtracting the proportion correct from 1: (1 − ρr).
20. Determine the logit correct (yr) as the natural log of the ratio of the proportion correct (ρr) to the proportion incorrect (1 − ρr): yr = ln[ρr / (1 − ρr)].
21. Multiply the logit correct (yr) by each person frequency (nr): nryr.
22. Square the logit correct (yr²) and multiply by the person frequency (nr): nr(yr)².
23. The logit correct (yr) is the initial person measure: bro = yr.
24. Compute the values of y• and V, to be used later in the final estimates.
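Again purely as an illustration (not part of the original text), steps 17 to 24 can be sketched in Python using the observed examinee raw scores:

```python
# A minimal sketch of steps 17-24 (person score groups, logits, y-bar, and V).
import math
from collections import Counter

L = 24                                               # items remaining in the test
raw_scores = [20, 13, 11, 10, 10, 9, 9, 9, 8, 7]     # examinee totals, as used in Table 2

freq = Counter(raw_scores)                           # r -> nr (person frequency)
rows = [(r, nr, math.log((r / L) / (1 - r / L)))     # yr = logit correct (step 20)
        for r, nr in sorted(freq.items())]

sum_n = sum(nr for _, nr, _ in rows)                 # Σnr = 10
sum_ny = sum(nr * y for _, nr, y in rows)            # Σnryr ≈ -2.2
sum_ny2 = sum(nr * y * y for _, nr, y in rows)       # Σnr(yr)² ≈ 4.9
y_bar = sum_ny / sum_n                               # y• ≈ -0.22
V = (sum_ny2 - sum_n * y_bar ** 2) / (sum_n - 1)     # V ≈ 0.49

for r, nr, y in rows:
    print(r, nr, round(y, 2))                        # initial person measure bro = yr
print(round(y_bar, 2), round(V, 2))
```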
Table 2. Grouped Distribution of Observed Examinee Scores on the 24-Item Mathematical Problem Solving Test

Possible Score (r) | Person Frequency (nr) | Proportion Correct (ρr) | Logit Correct (yr) | Frequency × Logit (nryr) | Frequency × Logit² (nr(yr)²) | Initial Person Measure (bro = yr)
7 | 1 | 0.29 | -0.89 | -0.89 | 0.79 | -0.89
8 | 1 | 0.33 | -0.69 | -0.69 | 0.48 | -0.69
9 | 3 | 0.38 | -0.51 | -1.53 | 0.78 | -0.51
10 | 2 | 0.42 | -0.34 | -0.67 | 0.23 | -0.34
11 | 1 | 0.46 | -0.17 | -0.17 | 0.03 | -0.17
12 | 0 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00
13 | 1 | 0.54 | 0.17 | 0.17 | 0.03 | 0.17
14 | 0 | 0.58 | 0.34 | 0.00 | 0.00 | 0.34
15 | 0 | 0.63 | 0.51 | 0.00 | 0.00 | 0.51
16 | 0 | 0.67 | 0.69 | 0.00 | 0.00 | 0.69
17 | 0 | 0.71 | 0.89 | 0.00 | 0.00 | 0.89
18 | 0 | 0.75 | 1.10 | 0.00 | 0.00 | 1.10
19 | 0 | 0.79 | 1.34 | 0.00 | 0.00 | 1.34
20 | 1 | 0.83 | 1.61 | 1.61 | 2.59 | 1.61
Totals | Σnr = 10 | | | Σnryr = -2.18 | Σnr(yr)² = 4.92 |

y• = Σnryr / Σnr = -2.18 / 10 = -0.22

V = [Σnr(yr)² − (Σnr)(y•)²] / (Σnr − 1) = [4.92 − (10)(-0.22)²] / (10 − 1) = 0.49
Final Estimates of Item Difficulty

25. Compute the sample spread expansion factor (Y):

Y = √[(1 + V/2.89) / (1 − UV/8.35)] = √[(1 + 0.49/2.89) / (1 − (0.77)(0.49)/8.35)] = 1.11

where V = 0.49 (from Table 2) and U = 0.77 (from Table 1).
26. Multiply the expansion factor (Y) by the initial item calibration (doi) to obtain the corrected item calibration: di = Y(doi). The item score group index, item name, and initial item calibration are taken from Table 1.
27. Compute the calibration standard error (SE) for each item score:

SE(di) = Y √[N / (si(N − si))]
Table 3. Final Estimates of Item Difficulties from 10 Examinees

Item Score Group Index (i) | Item Name | Initial Item Calibration (doi) | Sample Spread Expansion Factor (Y) | Corrected Item Calibration (di = Ydoi) | Item Score (si) | Calibration Standard Error SE(di)
1 | 7 | -1.87 | 1.11 | -2.07 | 8 | 0.878
2 | 3, 13, 15, 19 | -0.89 | 1.11 | -0.98 | 6 | 0.717
3 | 1, 8, 14, 20, 22, 23 | -0.48 | 1.11 | -0.53 | 5 | 0.702
4 | 6, 10, 17 | -0.07 | 1.11 | -0.08 | 4 | 0.717
5 | 2, 5, 11, 12 | 0.37 | 1.11 | 0.41 | 3 | 0.766
6 | 9, 18, 21, 24 | 0.91 | 1.11 | 1.01 | 2 | 0.878
7 | 4, 16 | 1.72 | 1.11 | 1.91 | 1 | 1.170

Note: N = 10.
Final Estimates of Person Measures

28. Compute the test width expansion factor (X):

X = √[(1 + U/2.89) / (1 − UV/8.35)] = √[(1 + 0.77/2.89) / (1 − (0.77)(0.49)/8.35)] = 1.18

where V = 0.49 (from Table 2) and U = 0.77 (from Table 1).

29. Multiply the expansion factor (X) by each initial measure (bro) to obtain the corrected measure: br = X(bro). The possible score and initial measure are taken from Table 2.
30. Compute the measure standard error (SE) for each possible score:

SE(br) = X √[L / (r(L − r))]
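The sketch below (an illustration, not the chapter's worked solution) strings steps 25 to 30 together. Note that with U = 0.77 and V = 0.49 the expansion factor X works out to about 1.15, slightly smaller than the 1.18 used in the table that follows, which appears to reflect rounding in the hand computation.

```python
# A minimal sketch of steps 25-30: expansion factors, corrected calibrations and
# measures, and their standard errors, using U, V, N, and L from the worked example.
import math

N, L = 10, 24
U, V = 0.77, 0.49

Y = math.sqrt((1 + V / 2.89) / (1 - U * V / 8.35))   # sample spread expansion factor ≈ 1.11
X = math.sqrt((1 + U / 2.89) / (1 - U * V / 8.35))   # test width expansion factor ≈ 1.15

def corrected_item(d_initial: float, s_i: int):
    """Corrected item calibration di = Y*doi and its standard error SE(di)."""
    return Y * d_initial, Y * math.sqrt(N / (s_i * (N - s_i)))

def corrected_person(b_initial: float, r: int):
    """Corrected person measure br = X*bro and its standard error SE(br)."""
    return X * b_initial, X * math.sqrt(L / (r * (L - r)))

print(corrected_item(-1.87, 8))    # item 7 group: about (-2.07, 0.88), as in Table 3
print(corrected_person(-0.89, 7))  # raw score 7: about (-1.02, 0.52); the table, using
                                   # X = 1.18, reports -1.05 and 0.53
```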
Possible Score (r) | Initial Measure (bro) | Test Width Expansion Factor (X) | Corrected Measure (br = Xbro) | Measure Standard Error | nr
7 | -0.89 | 1.18 | -1.05 | 0.53 | 1
8 | -0.69 | 1.18 | -0.82 | 0.51 | 1
9 | -0.51 | 1.18 | -0.60 | 0.50 | 3
10 | -0.34 | 1.18 | -0.40 | 0.49 | 2
11 | -0.17 | 1.18 | -0.20 | 0.48 | 1
12 | 0.00 | 1.18 | 0.00 | 0.48 | 0
13 | 0.17 | 1.18 | 0.20 | 0.48 | 1
14 | 0.34 | 1.18 | 0.40 | 0.49 | 0
15 | 0.51 | 1.18 | 0.60 | 0.50 | 0
16 | 0.69 | 1.18 | 0.82 | 0.51 | 0
17 | 0.89 | 1.18 | 1.05 | 0.53 | 0
18 | 1.10 | 1.18 | 1.30 | 0.56 | 0
19 | 1.34 | 1.18 | 1.57 | 0.59 | 0
20 | 1.61 | 1.18 | 1.90 | 0.65 | 1
Figure 3. Item Map for the Calibrated Item Difficulty and Person Ability. (The map places the calibrated item difficulties on the left side and the persons, by score, on the right side of a common logit scale running from about -2.07 to 1.91. Items located where δ < θ do not require high ability, items located where δ > θ require high ability, and a z value beside each item indicates its fit.)
Figure 3 shows the item map of calibrated item difficulty (left side) and person ability (right side) across their logit values. Observe that as the items become more difficult (increasing logits), the person with the highest score (high ability) is matched closely with those items. This match is termed goodness of fit in the Rasch model. A good fit indicates that difficult items require high ability to be answered correctly; more specifically, the match in the logits of person ability and item difficulty indicates goodness of fit. In this case, the goodness of fit of the item difficulties is estimated using the z value: low and nonsignificant z values indicate a good fit between item difficulty and person ability.
EMPIRICAL REPORT

The Application of a One-Parameter IRT Model on a Test of Mathematical Problem Solving

Carlo Magno and Chang Young Hai

Abstract

The purpose of this research was to examine the validity of a Mathematical Problem Solving Test for fourth year high school students and to compare traditional and Rasch-based ability scores. The Mathematical Problem Solving Test was administered to 31 fourth year high school students studying in two Chinese schools, and the data were submitted to Rasch analysis. Traditional and Rasch-based scores for the sample were submitted to analyses of variance by comparing log and SE values across the test. Twenty-two items demonstrated acceptable model fit. The Rasch model accounted for 26% of the variance in the responses to the remaining items. The findings generally support the test's validity. Finally, the results suggest further exploring the dimensionality of problem solving as a construct.
Problem solving has a special importance in the study of mathematics. A primary goal of mathematics teaching and learning is to develop the ability to solve a wide variety of complex mathematics problems. Stanic and Kilpatrick (1988) traced the role of problem solving in school mathematics and illustrated a rich history of the topic. To many mathematically literate people, mathematics is synonymous with solving problems-doing word problems, creating patterns, interpreting figures, developing geometric constructions, proving theorems, etc. The rhetoric of problem solving has been so pervasive in the mathematics education of the 1980s and 1990s that creative speakers and writers can put a twist on whatever topic or activity they have in mind to call it problem solving. Every exercise of problem solving research has gone through some agony of defining mathematics problem solving. Reitman (1965) defined a problem as when you have been given the description of something but do not yet have anything that satisfies that
description. Reitman's discussion described a problem solver as a person perceiving and accepting a goal without an immediate means of reaching the goal. Henderson and Pingry (1953) wrote that to be problem solving there must be a goal, a blocking of that goal for the individual, and acceptance of that goal by the individual. What is a problem for one student may not be a problem for another, either because there is no blocking or no acceptance of the goal. Schoenfeld (1985) also pointed out that defining what is a problem is always relative to the individual. The measure of mathematical ability through problem solving is subject to fluctuations, as is any other ability construct. Because of these fluctuations, the measures of person ability and item difficulty need to be calibrated in a logistic model. An analysis that offers this calibration is the one-parameter Rasch model.

Research on Problem Solving

Various research methodologies are used in mathematics education research, including a clinical approach that is frequently used to study problem solving. Typically, mathematical tasks or problem situations are devised, and students are studied as they perform the tasks. Often they are asked to talk aloud while working, or they are interviewed and asked to reflect on their experience and especially their thinking processes. Waters (1984) discusses the advantages and disadvantages of four different methods of measuring strategy use involving a clinical approach. Schoenfeld (1983) describes how a clinical approach may be used with pairs of students in an interview. He indicates that "dialog between students often serves to make managerial decisions overt, whereas such decisions are rarely overt in single student protocols."
The basis for most mathematics problem solving research for secondary school students in the past 31 years can be found in the writings of Polya (1973, 1962, 1965), the field of cognitive psychology, and specifically in cognitive science. Cognitive psychologists and cognitive scientists seek to develop or validate theories of human learning (Frederiksen, 1984) whereas mathematics educators seek to understand how their students interact with mathematics (Schoenfeld, 1985; Silver, 1987). The area of cognitive science has particularly relied on computer simulations of problem solving (25,50). If a computer program generates a sequence of behaviors similar to the sequence for human subjects, then that program is a model or theory of the behavior. Newell and Simon (1972), Larkin (1980), and Bobrow (1964) have provided simulations of mathematical problem solving. These simulations may be used to better understand mathematics problem solving. Constructivist theories have received considerable acceptance in mathematics education in recent years. In the constructivist perspective, the learner must be actively involved in the construction of one's own knowledge rather than passively receiving knowledge. The teacher's responsibility is to arrange situations and contexts within which the learner constructs appropriate knowledge (Steffe & Wood, 1990; von Glasersfeld, 1989). Even though the constructivist view of mathematics learning is appealing and the theory has formed the basis for many studies at the elementary level, research at the secondary level is lacking. However, constructivism is consistent with current cognitive theories of problem solving and mathematical views of problem solving involving exploration, pattern finding, and mathematical thinking (Schoenfeld, 1988; Kaput, 1979; National Council of Supervisors of Mathematics, 1978) thus teachers are urged and teacher educators become familiar with constructivist views and evaluate these views for restructuring their approaches to teaching, learning, and research dealing with problem solving.
Problem Solving as a Process Garofola and Lester (1985) have suggested that students are largely unaware of the processes involved in problem solving and that addressing this issue within problem solving instruction may be important. Domain Specific Knowledge. To become a good problem solver in mathematics, one must develop a base of mathematics knowledge. How effective one is in organizing that knowledge also contributes to successful problem solving. Kantowski (1974) found that those students with a good knowledge base were most able to use the heuristics in geometry instruction. Schoenfeld and Herrmann (1982) found that novices attended to surface features of problems whereas experts categorized problems on the basis of the fundamental principles involved. Silver (1987) found that successful problem solvers were more likely to categorize math problems on the basis of their underlying similarities in mathematical structure. Wilson (1967) found that general heuristics had utility only when preceded by task specific heuristics. The task specific heuristics were often specific to the problem domain, such as the tactic most students develop in working with trigonometric identities to "convert all expressions to functions of sine and cosine and do algebraic simplification." Algorithms. An algorithm is a procedure, applicable to a particular type of exercise, which, if followed correctly, is guaranteed to give you the answer to the exercise. Algorithms are important in mathematics and our instruction must develop them but the process of carrying out an algorithm, even a complicated one, is not problem solving. The process of creating an algorithm, however, and generalizing it to a specific set of applications can be problem solving. Thus problem solving can be incorporated into the curriculum by having students create their own algorithms. Research involving this approach is currently more
prevalent at the elementary level within the context of constructivist theories. Heuristics. Heuristics are kinds of information, available to students in making decisions during problem solving, that are aids to the generation of a solution, plausible in nature rather than prescriptive, seldom providing infallible guidance, and variable in results. Somewhat synonymous terms are strategies, techniques, and rules-of-thumb. For example, admonitions to "simplify an algebraic expression by removing parentheses," to "make a table," to "restate the problem in your own words," or to "draw a figure to suggest the line of argument for a proof" are heuristic in nature. Out of context, they have no particular value, but incorporated into situations of doing mathematics they can be quite powerful (Polya, 1973; Polya, 1962; Polya, 1965). Theories of mathematics problem solving (Newell & Simon, 1972; Schoenfeld, 1985; Wilson, 1967) have placed a major focus on the role of heuristics. Surely it seems that providing explicit instruction on the development and use of heuristics should enhance problem solving performance; yet it is not that simple. Schoenfeld (1985) and Lesh (1981) have pointed out the limitations of such a simplistic analysis. Theories must be enlarged to incorporate classroom contexts, past knowledge and experience, and beliefs. What Polya (1967) describes in How to Solve It is far more complex than any theories we have developed so far. Mathematics instruction stressing heuristic processes has been the focus of several studies. Kantowski (1977) used heuristic instruction to enhance the geometry problem solving performance of secondary school students. Wilson (1967) and Smith (1974) examined contrasts of general and task specific heuristics. These studies revealed that task specific hueristic instruction was more effective than general hueristic instruction. Jensen (1984) used the heuristic of subgoal generation to enable students to form problem solving plans. He used thinking aloud, peer interaction, playing
the role of teacher, and direct instruction to develop students' abilities to generate subgoals. It is useful to develop a framework to think about the processes involved in mathematics problem solving. Most formulations of a problem solving framework in U. S. textbooks attribute some relationship to Polya's (1973) problem solving stages. However, it is important to note that Polya's "stages" were more flexible than the "steps" often delineated in textbooks. These stages were described as understanding the problem, making a plan, carrying out the plan, and looking back. According to Polya (1965), problem solving was a major theme of doing mathematics and "teaching students to think" was of primary importance. "How to think" is a theme that underlies much of genuine inquiry and problem solving in mathematics. However, care must be taken so that efforts to teach students "how to think" in mathematics problem solving do not get transformed into teaching "what to think" or "what to do." This is, in particular, a byproduct of an emphasis on procedural knowledge about problem solving as seen in the linear frameworks of U. S. mathematics textbooks and the very limited problems/exercises included in lessons. Clearly, the linear nature of the models used in numerous textbooks does not promote the spirit of Polya's stages and his goal of teaching students to think. By their nature, all of these traditional models have the following defects: 1. They depict problem solving as a linear process. 2. They present problem solving as a series of steps. 3. They imply that solving mathematics problems is a procedure to be memorized, practiced, and habituated. 4. They lead to an emphasis on answer getting. These linear formulations are not very consistent with genuine problem solving activity. They may, however, be consistent with how experienced problem solvers present their solutions and answers after the problem solving
is completed. In an analogous way, mathematicians present their proofs in very concise terms, but the most elegant of proofs may fail to convey the dynamic inquiry that went on in constructing the proof. Another aspect of problem solving that is seldom included in textbooks is problem posing, or problem formulation. Although there has been little research in this area, this activity has been gaining considerable attention in U. S. mathematics education in recent years. Brown and Walter (1983) have provided the major work on problem posing. Indeed, the examples and strategies they illustrate show a powerful and dynamic side to problem posing activities. Polya (1972) did not talk specifically about problem posing, but much of the spirit and format of problem posing is included in his illustrations of looking back. A framework is needed that emphasizes the dynamic and cyclic nature of genuine problem solving. A student may begin with a problem and engage in thought and activity to understand it. The student attempts to make a plan and in the process may discover a need to understand the problem better. Or when a plan has been formed, the student may attempt to carry it out and be unable to do so. The next activity may be attempting to make a new plan, or going back to develop a new understanding of the problem, or posing a new (possibly related) problem to work on. Problem solving abilities, beliefs, attitudes, and performance develop in contexts (Schoenfeld, 1988) and those contexts must be studied as well as specific problem solving activities. Rasch Analysis Rasch analysis (Bond & Fox, 2001; Rasch, 1980; Wright & Stone, 1979) offers potential advantages over the traditional psychometric methods of classical test theory. It has been widely applied in health status assessment (e.g., Antonucci, Aprile, & Paulucci, 2002; Duncan, Bode, Lai, & Perera, 2003; Fortinsky, Garcia, Sheenan, Madigan, & Tullai-
McGuinness, 2003; Lai, Cella, Chang, Bode, & Heinemann, 2003; Linacre, Heinemann, Wright, Granger, & Hamilton, 1994; Velozo, Magalhaes, Pan, & Leiter, 1995; Ware, Bjorner, & Kosinski, 2000) but has rarely been used in mathematical problem solving assessment (Willmes, 1981, 1992). Its primary advantages include the interval nature of the measures it provides and the theoretical independence of item difficulty and person ability scores from the particular samples used to estimate them. The Rasch model, also referred to in the item response theory literature as the oneparameter logistic model, estimates the probability of a correct response to a given item as a function of item difficulty and person ability. The primary output of Rasch analysis is a set of item difficulty and person ability values placed along a single interval scale. Items with higher difficulty scores are less likely to be answered correctly, and items with lower scores are more likely to elicit correct responses. By the same token, persons with higher ability are more likely to provide correct responses, and those with lower ability are less likely to do so. Rasch analysis (a) estimates the difficulty of dichotomous items as the natural logarithm of the odds of answering each item correctly (a log odds, or logit score), (b) typically scales these estimates to mean = 0, and then (c) estimates person ability scores on the same scale. In analysis of dichotomous items, item difficulty and person ability are defined such that when they are equal, there is a 50% chance of a correct response. As person ability exceeds item difficulty, the chance of a correct response increases as a logistic ogive function, and as item difficulty exceeds person ability, the chance of success decreases. The formal relationship among response probability, person ability, and item difficulty is given in the mathematical equation by Bond and Fox (2001, p. 201). A graphic plot of this relationship, known as the item characteristic curve (ICC), is given for three items of different difficulty levels. One useful feature of the Rasch model is referred to as parameter separation or specific
objectivity (Bond & Fox, 2001; Embretson & Reise, 2000). The implication of this mathematical property is that, at least in theory, item difficulty values do not depend on the person sample used to estimate them, nor do person ability scores depend on the particular items used to estimate them. In practical terms, this means that given well-calibrated sets of items that fit the Rasch model, robust and directly comparable ability estimates may be obtained from different subsets of items. This, in turn, facilitates both adaptive testing and the equating of scores obtained from different instruments (Bond & Fox, 2001; Embretson & Reise, 2000). Rasch theory makes a number of explicit assumptions about the construct to be measured and the items used to measure it, two of which have already been discussed above. The first is that all test items respond to the same unidimensional construct. One set of tools for examining the extent to which test items approximate unidimensionality are the fit statistics provided by Rasch analysis. These fit statistics indicate the amount of variation between model expectations and observations. They identify items and people eliciting unexpected responses, such as when a person of high ability responds incorrectly to an easy question, perhaps because of carelessness or because of a poorly constructed or administered item. Fit statistics can be informative with respect to dimensionality because they indicate when different people may be responding to different aspects of an item's content or the testing situation. A second key assumption of Rasch analysis, also mentioned above, is that individuals can be placed on an ordered continuum along the dimension of interest, from those having less ability to those having more (Bond & Fox, 2001). Similarly, the analysis assumes that items may be placed on the same scale, from those requiring less ability to those requiring more. A third assumption underlying Rasch analysis is that of local, or conditional, independence (Embretson & Reise, 2000; Wainer & Mislevy, 2000). This assumption
requires that individual items do not influence one another (i.e., they are uncorrelated, once the dimension of item difficulty-person ability is taken into account). Thus, no considerations of item content, beyond their difficulty values, are necessary for estimating person ability, and changing the order of item administration should not change item or person estimates. In mathematical terms, this assumption states that the probability of a string of responses is equal to the product of the individual probabilities of each of the separate responses comprising it. Failure to meet this assumption can suggest the presence of another dimension in the data. Local dependence is often a concern in the construction of reading comprehension tests that include multiple questions about the same passage, because responses to such questions may be determined not only by the difficulty of each individual item but also by the difficulty and content of the passage. Responses to items of this type are often intercorrelated even after their individual difficulties have been taken into account. To give another example, if a particular question occurring earlier in a test provides specific information about the answer to a later question, then these two items are also likely to demonstrate local dependence. A final important assumption of the Rasch model is that the slope of the item characteristic curve, also known as the item discrimination parameter, is equal to 1 for all items (Bond & Fox, 2001; Embretson & Reise, 2000; Wainer & Mislevy, 2000). This assumption is presented graphically in Figure 1, where all three curves are parallel with a slope equal to 1. The consequence of this assumption is that a given change in ability level will have the same effect on the log odds of a correct response for all items. Items that have different discrimination values, a given change in ability has different consequences for different items. When an item's discrimination parameter is high, a relatively small change in ability level results in a large change in response probability. When discrimination is low, larger changes in ability level are needed to change response probability.
A highly discriminating item (i.e., one with a high ICC slope) is more likely to result in different responses from two individuals of different ability levels, whereas an item with a low discrimination parameter (i.e., a low ICC slope) more often results in the same response from both. Rasch models have been shown to be robust to small and/or unsystematic violations of this assumption (Penfield, 2004; van de Vijver, 1986), but when the ICC slopes in an item set differ substantially and/or systematically from 1, the test developer is advised to reconsider the extent to which the offending items measure the relevant construct (Wright, 1991). An example of the use of the one-parameter Rasch model is the study by El-Korashy (1995), where the Rasch model was applied to the selection of items for an Arabic version of the Otis-Lennon Mental Ability Test. Correspondence of item calibration to person measurement indicated that the test is suitable for the range of mental ability intended to be measured. Another is the study by Lamprianou (2004) that analyzed data from three testing cycles of the National Curriculum tests in mathematics in England using the Rasch model. It was found that pupils having English as an additional language and pupils belonging to ethnic minorities are significantly more likely to generate aberrant response patterns. However, within the groups of pupils belonging to ethnic minorities, those who speak English as an additional language are not significantly more likely to generate misfitting response patterns. This may indicate that the ethnic background effect is more significant than the effect of the first language spoken. The results suggest that pupils having English as an additional language and pupils belonging to ethnic minorities are mismeasured significantly more than the remainder of pupils taking the mathematics National Curriculum tests. More research is needed to generalize the results to other subjects and contexts.

Purpose of the Study

1. In the current investigation, the Rasch model was used to analyze a set of Mathematical
Problem Solving data provided by a sample of fourth year high school students in two Chinese schools. One purpose of the study was to determine whether the construct validity of the test is supported by Rasch analysis. Specifically, it is hypothesized that the test responds to a cohesive unidimensional construct. Item fit statistics, a Rasch-based unidimensionality coefficient, and principal-components analysis of model residuals were used to evaluate this hypothesis. 2. To test the hypothesis that Rasch estimates of person ability, because of their status as interval-level measures, are more valid and sensitive than traditionally computed scores.

Method

Participants

The participants were 31 high school students from two different schools, UNO High School and Grace Christian High School. These two high schools were chosen for their popularity in molding high achievers in Mathematics. The participants were fourth year high school students, both male and female, belonging to the 16-18 age group. The decision to choose high school students was made because the high school educational system is much more regimented, and it can be safely assumed that any given fourth year student would have studied the lessons required of a third year student. Convenience sampling was used to select the respondents.

Instrument

Mathematical Problem Solving Test. The Mathematical Problem Solving Test was constructed to measure the problem solving ability of the students (see Appendix A). The test includes 25 items that cover third year high school lessons. Third year lessons were used because the participants would only be starting their fourth year in high school and might not have enough knowledge of fourth year math.
The coverage of the test includes fractions, factoring, simple algebraic equations, and various word problems. These topics are based on the Merle S. Alferez (MSA) Review Questions for All College Entrance Test (ACET) and University of the Philippines College Admissions Test (UPCAT), and the College Entrance Test Reviewer, third edition. A professor from the Mathematics Department of De La Salle University-Manila was asked to critique the items in the Mathematical Problem Solving Test. The item reviewer was given a copy of the Table of Specifications, which served to orient the reviewer about the nature of the items used in the test. The proponent then explained the purpose of the test so that the items could be revised to better fulfill the objectives of the exam. After the Mathematical Problem Solving Test was revised, it was pre-tested on 10 high school students from Saint Jude Catholic School to determine the length of time students need to answer the entire test. The Mathematical Problem Solving Test was then given to 31 fourth year high school students for pilot testing. The data from the pilot testing were used for reliability and item analysis. The Kuder-Richardson reliability was used to determine the internal consistency of the items of the Mathematical Problem Solving Test; this method was used to find the consistency of the responses across all the items in the test. The test has an internal consistency of .84 based on the KR #20. The distribution of scores is slightly negatively skewed, with a skewness of -.158, and has a kurtosis of -1.05. The overall mean of the test performance of the participants in the pilot test is 16.23 with a standard deviation of 5.45. This indicates that scores of 17 to 25 can be regarded as high in problem solving, while scores of 15 and below are below average. A standard deviation of 5.45 means that the individual scores are widely dispersed.
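As a brief illustration of the internal-consistency computation mentioned above (this is not the authors' code, and the demonstration matrix is randomly generated rather than the study's actual responses), KR #20 can be computed from a 0/1 response matrix as follows:

```python
# A minimal KR#20 sketch; the demo matrix is hypothetical, not the study's data.
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """KR#20 = k/(k-1) * (1 - sum(p*q) / variance of total scores)."""
    k = responses.shape[1]                           # number of items
    p = responses.mean(axis=0)                       # proportion correct per item
    q = 1 - p
    total_variance = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - (p * q).sum() / total_variance)

rng = np.random.default_rng(0)
demo = (rng.random((31, 25)) < 0.65).astype(int)     # 31 examinees, 25 items (hypothetical)
print(round(kr20(demo), 2))
```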
Procedure The Mathematical Problem Solving Test was administered to fourth Year High school students of two Chinese Schools in Manila. A letter requesting to administer the test was sent to the Math teacher. The mathematics teacher was given detailed instructions on how to administer the test. A copy of the instructions to be given to the students were provided so that the administration would be constant across situations. After administering the test the students and teachers were debriefed about the purpose of the study. Data Analysis To describe the distribution of the scores, the mean, standard deviation, kurtosis, and skewness were obtained. The reliability of the items were evaluated using the Kuder Richardson #20. Item Analysis was conducted using both Classical Test Theory (CTT) and Item Response Theory (IRT). In the CTT the item difficulty and item discrimination were determined using the proportion of the high group and the low group. Item difficulty is determined by getting the average proportion of correct responses between the high group and low group. The Item discrimination is determined by computing for the difference between the high group and the low group. The estimation of Rasch item difficulty and person ability scores and related analyses were carried out using WINSTEPS. This software package begins with provisional central estimates of item difficulty and person ability parameters, compares expected responses based on these estimates to the data, constructs new parameter estimates using maximum likelihood estimation, and then reiterates the analysis until the change between successive iterations is small enough to satisfy a preselected criterion value. The item parameter estimates are typically scaled to have M = 0, and person ability scores are estimated in reference to the item mean. A unit on this scale, a logit, represents the
change in ability or difficulty necessary to change the odds of a correct response by a factor of 2.718, the base of the natural logarithm. Persons who respond to all items correctly or incorrectly, and items to which all persons respond correctly or incorrectly, are uninformative with respect to item difficulty estimation and are thus excluded from the parameter estimation process.

Results

Item analysis was used to evaluate whether the items in the Mathematical Problem Solving Test are easy, average, or difficult. The difficulty of an item is based on the percentage of people who answered it correctly. The discrimination indices revealed that there are no marginal or bad items; 84% of the items are very good, 2% are good items, and 2% are reasonably good items. The difficulty index indicates for each item whether it is easy, average, or difficult and whether the items have an appropriate difficulty level. It was found that there are no difficult items, although 72% of the items are average and 28% are easy.

One-Parameter Rasch Model

When the test scores and abilities of the students on the Mathematical Problem Solving Test were calibrated, new reliability indices were obtained. The student (person) reliability was .50 with an RMSE of .52, and the item reliability was .34 with an RMSE of .82. The errors associated with these estimates are high, indicating that the data do not fit the expected ability and test difficulty well. Figure 1 shows the test characteristic curve generated by WINSTEPS. The computed separation for ability is 1.20, and for the items (expected score of 11) it is .73 when converted into a standardized estimate. These extreme values are adjusted by fine-tuning the slopes produced for each item. The characteristic curve shows that Items 5, 1, 6 and 4 have the probability of being
answered with low ability, while items 3, 7 and 2 require higher ability to get a correct response. The characteristic curve shows that Items 11, 13, 2 and 8 have the probability of being answered with low ability, while items 9, 10 and 4 require higher ability to get a correct response. The overlap between items 11 and 13 and between items 9 and 10 means that the same ability is required to have the probability of answering the item correctly. The characteristic curve shows that Items 16 and 19 have the probability of being answered with low ability, while items 17, 18, 20, 21 and 22 require higher ability to get a correct response. The overlap between items 18, 20, 21, and 22, and items 16 and 19 means that the same ability is required to have the probability of answering the item correctly. Items 23, 24 and 25 were excluded because of extreme responses.

Examination of Fit

The average INFIT statistic is 1.00 and the average OUTFIT statistic is .98, which indicates that the data for the items show goodness of fit because the values are less than 1.5, except for items 23, 24 and 25.

Unidimensionality Coefficient

To address the question of construct dimensionality, a Rasch unidimensionality coefficient was calculated. This coefficient was calculated as the ratio of the person separation reliability estimated using model standard errors (which treat model misfit as random variation) to the person separation reliability estimated using real standard errors (which regard misfit as true departure from the unidimensional model; Wright, 1994). The closer the value of the coefficient to 1.0, the more closely the data approximate unidimensionality. The unidimensionality coefficient for the current data set was .61 (the ratio of the .73 and 1.20 separation values), which is marginal relative to 1.00. This means that the data might form more than one dimension. Principal components analysis shows that there can possibly be 7 factors that can be
formed with the items excluding item 25 with no variation as indicated in the scree plot. Principal-components analysis of model residuals conducted for the 24-item pool (after exclusion of the seven misfitting items) revealed that 26.97% of the variance in the observations was accounted for by the Rasch dimension of item difficulty-person ability. The next largest factor extracted accounted for only 4.86% of the remaining variance. The log functions for each item shows large standard errors. This supports the principal components analysis that there might be factors formed out of the 22 items. Discussion The present results generally support the construct and content validity of the Mathematical problem solving test. First, the acceptable fit of the 22 test items to the Rasch model and the marginal unidimensionality coefficient (.61) support the hypothesis that the RTT measures a unidimensional construct. Furthermore, acceptable item and person separation indices and reliability coefficients suggest that the parameter estimates obtained in the current study are both reproducible and useful for differentiating items and persons from one another. In addition, principal-components analysis of Rasch model residuals (with the two misfitting items excluded) indicated that the dimension of person ability-item difficulty accounted for the majority of the variance in the data (26.97%) and the next largest factor extracted accounted for very little additional variance (4.86%). Although this does not provide further support for the unidimensionality of the test. The pattern of item difficulty across subtests was consistent with item content and similar for values derived by Rasch analysis and traditional methods. As expected, based on increasing lexical load, the results showed variation in the difficulty. There are more items that can be answered requiring low ability
because of the additional lexical load imposed by the inclusion of size adjectives. Aspects of the tests validity were supported by the present analyses. First, two items were only excluded because of poor model fit. Perhaps participants were not generally able to figure out the proper response strategy by the end of the test (because of the provision of repeats and cues) and were then able to effectively implement problem solving strategies. If this is correct, then eliminating these items should introduce misfit for the items of this type. The two other items that were excluded because of poor model fit were the last test item, which differs from the earlier items in that it contains two-part commands and requires responses using more skills. This suggests that initial responses to different kinds of commands might be determined in part by another construct, for example, ability to switch set. A second aspect of the test’s validity that the present analysis failed to confirm concerns the homogeneity of item difficulty within subtests. The differences between the parameter estimates within the items suggest that they are not necessarily homogeneous with respect to difficulty. The present finding might have been in part the result of a relatively small and poorly targeted sample. A larger sample with a broader distribution might obtain less item variability. Although sample sizes of approximately 100 have been argued to produce stable item parameter estimates (Linacre, 1994; van de Vijver, 1986), larger samples are preferable. Willmes's (1981) prior finding suggests that the present result may be reliable, but his participant sample was similarly sized, if perhaps better targeted. References Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 41, 17-116. Antonucci, G., Aprile, T., & Paulucci, S. (2002). Rasch analysis of the Rivermead Mobility Index: A study using mobility measures of first-stroke inpatients. Archives of Physical Medicine and Rehabilitation, 83, 1442-1449.
115 Arvedson, J. C., McNeil, M. R., & West, T. L. (1986). Prediction of Revised Token Test overall, subtest, and linguistic unit scores by two shortened versions. Clinical Aphasiology, 16, 57-63. Blackwell, A., & Bates, E. (1995). Inducing agrammatic profiles in normals: Evidence for the selective vulnerability of morphology under cognitive resource limitation. Journal of Cognitive Neuroscience, 7, 228257. Bobrow, D. G. (1964). Natural language input for a computer problem solving system. Unpublished doctoral dissertation, Massachusetts Institute of Technology, Boston. Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum. Briggs, D. C., & Wilson, M. (2003). An introduction to multidimensional measurement using Rasch models. Journal of Applied Measurement, 4, 87-100. Brown, S. I. & Walter, M. I. (1983). The art of problem posing. Hillsdale, NJ: Lawrence Erlbaum. Chang, W-C., & Chan, C. (1995). Rasch analysis for outcomes measures: Some methodological considerations. Archives of Physical Medicine and Rehabilitation, 76, 934-939. Cliff, N. (1992). Abstract measurement theory and the revolution that never happened. Psychological Science, 3, 186-190. DiSimoni, F. G., Keith, R. L., & Darley, F. L. (1980). Prediction of PICA overall score by short versions of the test. Journal of Speech and Hearing Research, 23, 511-516. Duffy, J. R., & Dale, B. J. (1977). The PICA scoring scale: Do its statistical shortcomings cause clinical problems? In R. H. Brookshire (Ed.), Collected proceedings from clinical aphasiology (pp. 290-296). Minneapolis, MN: BRK. Duncan, P. W., Bode, R., Lai, S. M., & Perera, S. (2003). Rasch analysis of a new stroke-specific outcome scale: The Stroke Impact Scale. Archives of Physical Medicine and Rehabilitation, 84, 950-963. Efron, B., & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1, 54-77. El-Korashy, A. (1995). Applying the Rasch model to the selection of items for a mental ability test. Educational and Psychological Measurement, 55, 753. Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum. Fischer, G. H., & Molenaar, I. W. (1995). Rasch models: Foundations, recent developments and applications. New York: Springer. Fortinsky, R. H., Garcia, R. I., Sheenan, T. J., Madigan, E. A., & Tullai McGuinness, S. (2003). Measuring disability in Medicare home care patients: Application
of Rasch modeling to the Outcome and Assessment Information Set. Medical Care, 41, 601-615. Frederiksen, N. (1984). Implications of cognitive theory for instruction in problem solving. Review of Educational Research, 54, 363-407. Freed, D. B., Marshall, R. C., & Chulantseff, E. A. (1996). Picture naming variability: A methodological consideration of inconsistent naming responses in fluent and nonfluent aphasia. In R. H. Brookshire (Ed.), Clinical aphasiology conference (pp. 193-205). Austin, TX: Pro-Ed. Garfola, J. & Lester, F. K. (1985). Metacognition, cognitive monitoring, and mathematical performance. Journal for Research in Mathematics Education, 16, 163-176. Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill. Henderson, K. B. & Pingry, R. E. (1953). Problem solving in mathematics. In H. F. Fehr (Ed.), The learning of mathematics: Its theory and practice (21st Yearbook of the National Council of Teachers of Mathematics) (pp. 228-270). Washington, DC: National Council of Teachers of Mathematics. Hobart, J. C. (2002). Measuring disease impact in disabling neurological conditions: Are patients' perspectives and scientific rigor compatible? Current Opinions in Neurology, 15, 721-724. Howard, D., Patterson, K., Franklin, S., Morton, J., & Orchard-Lisle, V. (1984). Variability and consistency in naming by aphasic patients. Advances in Neurology, 42, 263-276. Jensen, R. (1984). A multifaceted instructional approach for developing subgoal generation skills. Unpublished doctoral dissertation, The University of Georgia. Kahneman, D. (1973). Attention and effort. Englewood Cliffs, NJ: Prentice-Hall. Kantowski, M. G. (1974). Processes involved in mathematical problem solving. Unpublished doctoral dissertation, The University of Georgia, Athens. Kantowski, M. G. (1977). Processes involved in mathematical problem solving. Journal for Research in Mathematics Education, 8, 163-180. Kaput, J. J. (1979). Mathematics learning: Roots of epistemological status. In J. Lochhead and J. Clement (Eds.), Cognitive process instruction. Philadelphia, PA: Franklin Institute Press. Lai, J-S., Cella, D., Chang, C. H., Bode, R., & Heinemann, A. W. (2003). Item banking to improve, shorten and computerize self-reported fatigue: An illustration of steps to create a core item bank from the FACITFatigue Scale. Quality of Life Research, 12, 485-501. Lamprianou, I. & Boyle, B. (2004). Accuracy of Measurement in the Context of Mathematics National Curriculum Tests in England for Ethnic Minority Pupils and Pupils Who Speak English as an Additional Language. JEM, 41, 239-251.
Larkin, J. (1980). Teaching problem solving in physics: The psychological laboratory and the practical classroom. In F. Reif & D. Tuma (Eds.), Problem solving in education: Issues in teaching and research. Hillsdale, NJ: Lawrence Erlbaum. Lesh, R. (1981). Applied mathematical problem solving. Educational Studies in Mathematics, 12(2), 235-265. Linacre, J. M. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7, 328. Linacre, J. M. (1998). Structure in Rasch residuals: Why principal components analysis? Rasch Measurement Transactions, 12, 636. Linacre, J. M. (2002). Facets, factors, elements and levels. Rasch Measurement Transactions, 16, 880. Linacre, J. M., & Wright, B. D. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370. Linacre, J. M., & Wright, B. D. (2003). WINSTEPS: Multiple-choice, rating scale, and partial credit Rasch analysis [Computer software]. Chicago: MESA Press. Linacre, J. M., Heinemann, A. W., Wright, B., Granger, C. V., & Hamilton, B. B. (1994). The structure and stability of the Functional Independence Measure. Archives of Physical Medicine and Rehabilitation, 75, 127-132. Lord, F. M., Novick, M. R., & Birnbaum, A. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley. Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1, 1-27. Lumsden, J. (1978). Tests are perfectly reliable. British Journal of Mathematical and Statistical Psychology, 31, 19-26. Masters, G. (1993). Undesirable item discrimination. Rasch Measurement Transactions, 7, 289. McHorney, C. A., Haley, S. M., & Ware, J. E. (1997). Evaluation of the MOS SF-36 physical functioning scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. Journal of Clinical Epidemiology, 50, 451-461. McNeil, M. R. (1988). Aphasia in the adult. In N. J. Lass, L. V. McReynolds, J. Northern, & D. E. Yoder (Eds.), Handbook of speech-language pathology and audiology (pp. 738-786). Toronto, Ontario, Canada: D. C. Becker. McNeil, M. R., & Hageman, C. F. (1979). Auditory processing deficits in aphasia evidenced on the Revised Token Test: Incidence and prediction of across subtest and across item within subtest patterns. In R. H. Brookshire (Ed.), Clinical aphasiology conference proceedings (pp. 47-69). Minneapolis, MN: BRK.
Merbitz, C., Morris, J., & Grip, J. C. (1989). Ordinal scales and foundations of misinference. Archives of Physical Medicine and Rehabilitation, 70, 308-312. Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Erlbaum. Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88, 355-383. Michell, J. (2004). Item response models, pathological science, and the shape of error. Theory and Psychology, 14, 121-129. National Council of Supervisors of Mathematics. (1978). Position paper on basic mathematical skills. Mathematics Teacher, 71(2), 147-52. (Reprinted from position paper distributed to members January 1977.) Newell, A. & Simon, H. A. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice Hall. Norquist, J. M., Fitzpatrick, R., Dawson, J., & Jenkinson, C. (2004). Comparing alternative Rasch-based methods vs. raw scores in measuring change in health. Medical Care, 42, 125-136. Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill. Orgass, B. (1976). Eine Revision des Token Tests, Teil I und II [A revision of the token tests, Part I and II]. Diagnostica, 22, 70-87. Penfield, R. D. (2004). The impact of model misfit on partial credit model parameter estimates. Journal of Applied Measurement, 5, 115-128. Polya, G. (1962). Mathematical discovery: On understanding, learning and teaching problem solving (vol. 1). New York: Wiley. Polya, G. (1965). Mathematical discovery: On understanding, learning and teaching problem solving (vol. 2). New York: Wiley. Polya, G. (1973). How to solve it. Princeton, NJ: Princeton University Press. (Originally copyrighted in 1945). Porch, B. (2001). Porch Index of Communicative Ability. Albuquerque, NM: PICA Programs. Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original work published 1960) Reitman, W. R. (1965). Cognition and thought. New York: Wiley. Schoenfeld, A. H. (1983). Episodes and executive decisions in mathematics problem solving. In R. Lesh & M. Landau, Acquisition of mathematics concepts and processes. New York: Academic Press Schoenfeld, A. H. (1985). Mathematical problem solving. Orlando, FL: Academic Press. Schoenfeld, A. H. (1988). When good teaching leads to bad results: The disasters of "well taught" mathematics classes. Educational Psychologist, 23, 145-166. Schoenfeld, A. H., & Herrmann, D. (1982). Problem perception and knowledge structure in expert and
novice mathematical problem solvers. Journal of Experimental Psychology: Learning, Memory and Cognition, 8, 484-494. Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61, 331-354. Silver, E. A. (1987). Foundations of cognitive theory and research for mathematics problem-solving instruction. In A. H. Schoenfeld (Ed.), Cognitive science and mathematics education (pp. 33-60). Hillsdale, NJ: Lawrence Erlbaum. Smith, J. P. (1974). The effects of general versus specific heuristics in mathematical problem-solving tasks (Columbia University, 1973). Dissertation Abstracts International, 34, 2400A. Smith, R. M. (1986). Person fit in the Rasch model. Educational and Psychological Measurement, 46, 359-372. Stanic, G., & Kilpatrick, J. (1988). Historical perspectives on problem solving in the mathematics curriculum. In R. I. Charles & E. A. Silver (Eds.), The teaching and assessing of mathematical problem solving (pp. 1-22). Reston, VA: National Council of Teachers of Mathematics. Steffe, L. P., & Wood, T. (Eds.). (1990). Transforming children's mathematical education. Hillsdale, NJ: Lawrence Erlbaum. Stevens, S. S. (1946, June 7). On the theory of scales of measurement. Science, 103, 677-680. van de Vijver, F. J. R. (1986). The robustness of Rasch estimates. Applied Psychological Measurement, 10, 45-57. Velozo, C. A., Magalhaes, L. C., Pan, A.-W., & Leiter, P. (1995). Functional scale discrimination at admission and discharge: Rasch analysis of the Level of Rehabilitation Scale-III. Archives of Physical Medicine and Rehabilitation, 76, 705-712. von Glasersfeld, E. (1989). Constructivism in education. In T. Husen & T. N. Postlethwaite (Eds.), The international encyclopedia of education (pp. 162-163) (Suppl. Vol. I). New York: Pergamon. Wainer, H., & Mislevy, R. J. (2000). Item response theory, item calibration, and proficiency estimation. In H. Wainer, N. J. Dorans, D. Eignor, R. Flaugher, B. F. Green, & R. J. Mislevy, et al. (Eds.), Computerized adaptive testing: A primer (2nd ed., pp. 61-100). Mahwah, NJ: Erlbaum. Wainer, H., Dorans, N. J., Eignor, D., Flaugher, R., Green, B. F., Mislevy, R. J., et al. (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Erlbaum. Ware, J. E., Bjorner, J. B., & Kosinski, M. (2000). Practical implications of item response theory and computerized adaptive testing: A brief summary of ongoing studies of widely used headache impact scales. Medical Care, 38, II73-II82.
Waters, W. (1984). Concept acquisition tasks. In G. A. Goldin & C. E. McClintock (Eds.), Task variables in mathematical problem solving (pp. 277-296). Philadelphia, PA: Franklin Institute Press. Willmes, K. (1981). A new look at the Token Test using probabilistic test models. Neuropsychologia, 19, 631-645. Willmes, K. (1992). Psychometric evaluation of neuropsychological test performances. In N. von Steinbuechel, D. Y. Cramon, & E. Poeppel (Eds.), Neuropsychological rehabilitation (pp. 103-113). Heidelberg, Germany: Springer-Verlag. Willmes, K. (2003). Psychometric issues in aphasia therapy research. In I. Papathanasiou & R. De Bleser (Eds.), The sciences of aphasia: From theory to therapy (pp. 227-244). Amsterdam: Pergamon. Wilson, J. W. (1967). Generality of heuristics as an instructional variable. Unpublished doctoral dissertation, Stanford University, Stanford, CA. Wright, B. D. (1991). IRT in the 1990's: Which models work best? Rasch Measurement Transactions, 6, 196-200. Wright, B. D. (1994). A Rasch unidimensionality coefficient. Rasch Measurement Transactions, 8, 385. Wright, B. D. (1996). Local dependency, correlations and principal components. Rasch Measurement Transactions, 10, 509-511. Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every psychologist and educator should know (pp. 65-104). Mahwah, NJ: Erlbaum. Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70, 857-860. Wright, B. D., & Masters, G. S. (1982). Rating scale analysis. Chicago: MESA Press. Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press. Wright, B., & Masters, G. (1997). The partial credit model. In W. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 101-121). New York: Springer.
SPECIAL TOPIC
A Review of Psychometric Theory
Carlo Magno

This special topic presents the nature of psychometrics, including issues in psychological measurement, its relevant theories, and its current practice. The basic scaling models are discussed because scaling is the process that enables the quantification of psychological constructs. The issues and research trends in classical test theory and item response theory, with their different models and their implications for test construction, are also explained.

The Nature of Psychometrics and Issues in Measurement
Psychometrics concerns itself with the science of measuring psychological constructs such as ability, personality, affect, and skills. Psychological measurement methods are crucially important for basic research in psychology, since research in psychology involves the measurement of variables in order to conduct further analysis. In the past, obtaining adequate measurement of psychological constructs was considered an issue in the science of psychology. Some references indicate that certain psychological constructs are deemed unobservable and difficult to quantify. This issue is reinforced by the fact that psychological theories are filled with variables that either cannot be measured at all at present or can be measured only approximately (Kaplan & Saccuzzo, 1997), such as anxiety, creativity, dogmatism, achievement, motivation, attention, and frustration. Immanuel Kant even argued that "it is impossible to have a science of psychology because the basic data could not be observed and measured." The field of psychological measurement has nonetheless advanced, and practitioners of psychometrics have been able to deal with these issues and devise methods built on the basic premise of scientific observation and measurement. Most psychological constructs involve subjective experiences such as feelings, sensations, and desires; when individuals make judgments, state their preferences, and talk about these experiences, measurement can take place, and it thus meets the requirements of scientific inquiry. It is very much possible to assign numbers to psychological constructs so as to represent quantities of attributes, and even to formulate rules that standardize the measurement process. Standardizing psychological measurement requires a process of abstraction in which psychological attributes are observed in relation to other constructs such as attitude and achievement (Magno, 2003). This process allows researchers to establish associations among variables, as in construct validation and criterion-prediction studies. Emphasizing the measurement of psychological constructs also forces researchers and test developers to consider carefully the nature of a construct before attempting to measure it; this involves a thorough literature review on the conceptual definition of an attribute before valid test items are constructed. It is likewise common practice in psychometrics to use numerical scores to communicate the amount of an attribute that an individual possesses. Quantification is thus intertwined with the concept of measurement. In the process of quantification, mathematical systems and statistical procedures are used to examine the internal relationships among the data obtained through a measure. Such procedures enable psychometrics to build theories and to consider itself part of the system of science.
Branches of Psychometric Theory
There are two branches of psychometric theory: classical test theory and item response theory. Both enable users to predict the outcomes of psychological tests by identifying parameters such as item difficulty and the ability of test takers, and both are concerned with improving the reliability of psychological tests.

Classical Test Theory
Classical test theory is often referred to as "true score theory." The theory starts from the assumption that systematic effects between responses of examinees are due only to variation in the ability of interest. All other potential sources of variation existing in the testing materials, such as external conditions or internal conditions of examinees, are assumed either to be constant through rigorous standardization or to have an effect that is nonsystematic or random in nature (van der Linden & Hambleton, 2004). The central model of classical test theory is that an observed test score (TO) is composed of a true score (T) and an error score (E), where the true and error scores are independent. These variables were established by Spearman (1904) and Novick (1966) and are best illustrated in the formula:

TO = T + E

Classical test theory assumes that each individual has a true score that would be obtained if there were no errors in measurement. However, because measuring instruments are imperfect, the score observed for each person may differ from the individual's true ability. The difference between the true score and the observed test score results from measurement error. Using a variety of justifications, error is often assumed to be a random variable with a normal distribution. The implication of classical test theory for test takers is that tests are fallible, imprecise tools. The score achieved by an individual is rarely the individual's true score; the true score for an individual is assumed not to change with repeated applications of the same test. The observed score is almost always the true score influenced by some degree of error, and this error pushes the observed score higher or lower. Theoretically, the standard deviation of the distribution of random errors for each individual indicates the magnitude of measurement error. It is usually assumed that the distribution of random errors is the same for all individuals. Classical test theory uses the standard deviation of errors as the basic measure of error, usually called the standard error of measurement. In practice, the standard deviation of the observed scores and the reliability of the test are used to estimate the standard error of measurement (Kaplan & Saccuzzo, 1997). The larger the standard error of measurement, the less accurately an attribute is measured; conversely, a small standard error of measurement indicates that an individual's observed score is probably close to the true score. The standard error of measurement is calculated with the formula:

Sm = S√(1 − r)

where S is the standard deviation of the observed scores and r is the reliability coefficient of the test.
Standard errors of measurement are used to create confidence intervals around specific observed scores (Kaplan & Saccuzzo, 1997). The lower and upper bounds of the confidence interval approximate the value of the true score. Traditionally, methods of analysis based on classical test theory have been used to evaluate tests. The focus of the analysis is on the total test score, the frequency of correct responses (to indicate question difficulty), the frequency of responses across options (to examine distractors), the reliability of the test, and the item-total correlation (to evaluate discrimination at the item level) (Impara & Plake, 1997).
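As a quick worked illustration (added here as a minimal sketch; the test statistics are hypothetical and not from the original text), suppose a test has an observed-score standard deviation of 8 and a reliability of .84. The standard error of measurement and an approximate 95% confidence interval around an observed score of 40 can be computed as follows:

```python
import math

S = 8.0        # standard deviation of observed scores (hypothetical)
r = 0.84       # reliability coefficient of the test (hypothetical)
observed = 40  # an examinee's observed score

sem = S * math.sqrt(1 - r)           # Sm = S * sqrt(1 - r) = 3.2
lower = observed - 1.96 * sem        # approximate 95% confidence bounds
upper = observed + 1.96 * sem

print(f"SEM = {sem:.2f}")                      # SEM = 3.20
print(f"95% CI: {lower:.1f} to {upper:.1f}")   # 95% CI: 33.7 to 46.3
```

The wider the interval, the less precisely the observed score pins down the examinee's true score, which is the point the formula is meant to convey.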
Although these statistics have been widely used, one limitation is that they relate to the sample under scrutiny, and thus all the statistics that describe items and questions are sample dependent (Hambleton, 2000). This critique may not be particularly relevant where successive samples are reasonably representative and do not vary across time, but this needs to be confirmed, and complex strategies have been proposed to overcome the limitation.

Item Response Theory
Another branch of psychometric theory is item response theory (IRT). IRT may be regarded as roughly synonymous with latent trait theory. It is sometimes referred to as strong true score theory or modern mental test theory because IRT is a more recent body of theory and makes stronger assumptions than classical test theory. This approach to testing, based on item analysis, considers the chance of getting particular items right or wrong. In this approach, each item on a test has its own item characteristic curve that describes the probability of getting the item right or wrong given the ability of the test takers (Kaplan & Saccuzzo, 1997). Another fundamental feature of the theory is that item performance is related to the estimated amount of the respondent's latent trait (Anastasi & Urbina, 2002). A latent trait, symbolized as theta (θ), refers to a statistical construct. In cognitive tests, the latent trait is the ability measured by the test, and the total score on a test is taken as an estimate of that ability. The model expresses how a person of specified ability (θ) succeeds on an item of specified difficulty. There are various approaches to the construction of tests using item response theory. Some approaches use two parameters: item discriminations and item difficulties. Other approaches add a third parameter for the probability that test takers with very low levels of ability obtain a correct response (as demonstrated in Figure 2). Still other approaches, such as the Rasch model, use only the difficulty parameter. All these approaches characterize an item in relation to the probability that examinees who do well or poorly on the test will show different levels of performance on it.

Two-Parameter Model / Normal-Ogive Model. The ogive model postulates a normal cumulative distribution function as the response function for an item. In the model, an item's difficulty is the point on the ability scale where an examinee has a probability of success on the item of .50 (van der Linden & Hambleton, 2004); that is, the difficulty of each item is defined by the 50% threshold, which is customary in establishing sensory thresholds in psychophysics. The discriminative power of each item, represented by a curve in the graph, is indicated by its steepness: the steeper the curve, the higher the correlation of item performance with the total score and the higher the discrimination index. The original idea of the model can be traced back to Thurstone's use of the normal model in his discriminal dispersion theory of stimulus perception (Thurstone, 1927).
Researchers in psychophysics study the relation between the physical properties of stimuli and their perception by human subjects. (Scaling of stimuli is presented in more detail later in this special topic.) In this process, a stimulus is presented to the subject, who reports whether the stimulus is detected. Detection increases as the stimulus intensity increases, and given this pattern, the cumulative distribution, with appropriate parametrization, was used as the response function.
Three-Parameter Model / Logistic Model. When ability (θ) is plotted against the probability of a correct response Pi(θ) in the three-parameter model, the slope of the curve indicates the item discrimination: the higher the value of the item discrimination, the steeper the slope. In this model, Birnbaum (1950) proposed a third parameter to account for the nonzero performance of low-ability examinees on multiple-choice items. This nonzero performance is due to the probability of guessing the correct answers to multiple-choice items (van der Linden & Hambleton, 2004).
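To make the role of each parameter concrete, the sketch below (added for illustration, not part of the original text) computes the probability of a correct response under one-, two-, and three-parameter logistic models for a hypothetical item with difficulty b = 0, discrimination a = 1.5, and guessing parameter c = .20; the parameter values are invented for the example.

```python
import math

def p_1pl(theta, b):
    """One-parameter (Rasch-type) logistic item characteristic curve."""
    return 1 / (1 + math.exp(-(theta - b)))

def p_2pl(theta, a, b):
    """Two-parameter logistic model: adds an item discrimination a."""
    return 1 / (1 + math.exp(-a * (theta - b)))

def p_3pl(theta, a, b, c):
    """Three-parameter logistic model: adds a lower asymptote c for guessing."""
    return c + (1 - c) * p_2pl(theta, a, b)

a, b, c = 1.5, 0.0, 0.20   # hypothetical item parameters
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(p_1pl(theta, b), 2),
          round(p_2pl(theta, a, b), 2),
          round(p_3pl(theta, a, b, c), 2))
```

At θ = b, the one- and two-parameter curves both give a probability of .50, while the guessing parameter in the three-parameter model raises the lower asymptote, which is why low-ability examinees still show nonzero success on multiple-choice items.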
Figure 2. Hypothetical item characteristic curves for three items. The item difficulty parameters (b1, b2, b3) correspond to the locations on the ability axis at which the probability of a correct response is .50. The curves show that item 1 is easier, while items 2 and 3 have the same difficulty at the .50 probability of a correct response. Estimates of item parameters and ability are typically computed through successive approximation procedures in which the approximations are repeated until the values stabilize.

One-Parameter Model / Rasch Model. The Rasch model is based on the assumption that both guessing and item differences in discrimination are negligible. In constructing tests, proponents of the Rasch model frequently discard items that do not meet these assumptions (Anastasi & Urbina, 2002). Rasch began his work in educational and psychological measurement in the late 1940s. Early in the 1950s he developed his Poisson models for reading tests and a model for intelligence and achievement tests, later called the "structure models for items in a test," which is known today as the Rasch model. Rasch's (1960) main motivation for his model was to eliminate references to populations of examinees in the analysis of tests. According to him, test analysis would only be worthwhile if it were individual centered, with separate parameters for the items and the examinees (van der Linden & Hambleton, 2004). His work marked IRT with its probabilistic modeling of the interaction between an individual item and an individual examinee. The Rasch model is a probabilistic unidimensional model which asserts that (1) the easier the question, the more likely the student will respond to it correctly, and (2) the more able the student, the more likely he or she will pass the question compared to a less able student. The Rasch model was derived from the initial Poisson model illustrated in the formula:
ε = δ / θ

where ε is a function of the parameters describing the ability of the examinee and the difficulty of the test, θ represents the ability of the examinee, and δ represents the difficulty of the test, which is estimated from the summation of errors in the test. The model was later refined to assume that the probability that a student will answer a question correctly is a logistic function of the difference between the student's ability [θ] and the difficulty of the question [δ] (i.e., the ability required to answer the question correctly), and a function of that difference only, giving way to the Rasch model:

P(X = 1 | θ, δ) = e^(θ − δ) / (1 + e^(θ − δ))
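As a brief numerical illustration (added here with made-up values), the logistic form implies that a student whose ability exceeds the question's difficulty by one logit has about a .73 chance of answering correctly, while equal ability and difficulty give exactly .50:

```python
import math

def rasch_p(theta, delta):
    """Probability of a correct response under the Rasch model."""
    return math.exp(theta - delta) / (1 + math.exp(theta - delta))

print(round(rasch_p(1.0, 0.0), 2))   # ability one logit above difficulty -> 0.73
print(round(rasch_p(0.0, 0.0), 2))   # ability equals difficulty -> 0.50
print(round(rasch_p(-1.0, 0.0), 2))  # ability one logit below difficulty -> 0.27
```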
From this, the expected pattern of responses to the questions can be determined given the estimated θ and δ. Even though each response to each question must depend upon the student's ability and the question's difficulty, in the data analysis it is possible to condition out or eliminate the students' abilities (by taking all students at the same score level) in order to estimate the relative question difficulties (Andrich, 2004; Dobby & Duckworth, 1979). Thus, when the data fit the model, the relative difficulties of the questions are independent of the relative abilities of the students, and vice versa (Rasch, 1977). A further consequence of this invariance is that it justifies the use of the total score (Wright & Panchapakesan, 1969). In this kind of analysis, the estimation is typically done through a pairwise conditional maximum likelihood algorithm. The Rasch model is appropriate for modeling dichotomous responses and models the probability of an individual's correct response on a dichotomous item. The logistic item characteristic curve, a function of ability, forms the boundary between the probability areas of answering an item incorrectly and answering it correctly. This one-parameter logistic model assumes that the discriminations of all items are equal to one (Maier, 2001). According to Fischer (1974), the Rasch model can be derived from the following assumptions: (1) Unidimensionality: all items are functionally dependent upon only one underlying continuum. (2) Monotonicity: all item characteristic functions are strictly monotonic in the latent trait; the item characteristic function describes the probability of a predefined response as a function of the latent trait. (3) Local stochastic independence: every person has a certain probability of giving a predefined response to each item, and this probability is independent of the answers given to the preceding items. (4) Sufficiency of a simple sum statistic: the number of predefined responses is a sufficient statistic for the latent parameter. (5) Dichotomy of the items: for each item there are only two different responses, for example positive and negative. The Rasch model requires that an additive structure underlie the observed data. This additive structure applies to the logit of Pij, where Pij is the probability that subject i will give a predefined response to item j, being the sum of a subject scale value ui and an item scale value vj, i.e.,

ln(Pij / (1 − Pij)) = ui + vj

There are various applications of the Rasch model in test construction, such as the item-mapping method (Wang, 2003) and the hierarchical measurement method (Maier, 2001).
Rasch Standard-setting Through Item-mapping. According to Wang (2003), it is logical to justify the use of an item-mapping method for establishing passing scores for multiple-choice licensure and certification examinations. In that study, the researcher sought to determine a score that marks a passing level of competency, using the Angoff procedure as the standard-setting method within the Rasch model. The Angoff (1971) procedure, with various modifications, is the most widely used for multiple-choice licensure and certification examinations (Plake, 1998). As part of the Angoff standard-setting process, judges are asked to estimate the proportion (or percentage) of minimally competent candidates (MCCs) who will answer an item correctly. These item performance estimates are aggregated across items and averaged across judges to yield the recommended cut score. As noted (Chang, 1999; Impara & Plake, 1997; Kane, 1994), the adequacy of a judgmental standard-setting method depends on whether the judges adequately conceptualize the minimal competency of candidates, and whether the judges accurately estimate item difficulty based on their conceptualized minimal competency. A major criticism of the Angoff method is that judges' estimates of item difficulties for minimal competency are likely to be inaccurate, and sometimes inconsistent and contradictory (Bejar, 1983; Goodwin, 1999; Mills & Melican, 1988; National Academy of Education [NAE], 1993; Reid, 1991; Shepard, 1995). Studies have found that judges are able to rank order items accurately in terms of item difficulty, but they are not particularly accurate in estimating item performance for target examinee groups (Impara & Plake, 1998; National Research Council, 1999; Shepard, 1995). A fundamental flaw of the Angoff method is that it requires judges to perform the nearly impossible cognitive task of estimating the probability of MCCs answering each item in the pool correctly (Berk, 1996; NAE, 1993). An item-mapping method, which applies the Rasch IRT model to the standard-setting process, has been used to remedy this cognitive deficiency of the Angoff method for multiple-choice licensure and certification examinations (McKinley, Newman, & Wiser, 1996). The Angoff method limits judges to each individual item while they make an immediate judgment of item performance for MCCs. In contrast, the item-mapping method presents a global picture of all items and their estimated difficulties in the form of a histogram chart (item map), which serves to guide and simplify the judges' decision making during the cut score study. The item difficulties are estimated through application of the Rasch IRT model. Like all IRT scaling methods, the Rasch estimation procedures can place item difficulty and candidate ability on the same scale. An additional advantage of the Rasch measurement scale is that the difference between a candidate's ability and an item's difficulty determines the probability of a correct response (Grosse & Wright, 1986). When candidate ability equals item difficulty, the probability of a correct answer to the item is .50. Unlike the Angoff method, which requires judges to estimate the probability of an MCC's success on an item, the item-mapping method provides the probability (i.e., .50) and asks judges to determine whether an MCC has this probability of answering an item correctly.
By utilizing the Rasch model's distinct relationship between candidate ability and item difficulty, the item-mapping method enables judges to determine the passing score at the point where the item difficulty equals the MCC's ability level. The item-mapping method incorporates item performance into the standard-setting process by graphically presenting item difficulties. In item mapping, all the items for a given examination are ordered in columns, with each column in the graph representing a different item difficulty. The columns of items are ordered from easy to hard on a histogram-type graph, with very easy items toward the left end of the graph and very hard items toward the right end. Item difficulties in log-odds units (logits) are estimated through application of the Rasch IRT model (Wright & Stone, 1979). In order to present items on a metric familiar to the judges, logit difficulties are converted to scaled values using the following formula: scaled difficulty = (logit difficulty × 10) + 100. This scale usually ranges from 70 to 130.
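The conversion and the grouping of items into two-point-wide columns can be illustrated with a short sketch (added for illustration; the item names and logit values are hypothetical):

```python
import math

# Convert Rasch logit difficulties to the judges' scale and group the items
# into the two-point-wide columns used on an item map.
logit_difficulties = {"item01": -2.05, "item02": -1.20, "item03": 0.00,
                      "item04": 0.35, "item05": 1.80}   # hypothetical values

columns = {}
for item, logit in logit_difficulties.items():
    scaled = logit * 10 + 100                  # scaled difficulty = (logit x 10) + 100
    center = 2 * math.floor((scaled + 1) / 2)  # even column label covering [center-1, center+1)
    columns.setdefault(center, []).append(item)

for center in sorted(columns):
    print(center, columns[center])
# 80 ['item01'], 88 ['item02'], 100 ['item03'], 104 ['item04'], 118 ['item05']
```

Each printed column corresponds to one bar of the item map described below, so judges see the items arranged from easy (left) to hard (right).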
Figure 3. Example of an item map. In the example, the abscissa of the graph represents the rescaled item difficulty. Any one column has items within two points of each other. For example, the column labeled "80" has items with scaled difficulties ranging from 79 to values less than 81. Using the scaling equation, this column of items would have a range of logit difficulties from -2.1 to values less than -1.9, yielding a 0.2 logit difficulty range for items in this column. Similarly, the next column to its right has items with scaled difficulties ranging from 81 to values less than 83 and a range of logit difficulties from -1.9 to values less than -1.7. In fact, there is a 2-point range (1 point below the labeled value and 1 point above the labeled value) for all the columns on the item map. Within each column, items are displayed in order by item ID number and can be identified by color- and symbol-coded test content areas. By marking the content areas of the items on the map, a representative sample of items within each content area can be rated in the standard-setting process. The goal of item mapping is to locate a column of items on the histogram where the judges can reach consensus that the MCC has a .50 chance of answering the items correctly.

Rasch Hierarchical Measurement Method. In a study by Maier (2001), a hierarchical measurement model was developed that enables researchers to measure a latent trait variable and model the error variance corresponding to multiple levels. The Rasch hierarchical measurement model (HMM) results when a Rasch IRT model and a one-way ANOVA with random effects are combined. Item response theory models and hierarchical linear models can be combined to model the effect of multilevel covariates on a latent trait. Through this combination, researchers may examine relationships between person-ability estimates and the person-level and contextual-level characteristics that may affect these ability estimates. Alternatively, it is also possible to model data obtained from the same individuals across repeated questionnaire administrations, which makes it possible to study the effect of person characteristics on ability estimates over time.

Advantages of IRT
One benefit of item response theory is its treatment of reliability and error of measurement through item information functions, which are computed for each item (Lord, 1980). These functions provide a sound basis for choosing items in test construction.
The item information function takes all item parameters into account and shows the measurement efficiency of the item at different ability levels. Another advantage of item response theory is the invariance of item parameters, which pertains to the sample-free nature of its results. In the theory, the item parameters are invariant when computed in groups of different abilities. This means that a uniform scale of measurement can be provided for use in different groups. It also means that groups, as well as individuals, can be tested with different sets of items appropriate to their ability levels, and their scores will be directly comparable (Anastasi & Urbina, 2002).

Scaling Models
"Measurement essentially is concerned with the methods used to provide quantitative descriptions of the extent to which individuals manifest or possess specified characteristics" (Ghiselli, Campbell, & Zedeck, 1981, p. 2). "Measurement is the assigning of numbers to individuals in a systematic way as a means of representing properties of the individuals" (Allen & Yen, 1979, p. 2). "'Measurement' consists of rules for assigning symbols to objects so as to (1) represent quantities of attributes numerically (scaling) or (2) define whether the objects fall in the same or different categories with respect to a given attribute (classification)" (Nunnally & Bernstein, 1994, p. 3). There are important aspects to consider in the process of measurement in psychometrics. First, the attribute of interest needs to be quantified; that is, numbers designate how much (or how little) of an attribute an individual possesses. Second, the attribute of interest must be quantified in a consistent and systematic way (i.e., standardization), so that when the measurement process is replicated, meaningful replication is possible. Finally, it is the attributes of individuals (or objects) that are measured, not the individuals per se.

Levels of Measurement
As the definition of Nunnally and Bernstein (1994) suggests, by systematically measuring the attribute of interest, individuals can either be classified or scaled with regard to that attribute. Whether one engages in classification or scaling depends in large part on the level of measurement used to assess a construct. For example, if the attribute is measured on a nominal scale, then it is only possible to classify individuals as falling into one or another mutually exclusive category (Agresti & Finlay, 2000). This is because the different categories (e.g., men versus women) represent only qualitative differences. Nominal scales are used as measures of identity (Downie & Heath, 1984). When gender is coded, such as males coded 0 and females coded 1, the values do not have any quantitative meaning; they are simply labels for the gender categories. At the nominal level of measurement there are a variety of sorting techniques, in which subjects are asked to sort stimuli into different categories based on some dimension. Some data reflect the rank order of individuals or objects, such as a scale evaluating the beauty of a person from highest to lowest (Downie & Heath, 1984). This represents an ordinal scale of measurement, where objects are simply rank ordered. An ordinal scale does not indicate how much more of an attribute one object has than another; it can only be determined that A is hotter than B, for instance, if A is ranked higher than B. At the ordinal level of measurement, the Q-sort method, paired comparisons, Guttman's scalogram, Coombs's unfolding technique, and a variety of rating scales can be used.
The major task of the subject is to rank order the items from highest to lowest or from weakest to strongest. The interval scale of measurement has equal intervals between degrees on the scale; however, the zero point on the scale is arbitrary. For example, 0 degrees Celsius represents the point at which water freezes at sea level, so zero on the scale does not represent "true zero," which in this case would mean a
complete absence of heat. In determining the area of a table, a ratio scale of measurement is used because zero does represent "true zero." When the construct of interest is measured at the nominal (i.e., qualitative) level of measurement, objects are only classified into categories. As a result, the types of data manipulations and statistical analyses that can be performed on the data are very limited. In terms of descriptive statistics, it is possible to compute frequency counts or determine the modal response (i.e., category), but not much else. However, if it is at least possible to rank order the objects based on the degree to which they possess the construct of interest, then it becomes possible to scale the construct. In addition, higher levels of measurement allow for more in-depth statistical analyses. With ordinal data, for example, statistics such as the median, range, and interquartile range can be computed (Downie & Heath, 1984). When the data are interval-level, it is possible to calculate statistics such as means, standard deviations, variances, and the various statistics of shape (e.g., skewness and kurtosis). With interval-level data, it is important to know the shape of the distribution, as different-shaped distributions imply different interpretations for statistics such as the mean and standard deviation. At the interval and ratio levels of measurement there are direct estimation, the method of bisection, and Thurstone's methods of comparative and categorical judgment. With these methods, subjects are asked not only to rank order items but also to help determine the magnitude of the differences among items. With Thurstone's method of comparative judgment, subjects compare every possible pair of stimuli and select the item within the pair that is the better item for assessing the construct. Thurstone's method of categorical judgment, while less tedious for subjects when there are many stimuli to assess in that they simply rate each stimulus (not each pair of stimuli), requires more cognitive energy for each rating provided, because the rater (often a subject matter expert) must now estimate the actual value of the stimulus.

Unidimensional Scaling Models
Psychological measurement is typically most interested in scaling some characteristic, trait, or ability of a person. It determines how much of an attribute of interest a given person possesses, which allows the estimation of the degree of inter-individual and intra-individual differences among subjects on the attribute of interest. There are various targets of scaling, such as the stimuli given to individuals as well as the responses that individuals provide.

Scaling for Stimuli (Psychophysics)
Scaling of stimuli is most prominent in the area of psychophysics or sensory/perception psychology, which focuses on physical phenomena and whose roots date back to mid-19th century Germany. It was not until the 1920s that Thurstone began to apply the same scaling principles to scaling psychological attitudes. In the process of psychophysical scaling, one factor is held constant (e.g., responses), a second is collapsed across (e.g., stimuli), and the third factor (e.g., individuals) is scaled. With psychological scaling, however, it is typical to ask participants to provide their professional judgment of the particular stimuli, regardless of their personal feelings or attitudes toward the topic or stimulus. This may include ratings of how well different stimuli represent the construct and at what level of intensity the construct is represented.
In scaling for stimuli, research issues frequently concern the exact nature of the functional relations between the scalings of the stimuli obtained under different circumstances (Nunnally, 1970). There are a variety of ways of scaling stimuli through psychophysical methods. Psychophysical methods examine the relationship between the placement of objects on two scales and attempt to establish principles or laws that connect the two (Roberts, 1999). The psychophysical methods listed below include, among others, rank order, constant stimuli, and successive categories.
(1) Method of Adjustment – An experimental paradigm in which the subject makes small adjustments to a comparison stimulus until it matches a standard stimulus. The intensity of the stimulus is adjusted until the target is just detectable.
(2) Method of Limits – The intensity is adjusted in discrete steps until the observer reports that the stimulus is just detectable.
(3) Method of Constant Stimuli – The experimenter controls the stimuli. Several stimulus values are chosen to bracket the assumed threshold, and each stimulus is presented many times in random order. The psychometric function is derived from the proportion of detection responses.
(4) Staircase Method – Used to determine a threshold as quickly as possible; a compromise between the method of limits and the method of constant stimuli.
(5) Method of Forced Choice (2AFC) – The observer must choose between two or more options. This is useful in cases where observers are less willing to guess.
(6) Method of Average Error – The subject is presented with a standard stimulus and then undergoes trials in which he or she adjusts a comparison stimulus to match the standard.
(7) Rank order – Requires the subject to rank stimuli from most to least with respect to some attribute of judgment or sentiment.
(8) Paired comparison – The subject is required to judge the stimuli two at a time in all possible pairs.
(9) Successive categories – The subject is asked to sort a collection of stimuli into a number of distinct piles or categories, which are ordered with respect to a specified attribute.
(10) Ratio judgment – The experimenter selects a standard stimulus and a number of variable stimuli that differ quantitatively from the standard stimulus on a given characteristic. The subject selects, from the range of variable stimuli, the stimulus whose amount of the given characteristic corresponds to the specified ratio value.
(11) Q sort – Subjects are required to sort the stimuli into an approximately normal distribution, with the number of stimuli to be placed in each category specified in advance.
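As one concrete illustration of how paired-comparison data can be turned into a scale, the sketch below applies Thurstone's Case V solution (column means of the normal-deviate transforms of the choice proportions) to a small hypothetical matrix. The proportions are invented for the example, and the code is only a sketch of the general procedure rather than a full treatment.

```python
from statistics import NormalDist

# p[i][j] = proportion of judges who preferred stimulus j over stimulus i
# (hypothetical data for three stimuli A, B, C)
labels = ["A", "B", "C"]
p = [[0.50, 0.70, 0.90],
     [0.30, 0.50, 0.75],
     [0.10, 0.25, 0.50]]

# Transform proportions to normal deviates (z scores)
z = [[NormalDist().inv_cdf(pij) for pij in row] for row in p]

# Case V scale value of each stimulus = mean of its column of z values
scale = [sum(z[i][j] for i in range(3)) / 3 for j in range(3)]
for label, s in zip(labels, scale):
    print(label, round(s, 2))
# A gets the lowest value and C the highest on the latent preference scale
```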
Scaling for People (Psychometrics)
Many issues arise when performing a scaling study. One important factor is who is selected to participate in the study. Many scaling efforts involve some psychological (latent) dimension of people without any direct counterpart "physical" dimension. When people are scaled (psychometrics), it is typical to obtain a random sample of individuals from the population to which one wishes to generalize. In psychometrics, participants are asked to provide their individual feelings, attitudes, and/or personal ratings toward a particular topic, and in doing so one is able to determine how individuals differ on the construct of interest. With stimulus scaling, however, the researcher sums across raters within a given stimulus (e.g., a question) in order to obtain a rating of each stimulus. Once the researcher is confident that each stimulus did, in fact, tap into the construct and has some estimate of the level at which it did so, only then should the researcher feel confident in presenting the now scaled stimuli to a random sample of relevant participants for psychometric purposes. Thus, with psychometrics, items (i.e., stimuli) are summed across within an individual respondent in order to obtain his or her score on the construct. The major requirement in scaling for people is that the variables should be monotonically related to each other. A relationship is monotonic if higher scores on one scale correspond to higher scores on another scale, regardless of the shape of the curve (Nunnally, 1970). In scaling for people, many items on a test are used to minimize measurement error. The specificity of individual items averages out when they are combined, and by combining items one can make relatively fine distinctions between people. The problem of
scaling people with respect to attributes is then one of collapsing responses to a number of items so as to obtain one score for each person. One variety of scaling for people is the deterministic model, which assumes that there is no error in the item trace lines. A trace line shows that a person with a high level of the attribute would have a probability close to 1.0 of giving the keyed response. The model assumes that up to a point on the attribute the probability of response alpha is zero, and beyond that point the probability of response alpha is 1.0. Each item has a biserial correlation of 1.0 with the attribute, and consequently each item discriminates perfectly at a particular point of the attribute. There are several scaling models for people, including Thurstone scaling, the Likert scale, the Guttman scale, and semantic differential scaling.
(1) Thurstone scaling – Around 300 judges rate 100 or so statements on a particular issue on an 11-point scale. A subset of statements is then shown to respondents, and their score is the mean of the scale values of the statements they endorse.
(2) Likert scale – Respondents are requested to state their level of agreement with a series of attitude statements. Each scale point is given a value (say, 1 to 5), and the person is given the score corresponding to his or her degree of agreement. Often a set of Likert items is summed to provide a total score for the attitude.
(3) Guttman scale – This involves producing a set of statements that form a natural hierarchy. A positive answer to an item at one point on the hierarchy implies positive answers to all the statements below it (e.g., a disability scale). This gets over the problem of the same item total being formed by different sets of responses.

Scaling Responses
The third category, responses, which is typically said to be held constant, also needs to be identified. That is, a decision is made about the fashion in which subjects will respond to the stimuli. Such response options may include requiring participants to make comparative judgments (e.g., which is more important, A or B?), subjective evaluations (e.g., strongly agree to strongly disagree), or absolute judgments (e.g., how hot is this object?). Different response formats may well influence how stimuli are written and edited. In addition, they may also influence how one evaluates the quality or "accuracy" of the responses. For example, with absolute judgments, standards of comparison are used, especially if subjects are being asked to rate physical characteristics such as weight, height, or intensity of sound or light. With attitudes and psychological constructs, such "standards" are hard to come by. There are a few options (e.g., Guttman's scalogram and Coombs's unfolding technique) for simultaneously scaling people and stimuli, but more often than not only one dimension is scaled at a time. Typically, the stimuli are scaled first (or a well-established measure is sought) before one has confidence in scaling individuals on those stimuli.

Multidimensional Scaling Models
With unidimensional scaling, as described previously, subjects are asked to respond to stimuli with regard to a particular dimension. With multidimensional scaling (MDS), however, subjects are typically asked to give just their general impression or broad rating of the similarities or differences among stimuli. Subsequent analyses, using Euclidean spatial models, "map" the stimuli (e.g., products) in multidimensional space.
The multiple dimensions would then be "discovered" or "extracted" with multivariate statistical techniques, thus establishing which dimensions the consumer is using to distinguish the products. MDS can be particularly useful when subjects are unable to articulate why they like a stimulus, yet they are confident that they prefer one stimulus to another.
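The sketch below (an illustration only, using scikit-learn and an invented dissimilarity matrix for four stimuli) shows the typical MDS workflow of mapping judged dissimilarities into a two-dimensional space that can then be interpreted:

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical averaged dissimilarity ratings among four stimuli (0 = identical)
dissimilarities = np.array([
    [0.0, 2.0, 6.0, 7.0],
    [2.0, 0.0, 5.0, 6.5],
    [6.0, 5.0, 0.0, 1.5],
    [7.0, 6.5, 1.5, 0.0],
])

# Map the stimuli into two dimensions; the axes are then interpreted
# (e.g., by inspecting which stimuli cluster together).
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coordinates = mds.fit_transform(dissimilarities)
print(np.round(coordinates, 2))  # one (x, y) point per stimulus
```

Stimuli judged similar end up close together in the recovered space, which is what allows the researcher to "discover" the dimensions respondents are implicitly using.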
References
Agresti, A., & Finlay, B. (1997). Statistical methods for the social sciences (3rd ed.). New Jersey: Prentice Hall. Anastasi, A., & Urbina, S. (2002). Psychological testing. New York: Prentice Hall. Andrich, D. (1998). Rasch models for measurement. Sage University: Sage Publications. Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education. Bejar, I. I. (1983). Subject matter experts' assessment of item statistics. Applied Psychological Measurement, 7, 303-310. Berk, R. A. (1996). Standard setting: The next generation. Applied Measurement in Education, 9, 215-235. Chang, L. (1999). Judgmental item analysis of the Nedelsky and Angoff standard-setting methods. Applied Measurement in Education, 12, 151-166. Crocker, L. M., & Algina, J. (1986). Introduction to classical and modern test theory. Belmont, CA: Wadsworth. Dobby, J., & Duckworth, D. (1979). Objective assessment by means of item banking. Schools Council Examination Bulletin, 40, 1-10. Downie, N. M., & Heath, R. W. (1984). Basic statistical methods (5th ed.). New York: Harper & Row Publishers. Fischer, G. H. (1974). Derivations of the Rasch model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments and applications (pp. 15-38). New York: Springer-Verlag. Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. New York: W. H. Freeman. Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill. Goodwin, L. D. (1999). Relations between observed item difficulty levels and Angoff minimum passing levels for a group of borderline candidates. Applied Measurement in Education, 12, 13-28. Grosse, M. E., & Wright, B. D. (1986). Setting, evaluating, and maintaining certification standards with the Rasch model. Evaluation and the Health Professions, 9, 267-285. Hambleton, R. K. (2000). Emergence of item response modeling in instrument development and data analysis. Medical Care, 38, 60-65.
Impara, J. C., & Plake, B. S. (1997). Standard setting: An alternative approach. Journal of Educational Measurement, 34, 353-366. Impara, J. C., & Plake, B. S. (1998). Teachers' ability to estimate item difficulty: A test of the assumptions in the Angoff standard setting method. Journal of Educational Measurement, 35, 69-81. Kane, M. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64, 425-461. Kaplan, R. M., & Saccuzzo, D. P. (1997). Psychological testing: Principles, applications and issues. Pacific Grove, CA: Brooks/Cole Publishing Company. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum. Magno, C. (2003). Relationship between attitude towards technical education and academic achievement in mathematics and science of the first and second year high school students, Caritas Don Bosco School, SY 2002-2003. Unpublished master's thesis, Ateneo de Manila University, Quezon City, Manila. Maier, K. S. (2001). A Rasch hierarchical measurement model. Journal of Educational and Behavioral Statistics, 26, 307-331. McKinley, D. W., Newman, L. S., & Wiser, R. F. (1996, April). Using the Rasch model in the standard-setting process. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY. Mills, C. N., & Melican, G. J. (1988). Estimating and adjusting cutoff scores: Future of selected methods. Applied Measurement in Education, 1, 261-275. National Academy of Education (1993). Setting performance standards for student achievement. Stanford, CA: Author. National Research Council (1999). Setting reasonable and useful performance standards. In J. W. Pellegrino, L. R. Jones, & K. J. Mitchell (Eds.), Grading the nation's report card: Evaluating NAEP and transforming the assessment of educational progress (pp. 162-184). Washington, DC: National Academy Press. Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3, 1-18. Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill. Plake, B. S. (1998). Setting performance standards for professional licensure and certification. Applied Measurement in Education, 11, 65-80. Reid, J. B. (1991). Training judges to generate standard-setting data. Educational Measurement: Issues and Practice, 10, 11-14.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danish Institute for Educational Research. Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. In The Danish yearbook of philosophy (pp. 58-94). Copenhagen: Munksgaard. Shepard, L. A. (1995). Implications for standard setting of the National Academy of Education evaluation of the National Assessment of Educational Progress achievement levels. Proceedings of the Joint Conference on Standard Setting for Large-Scale Assessments (pp. 143-160). Washington, DC: The National Assessment Governing Board (NAGB) and the National Center for Education Statistics (NCES). Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101. Thurstone, L. L. (1927). The unit of measurements in educational scales. Journal of Educational Psychology, 18, 505-524. Torgerson, W. S. (1958). Theory and methods of scaling. New York: Wiley. Wang, N. (2003). Use of the Rasch IRT model in standard setting: An item-mapping method. Journal of Educational Measurement, 40, 231. Van der Linden, W. J., & Hambleton, R. K. (2004). Item response theory: Brief history, common models, and extension. New York: McGraw-Hill. van der Ven, A. H. G. S. (1980). Introduction to scaling. New York: Wiley. Wright, B. D., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29, 23-48. Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: MESA Press.
Exercise:
Calibrate the item difficulty and person ability of the scores in a Reading Comprehension test with 19 items administered to 15 Korean students. After performing the Rasch model analysis, determine the item difficulty using the classical test theory approach, then compare the results.

Case   Items 1-19 (scored 1 = correct, 0 = incorrect)
A      0 1 1 1 0 1 1 0 0 1 1 1 0 0 0 0 1 0 1
B      0 0 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0
C      0 0 0 1 0 1 0 0 0 0 1 1 0 1 1 0 1 0 1
D      0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1
E      0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
F      0 0 0 1 0 0 1 0 1 1 0 0 0 1 0 0 1 1 0
G      1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1
H      0 0 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1
I      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
J      0 0 1 1 0 0 0 0 1 0 1 0 0 1 1 0 1 1 1
K      1 0 0 1 0 0 0 0 0 1 0 1 1 1 0 1 0 1 0
L      0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1
M      0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1
N      0 0 1 1 1 1 1 1 1 0 0 1 0 1 0 0 1 0 1
O      0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1
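For readers who want to try the exercise with software, the sketch below is an illustrative outline only, not a prescribed solution. It computes the classical test theory item difficulties (proportion correct) and rough Rasch item calibrations through a simple joint maximum likelihood routine; dedicated programs such as WINSTEPS, or specialized Rasch packages, would normally be used instead, and this simple routine ignores refinements such as bias corrections and the proper treatment of zero or perfect person scores.

```python
import math

# Rows = students A-O, columns = Items 1-19 (1 = correct, 0 = incorrect)
data = [
    [0,1,1,1,0,1,1,0,0,1,1,1,0,0,0,0,1,0,1],
    [0,0,1,1,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0],
    [0,0,0,1,0,1,0,0,0,0,1,1,0,1,1,0,1,0,1],
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1],
    [0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],
    [0,0,0,1,0,0,1,0,1,1,0,0,0,1,0,0,1,1,0],
    [1,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,0,0,1],
    [0,0,1,1,0,1,0,0,0,0,0,1,1,0,0,0,0,1,1],
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,1,1,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1],
    [1,0,0,1,0,0,0,0,0,1,0,1,1,1,0,1,0,1,0],
    [0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,1],
    [0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,1,0,1],
    [0,0,1,1,1,1,1,1,1,0,0,1,0,1,0,0,1,0,1],
    [0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,1],
]
n_persons, n_items = len(data), len(data[0])

# Classical test theory: item difficulty = proportion of correct responses
ctt_p = [sum(row[j] for row in data) / n_persons for j in range(n_items)]

# Simple joint maximum likelihood estimation of Rasch person and item parameters
theta = [0.0] * n_persons   # person abilities (logits)
delta = [0.0] * n_items     # item difficulties (logits)

def p(t, d):
    return 1 / (1 + math.exp(-(t - d)))

for _ in range(200):
    for i in range(n_persons):
        r = sum(data[i])
        if 0 < r < n_items:                    # zero/perfect scores are skipped
            expected = sum(p(theta[i], delta[j]) for j in range(n_items))
            variance = sum(p(theta[i], delta[j]) * (1 - p(theta[i], delta[j]))
                           for j in range(n_items))
            step = (r - expected) / variance
            theta[i] += max(-1.0, min(1.0, step))   # damped Newton step
    for j in range(n_items):
        s = sum(row[j] for row in data)
        if 0 < s < n_persons:
            expected = sum(p(theta[i], delta[j]) for i in range(n_persons))
            variance = sum(p(theta[i], delta[j]) * (1 - p(theta[i], delta[j]))
                           for i in range(n_persons))
            step = (s - expected) / variance
            delta[j] -= max(-1.0, min(1.0, step))
    mean_d = sum(delta) / n_items              # anchor the scale: mean difficulty = 0
    delta = [d - mean_d for d in delta]

for j in range(n_items):
    print(f"Item {j+1:2d}: CTT p = {ctt_p[j]:.2f}, Rasch difficulty = {delta[j]:+.2f}")
# Items answered correctly by more students have higher CTT p values and lower
# (easier) Rasch difficulties, so the two orderings can be compared directly.
```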
References
Anastasi, A., & Urbina, S. (2002). Psychological testing (7th ed.). NJ: Prentice Hall. DiLeonardi, J. W., & Curtis, P. A. (1992). What to do when the numbers are in: A user's guide to statistical data analysis in the human services. Chicago, IL: Nelson-Hall. Kaplan, R. M., & Saccuzzo, D. P. (1997). Psychological testing: Principles, applications, and issues (4th ed.). Pacific Grove, CA: Brooks/Cole Publishing Company. Magno, C. (2007). Exploratory and confirmatory factor analysis of parental closeness and multidimensional scaling with other parenting models. The Guidance Journal, 36, 63-89. Magno, C., Lynn, J., Lee, K., & Kho, R. (in press). Parents' school-related behavior: Getting involved with a grade school and college child. The Guidance Journal.
133
Magno, C., Tangco, N., & Tan, C, (2007). The role of metacognitive skills in developing critical thinking. Paper presented at the Asian Association of Social Psychology in Universiti Malaysia, Kota Kinabalu, Sabah Malaysia, July 25 to 28. Payne, D. A. (1992). Measuring and evaluating educational outcomes. MacMillan Publishing Company: New York. Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danish Institute for Educational Research. Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. In G. M. Copenhagen (ed.). The Danish yearbook of philosophy (pp.58-94). Munksgaard. Van der Linden, W. J. & Hambleton, R. K. (2004). Item response theory: Brief history, common models, and extension. New York: Mc Graw Hill. Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: MESA Press.
Chapter 4
Developing Teacher-made Tests

Objectives
1. Explain the theories and concepts that rationalize the practice of assessment.
2. Make a table of specification of the test items.
3. Design pen-and-paper tests that are aligned to the learning intents.
4. Justify the advantages and disadvantages of any pen-and-paper test.
5. Evaluate the test items according to the guidelines.

Lessons
1. The Test Blueprint
2. Designing Selected-Response Items
   Binary-choice items
   Multiple-choice items
   Matching items
3. Designing Constructed-Response Items
   Short-answer items
   Essay items
Lesson 1: The Test Blueprint
As we have mentioned in the previous chapters, teaching involves decision-making. This chapter discusses another aspect of teaching that requires intelligent and informed decisions from teachers: we wish to provide teachers with the basic scaffolds for developing pen-and-paper tests, in the hope of meeting the objectives listed at the start of this chapter.

As the term suggests, teacher-made tests are assessment tools, particularly pen-and-paper types, that teachers develop, use, and assess based on the learning targets set for the task or domain to be tested. A teacher-made test draws on content, knowledge, and process domains. The content domain is the subject area from which items are drawn. In its general sense, it is the subject or course (i.e., science, math, English, etc.) in which testing is to be made; more specifically, it covers particular topics under a subject area (i.e., the laws of motion, addition of fractions, or the use of a singular verb in a sentence). The knowledge domain involves the dimensions or types of knowledge to be tested; in the revised taxonomy, this domain covers the factual, conceptual, procedural, and metacognitive knowledge types. As for the process domain, any pen-and-paper test involves the mental processes that students use to engage the task in the test. In the revised taxonomy, mental processes such as remembering, understanding, applying, and so on, are the processes that may be tested.
A. Call to mind those alternative taxonomic tools in Chapter 2. B. Identify the knowledge domain and the process domain of each alternative taxonomy. C. Monitor your understanding by clearly accounting for what you already know about these domains, or by figuring out those areas that you do not understand yet. D. Formulate questions regarding what you wish to clarify about the matter that you do not clearly understand. E. When appropriate, raise your questions in class or discuss with your classmates.
To make sure that you have these domains accounted for in your assessment design, engage yourself in making a table of specification, one that will allow you to explicitly indicate what content to cover in your test, what knowledge dimensions to focus on, and what cognitive processes to pay attention to. The Table of Specification is a matrix where the rows consist of the specific topics or skills (content or skill areas) and the columns are the learning behaviors or competencies that we desire to measure. Although we can also add more elements to the matrix, such as Test Placement, Equivalent Points, or Percent values of items, the conventional prototype table of specification may look like this:
                                                      Cognitive Processes
Content (or Skill) Areas                          Knowledge  Application  Analysis   TOTAL
1. Translation from words to mathematical symbols     1           2           2        5
2. Forming the Linear Equation                         1           3           2        6
3. Solving the Linear Equation                         -           3           1        4
4. Checking the Answer                                 -           3           2        5
TOTAL                                                  2          11           7       20

Note: Each cell gives the number of items planned for a content area at a given cognitive process (e.g., the 3 items for solving the linear equation that measure Application). The column totals give the number of items measuring each process (e.g., 2 items measuring Knowledge), and the grand total (20) is the total number of test items.
As you can see in the above table of specification, only three cognitive processes are indicated. This means that if you use the old Bloom's taxonomy of behavioral objectives, include only those levels that you wish to measure in the test, although it is recommended that more than a single process be measured in a test, depending, of course, on your purpose of testing.

As a test blueprint, the table of specification ensures that the teacher sees all the essential details of testing and measuring student learning. It assures the teacher that the content areas (or skill areas) and the levels of behavior in which learning is expected to be anchored are measured. The test's degree of difficulty may also be seen in the table of specification. When the distribution of test items is concentrated in the higher-order cognitive behaviors (analysis, synthesis, evaluation), the test's difficulty level is higher than when the items are concentrated in the lower-order cognitive behaviors (knowledge, comprehension, application).

As you have learned in Chapter 2 of this book, there are many taxonomic tools that may be used in our instructional planning. The taxonomic tool for planning the test should be consistent with the taxonomy of learning objectives used in the overall instructional plan. Understandably, designing the table of specification using any taxonomic tool will require some of our time, effort, and other personal and motivational resources. Before we are tempted to develop pen-and-paper test items without first preparing our table of specification, and run the risk of not actually evaluating our students on the basis of our learning intents, we need to brush up on our understanding of the instrumental function of the table of specification as a blueprint for our test, and convince ourselves that this is an important process in any test development activity. In developing the table of specification, we suggest that you do not yet think of the types of pen-and-paper test you wish to give. Instead, focus on planning your test in terms of your assessment domain.
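As an illustration only, and not a required step in the text, the table of specification can also be kept as a small data structure so that the row and column totals can be checked automatically before item writing begins. The Python sketch below uses the sample blueprint above; all names in it are illustrative.

```python
# A minimal sketch: the blueprint as a dictionary, with a check that the
# planned item counts add up to the intended test length.
tos = {
    "Translation from words to mathematical symbols": {"Knowledge": 1, "Application": 2, "Analysis": 2},
    "Forming the Linear Equation":                     {"Knowledge": 1, "Application": 3, "Analysis": 2},
    "Solving the Linear Equation":                     {"Knowledge": 0, "Application": 3, "Analysis": 1},
    "Checking the Answer":                             {"Knowledge": 0, "Application": 3, "Analysis": 2},
}
planned_total = 20  # the total number of items the test should contain

row_totals = {area: sum(cells.values()) for area, cells in tos.items()}
column_totals = {}
for cells in tos.values():
    for process, n in cells.items():
        column_totals[process] = column_totals.get(process, 0) + n

assert sum(row_totals.values()) == planned_total, "Item counts do not add up to the planned test length"

print("Items per content area:", row_totals)
print("Items per cognitive process:", column_totals)
```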
Lesson 2: Designing Selected-Response Items
When you are done with your test blueprint, you are ready to start developing your test items. For this phase of test development, you will need to decide what types or methods of pen-and-paper assessment you wish to design. To aid you in this process, we will now discuss some of the common pen-and-paper types of test and the basic guidelines in formulating items for each type. In deciding on the assessment method to use for a pen-and-paper test, you choose whether the selected-response or the constructed-response types would be appropriate for your blueprint. Selected-response tests use items that require the test takers to respond by choosing an option from a list of alternatives. Common types of selected-response tests are binary-choice, multiple-choice, and matching tests.
Binary-choice Items
The binary-choice test offers students the opportunity to choose between two options for an answer. The items must be responded to by choosing one of two categorically distinct alternatives. The true-false test is an example of this type of selected-response test. This type of test typically contains short statements that represent less complex propositions and is therefore efficient in assessing certain levels of students' learning in a reasonably short period of testing time. In addition, a binary-choice test may cover a wider content area in a brief assessment session (Popham, 2005). To assist you in developing binary-choice items, here are some guidelines with a brief description of each. These guidelines may not capture everything that you need to be mindful of in developing teacher-made tests; they are just the basics of what you need to know. It is important that you also explore other aspects of test development, including the context in which the test is to be used, among others.
Make the instructions explicit

A basic requirement of any pen-and-paper test is that the instructions indicate the task that students need to do and the credit they can earn for every correct answer. However, there is one more thing you need to indicate in your instructions for a binary-choice test: the reference of validity or reference of truth. When you ask your students to judge whether a statement is true or false, correct or erroneous, or valid or invalid, you need to state the reference of truth or correctness of a response. If the reference is a reading article, textbook, teacher's lecture, class discussion, or resource person, state it in your instructions. This will help students think contextually and stay on track. It will also help you cluster your items according to a specific domain or context, and it can minimize the problem of conflicting information, such as when one resource material says one thing and a person (maybe your student's parent or another teacher) says otherwise. For items that vary in context and reference of truth, state the reference in the item itself. For example, if the item is drawn from a person's opinion, such as the principal's speech or a guest speaker's ideas, it is
important that you attribute the opinion to its source. Lastly, although not a must, it might be nice to use "please" and "thank you" in our test instructions.

State the item as either definitely true or false

Statements must be categorically true or false, and not conditionally so. An item should clearly communicate the quality of the idea or concept as to whether it is true, correct, and valid or false, erroneous, and invalid. Make sure that it clearly corresponds to the reference of validity and that the context is explicit. For the quality to be categorical, it must invite only a judgment of contradictories, not contraries. For example, white or not white implies a contradiction because one idea is a denial of the other. To say black or white indicates opposing ideas that imply values between them, such as gray. A good item is one that implies only contradictory, mutually exclusive qualities, that is, either true or false, and it does not need further qualification in order to make it true or false.
Keep the statements short, simple, but comprehensible

In formulating binary-choice items, it is wise to consider brevity in the statement. Good binary-choice items are concisely written so that they present the ideas clearly but avoid extraneous material. Making the statements too long is risky in that it might unintentionally introduce clues that make your statement obviously true or false. There is actually no clear-cut rule for brevity; it is usually left to the teacher's judgment. In preparing the whole binary-choice test, it is also important that all the items or statements maintain relatively the same length. For a statement to be comprehensible, it must make clear sense of the ideas or concepts in focus, which is usually lost when a teacher lifts a statement from a book and uses it as a test item.

Do away with tricks

Remember that the purpose of assessing our students' learning is based on the assessment objectives we set. Clearly, solving tricks is remote from, if not totally excluded in, our intents. Therefore, we need to avoid using tricks, such as double negatives in the statement or switched keys. The use of double-negative statements is a logical trickery because the "valence" of the statement is maintained, not altered; such statements are usually puzzling and will therefore take more time for students to understand. Switching keys is when you ask students to answer "false" if the statement is true, or "true" if the statement is false. This is obviously an unjustifiable trick. By all means, we have to avoid using any kind of trick not only in binary-choice tests but also in all other types and methods of assessment.

Get rid of those clues

Clues come in different forms. One of the common clues that can weaken the validity and reliability of our assessment comes from our use of certain words, such as those that denote universal quantity or definite degree (i.e., all, everyone, always, none, nobody, never, etc.).
Statements containing these words are usually false because it is almost always wrong to claim that one instance applies to all sorts of things. Other verbal clues may come from the use of terms that denote indefinite degree (i.e., some, few, long time, many years, regularly, frequently, etc.). These words do not indicate a definite quantity or degree and thus violate the earlier rule on stating items as definitely true or false. Other clues may come from the way statements are arranged according to the key, such as alternating true and false items, or any other way of placing the items in a systematic and predictable order. This should be avoided because once students notice the pattern, they are not likely to read the items anymore; instead, they respond to all items mindlessly yet obtain high scores.

Basic to test development is mindful tracking of our purpose. Binary-choice items can be a useful tool for assessing learning intents that are drawn from various types of knowledge but involve only simpler cognitive processes. In this test, students only recall their understanding of the subject matter covered in the assessment domain; they do not manipulate this knowledge by using more complex, deeper cognitive strategies and processes. Another important point to consider in deciding whether to use the binary-choice test is its degree of difficulty. Because this type of test offers only two options, the chance that a student chooses the correct option is 50%, with the remainder being the chance of choosing the wrong option. This 50-50 probability of selecting the correct answer is problematic because the chance of answering the question correctly is high even if the student is not quite sure of his understanding. One way of reducing the likelihood of guessing the right option, suggested by Popham (2004), is to include more items: even if students are successful in their guesswork on a 10-item binary-choice test, it is very unlikely that they can maintain this success with, let us say, a 30-item test.
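To make the arithmetic behind Popham's suggestion concrete, the short Python sketch below, our own illustration, computes the probability of reaching a given score on a binary-choice test by blind guessing alone, using the binomial distribution with a 0.5 chance of guessing each item correctly.

```python
from math import comb

def prob_passing_by_guessing(n_items: int, passing_score: int, p_guess: float = 0.5) -> float:
    """Probability of getting at least `passing_score` items right by guessing alone."""
    return sum(comb(n_items, k) * p_guess**k * (1 - p_guess)**(n_items - k)
               for k in range(passing_score, n_items + 1))

# Chance of scoring at least 75% correct by guessing on a 10-item vs. a 30-item test
print(prob_passing_by_guessing(10, 8))   # about 0.055
print(prob_passing_by_guessing(30, 23))  # about 0.0026
```

The chance of "passing" by guessing drops from roughly 5.5% on the 10-item test to about 0.3% on the 30-item test, which is the point of lengthening a binary-choice test.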
Think of a subject matter in your area of specialization, something that you have deep and wide knowledge about. Think of a competency that can be tested by using a binary-choice type of assessment. Do this by formulating a statement of learning intent. Convince yourself as to why the binary-choice test can be used to test the competency.
Multiple-choice Items
The multiple-choice test is another selected-response type where students respond to every item by choosing one option from a set of three to five alternatives. The item begins with a stem followed by the options or alternatives. This type of pen-and-paper test has been widely
used in national achievement tests and other high-stakes assessments, such as the professional board examinations. Perhaps the reason is that the multiple-choice test is capable of measuring a range of knowledge and cognitive skills, obviously more than what other types of objective tests can do. A multiple-choice test may come in two types. The correct-answer type is one whose items pose specific problems with only one correct answer in the list of alternatives. For instance, if a stem is followed by four alternatives, only one of them is correct (the keyed answer), and the other three are incorrect; in this type of multiple-choice test, all items should be designed in this fashion. The other is the best-answer type, where the stem establishes a problem to be answered by choosing the one best option. Understandably, the other options may be acceptable but are not the best alternatives for answering the problem posed in the stem. In this type of multiple-choice test, only one option is the best answer (the keyed answer), and the others may all be conditionally acceptable, or some may be acceptable while others are totally incorrect. To guide you in formulating good multiple-choice items, here are some fundamental guidelines that will be helpful in going through the process.
Make the instructions explicit

When giving a multiple-choice test, the instructions must indicate the content area or context, the way students respond to every item, and the scoring. If you are using the correct-answer type, it is helpful to the students if your instructions state that they "choose the correct answer." Common sense should tell us not to use this expression when our multiple-choice test is of the best-answer type; "choose the best answer" would be more appropriate. Lastly, you may want to say "please" and "thank you."
Formulate a problem

As mentioned above, every item in a multiple-choice test has a stem and a set of alternatives. The stem should clearly formulate a problem. This is to compel students to respond to it by choosing the one option that correctly answers the problem or best addresses it. There are two ways of posing a problem in the stem of a multiple-choice item. One way is by formulating a question or an interrogative statement. The stem "In what year did the first EDSA revolution happen?" poses a problem to be answered more clearly than "The year when the first EDSA revolution happened." The other way to pose a problem in the stem is by formulating an incomplete sentence where one of the options correctly or best completes it. It may be phrased as "The first EDSA revolution happened in the year" followed by the list of alternatives. As you will also learn about completion types in the subsequent section of this chapter, when you use the incomplete sentence format to pose a problem in the stem of a multiple-choice item, always remove a keyword at the end of the statement, or at least near the end. If the removed keyword is at the very end of the statement, do not end the stem with any punctuation mark or a blank space. If the missing keyword is near the end of the statement but not the last word, replace the keyword you removed with an underlined blank space, and end your statement with an appropriate punctuation mark.
State the stem in positive form

Ask yourself: How reasonable is it for you to state your item's stem in a negative form? How important is assessing students' ability to deal with "negatives" in your test? You will surely struggle to find a good answer that justifies the use of negative statements in your multiple-choice test. One of the common problems we encounter in a negatively phrased stem is the high chance of not spotting the word that carries the negation (e.g., not). Another is the difficulty of anchoring the negative item to the learning intent. In general, "which one is" will work more effectively in assessing students' learning than "which one is not." The rule of thumb is to avoid the use of negative statements unless there is a really compelling reason why you need to phrase your stem in a negative form. If the reason is compelling enough, you need to highlight the word that carries the negation, for example by capitalizing, underlining, boldfacing, or italicizing "not."
Include only useful alternatives

Remember that the set of alternatives following the stem is a list of options from which students pick their response. In any type of multiple-choice test, only one alternative is keyed, and the rest are distractors. The keyed alternative is ultimately useful because it is what we expect every student who learned the subject matter to choose. If the set of alternatives does not contain the expected answer, it is clearly a bad item. This problem is more dreadful in a correct-answer type than in a best-answer type; at least for the latter, the second-best alternative can stand as the key if the best answer is missing from the list. If the correct answer is missing from the list of options in a correct-answer type, then there is really no answer to the problem posed in the stem, and the item must be removed from the test. Even if the distractors are not the expected answers, they serve an important function in the multiple-choice test. As distractors, they should distract those students who did not learn the subject matter well enough, but not those who did. Therefore, these distractors should be plausible, that is, they should appear as if they are correct or best options. Plausible distractors work in a multiple-choice test by making students believe that these distractors are the correct or best answer even if they are actually not. An important consideration in dealing with the alternatives is maintaining a homogeneous grouping. For example, if a stem asks about the name of a particular inventor in science, all alternatives should be names of scientific inventors. As stated above, a multiple-choice item should have three to five alternatives. Choosing to include three, four, or five options depends on the grade level or year level of the class you are handling. We suggest that higher grade- or year-level students be given items with more than three options as this will increase the level of test difficulty and reduce the effects of guessing on your assessment of students' learning. In instances when you wish students to evaluate each option as to its plausibility, you may add the option "none of the above" as the fourth or fifth alternative. However, you have to use this alternative with caution. Use it only for the correct-answer type of multiple-choice test, only when you intend to increase the difficulty of an item, and when the
presence of this option will help you come up with a better inference about your students' learning. Let us say, for example, that you are testing your students' computational skills using multiple-choice items and you encourage mental computation as they deal with each item. If you give them only number options, they may just choose one option based on simple estimation, believing that one of them is the correct answer. Adding the "none of the above" option will encourage students to do mental computation to check each option's correctness, because they know it is possible that the correct answer is not in the list. Obviously, you cannot use this option in a best-answer type of multiple-choice test. The option "all of the above" should never be used at all, as it invites guessing that works in the students' favor. If your last option (4th or 5th) is "all of the above" and your students notice at least two options that are correct, they are likely to guess that "all of the above" is the correct option. Similarly, if they spot one incorrect option, they automatically disregard the "all of the above" option. When they do so, the item's difficulty is reduced. One instance in which teachers are tempted to use "none of the above" unreasonably, or to include the "all of the above" option even though it should not be used, is when they force themselves to maintain the same number of alternatives for all their multiple-choice items. In this case, they use these alternatives as "fillers" when they run out of options. To avoid this mistake, it is important to realize that, for classroom testing purposes, multiple-choice items do not have to come with the same number of options for all items. It is okay to have some items with four options while other items have five.
Scatter the positions of keyed answers

In formulating your multiple-choice items, spread the keyed answers across the different response positions (i.e., a, b, c, d, and e). Make sure the number of items whose keyed answer is "a" is proportional to the number of items keyed to each of the other response positions. Better yet, if you give a 20-item multiple-choice test with four options per item, key five items to each response position (25% of items per response position, or approximately so).

The good thing about the multiple-choice test is that it is capable of measuring skills higher than just recall or simple comprehension. If properly formulated, the test can measure higher-level thinking (Airasian, 2000). Also, the fact that every item in a multiple-choice test is followed by more than two response options gives it its reputation for a higher difficulty level, because the probability of guessing the keyed option by chance becomes smaller as you increase the number of options. Certainly, a 4-option item is more difficult than a 3-option item because the former gives only a 25% probability of guessing the correct option, which is lower than the 33% probability for a 3-option item. A 5-option item is clearly more difficult still.
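The two points above, the shrinking chance of a correct blind guess as options are added and the even scattering of keyed positions, can be illustrated with the short sketch below. It is our own example; the helper function and its parameters are hypothetical rather than part of any standard tool.

```python
import random

# Chance of a blind guess being correct for 3-, 4-, and 5-option items
for n_options in (3, 4, 5):
    print(f"{n_options} options: {1 / n_options:.0%} chance of a correct guess")

def balanced_key(n_items: int, positions: str = "abcd", seed: int = 7) -> list[str]:
    """Assign keyed positions so each letter is used about equally often, then shuffle the order."""
    base = [positions[i % len(positions)] for i in range(n_items)]
    rng = random.Random(seed)
    rng.shuffle(base)
    return base

print(balanced_key(20))  # e.g., five items keyed to each of a, b, c, d, in a scattered order
```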
A. Think of a specific subject matter in your field of specialization, one that you are very familiar with. B. Write a learning intent that can be measured using multiple-choice test. C. Formulate at least 5 correct-answer-type items and another 5 best-answer-type items. D. Check the quality of your output based on the guidelines discussed above. As you do this, monitor your learning as well as your confusions, doubts or questions. E. Raise questions in class.
Matching Items
Another common type of selected-response test is the matching type, which comes with two parallel lists (i.e., premises and responses), where students match the entries on one list with those on the other list. The first list consists of descriptions (words or phrases), each of which serves as the premise of a test item. Therefore, each premise is taken as a test item and must be numbered accordingly. Each premise is matched with the entries in the second (or response) list. There is only one and the same response list for all the premises in the first list. In developing good matching items, it is helpful to consider the following hints that will guide you in the process of designing your lists.
Make instructions explicit

In making your instructions for a matching test explicit, the context, task, and scoring must be clearly indicated. For the context, your instructions must introduce the description list as well as the response list. If, for example, your description list contains premises about scientific inventions, you must state in your instructions that the first list (or first column) is about scientific inventions. If your response list contains names of scientific inventors, you must also state in your instructions that the second list (or second column) contains names of scientific inventors. You may phrase it something like this: "In the first column are scientific inventions. The second column lists names of scientific inventors. Match the inventions with their inventors." Then indicate the scoring. Having said this, we suggest that your lists be labeled with headings accordingly. In the case of the above example, you may write the column heading as "Inventions" or "Column A: Inventions" for the first column or description list, and "Inventors" or "Column B: Inventors" for the second column or response list.
Maintain brevity and homogeneity of the lists

The list of premises or descriptions must be fairly short, that is, it should include only those items that go together as a group. For example, if your matching test covers the common laboratory operations in chemistry, choose only those that are relevant to your assessment domain. In doing this, you are also maintaining the homogeneity of your list. In matching tests, it is extremely important that entries in the description list are drawn from one relatively specific assessment domain. For example, never mix up common laboratory operations with measurements; instead, decide whether you will include only one of these. The same is true for your response list: include only those that belong to the assessment domain. Note that homogeneity in your lists is non-negotiable. Also, in writing good matching items, it is imperative that the descriptions are longer than the responses, not the other way around. After a student reads one of the descriptions, he or she reads all the options in the response list. If the description is longer than each of the options, at least, the
student only reads it once or twice. If the entries in the response list are long, it will take up more time for the student to read all options just to respond to one description or item.
Finally, include more options than descriptions. If your description list has 10 descriptions or items, make your response list 12 or a bit more. This strategy reduces the effect of response elimination, where the student disregards options already chosen to match other descriptions. For example, suppose a student has confidently responded to 8 out of 10 descriptions but finds the last 2 items difficult. With only 10 options, only 2 options remain to choose from, and therefore each remaining option has a 50% probability of being the correct one. If you include more than 10 responses, there will still be several options left for the last 2 descriptions, and the probability that each option is correct will be smaller than 50%. This will reduce the effect of guessing. Better yet, formulate your descriptions in such a way that some options may be used more than once. In this case, you maintain the plausibility of all options for every description.
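As a rough illustration of this point, not taken from the text itself, the sketch below compares the chance of guessing the last two matches correctly when the response list has exactly as many options as descriptions versus when it has a few extra options.

```python
def chance_of_guessing_last_two(total_options: int, used_options: int = 8) -> float:
    """Chance of getting the last two of ten matches right by pure guessing,
    after eliminating the options already used for the first eight descriptions."""
    remaining = total_options - used_options
    # Guess the 9th description (1 in `remaining`), then the 10th (1 in `remaining - 1`)
    return (1 / remaining) * (1 / (remaining - 1))

print(chance_of_guessing_last_two(10))  # 10 options: 1/2 * 1/1 = 0.5
print(chance_of_guessing_last_two(12))  # 12 options: 1/4 * 1/3 = about 0.083
```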
Keep the options plausible for each description

Because there is only one and the same list of options for all the descriptions, it is vital that you keep the options plausible for every description. This means that if you have ten descriptions and twelve options, one option is keyed for each description and the other eleven should be plausible distractors. Usually, if the rule on homogeneity is very well observed, it is relatively easy to maintain one list of plausible options for each description. In addition to this, never establish a systematic sequence of keyed responses, such as coding with a word like G-O-L-D-E-N-S-T-A-R, which means that the keyed response letter for the first description is "G" and the keyed response for the 10th description is "R." If this pattern is even partially detected by the students, such as G-O-L- _ -E- _ -S-T- _ -R, they immediately jump into guessing that the missing letters are D, N, and A, respectively (and guess them right).

Place the whole test on the same page of the test paper

After stating the instructions for a matching test, write the lists or columns below them and make sure all descriptions and options are written on the same page where the matching test is placed in the test paper. Never extend some items or options to the next page of the test paper because, if you do so, students will keep flipping between pages as they respond to your matching items. If you notice in your draft that some items already spill over to the next page, make some simple adjustments, like reducing the font size of your items, as long as they remain legible, or improve the efficiency of your test layout. If the problem still exists, shorten your list, or if there are other types of test in your test paper, consider switching the position of your matching test with another test.

The use of selected-response tests is effective for various types of learning intents and assessment contexts. With careful design, these tests can measure capabilities beyond the lower-order kinds, especially if the items are formulated to elicit students' higher levels of cognitive skills (Popham, 2004).
Lesson 3: Designing Constructed-Response Types
Another set of options for the types of pen-and-paper test to give is the constructed-response test. Unlike the selected-response types, the constructed-response test does not provide students with options for answers but rather requires students to produce and give a relevant answer to every test item. Drawing from its name, we understand that, in this type of test, students construct their response instead of just choosing it from a given list of alternatives. Constructed-response methods of assessment include certain types of pen-and-paper tests and performance-based assessments. In this chapter we focus our discussion only on constructed-response types of pen-and-paper test. Some of the common types of pen-and-paper constructed-response test are the short-answer and the essay.
Short-answer Items
As the name suggests, short-answer items allow students to provide short answers to the questions or descriptions. This type of constructed-response test calls for students to respond to a direct question, a specific description, or an incomplete sentence by supplying a word, a phrase, or a sentence. If a test contains direct questions, students are expected to answer each question by giving the word, symbol, number, phrase, or sentence that is being asked for. The same applies to items using specific descriptions of words, phrases, or sentences. Items composed of incomplete sentences ask students to complete every sentence by supplying the word or phrase that meaningfully completes it in terms of the assessment domain. In formulating the questions or descriptions that compose your test items, it is important to always think according to the name of this test type, so that you are mindful that the items should call for "short answers." Do not ask questions that require long answers; otherwise you are using short-answer items as essay items. If your assessment target calls for students to respond with longer statements or written discussions, it is preferable that you give essay items instead of short-answer items.
Make instructions explicit

Short-answer items usually have simple instructions. In fact, it is tempting to just expect that students understand how to go about the test using only their common sense. However, it is never safe to assume that every student understands what you want them to do with your test. Besides, it is always advisable that you give your students the necessary prompt before they respond to the test items. In short, you need to set clear instructions even for short-answer items, indicating the content area, the task, and the scoring. In directing students on the task you expect them to do, specify whether they should answer the question, indicate what is described, or
complete the sentences, depending on your item’s format. Lastly, remember to say “please” and “thank you” in your instructions.
Decide on the item’s format When you decide to use short-answer items, also decide if all your items should come in questions, descriptions, or incomplete sentences. Whichever you decide to use to format your items, maintain consistency of the format for all your short-answer items. For example, if you wish to give a 15-item short-answer test and expect that students supply short answers to your questions, have all your items of the test written in a direct question form. Never mix up direct question items with descriptions or incomplete sentences. One important criterion for choosing what format to use is the age of the student. For younger learners, it is usually preferable to use direct questions than descriptions or incomplete sentences. Once you already make up your mind as to the item format, walk through your way to formulating each item.
Structure the items accordingly

Because short-answer items call for a "short answer," as may be inferred from the name, always make sure you structure every item in a way that requires only a brief answer (i.e., a word, a symbol, a number [or a set of numbers], a phrase, or a short sentence). This is achieved by formulating very clear, specific, explicit, and error-free statements in your items. A clear and specific question calls for a specific answer. If your description clearly and explicitly represents the object that is described, and you are sure that it refers to a specific word, symbol, or phrase, then your item is structured properly. If your items are incomplete sentences, structure every item so that the missing word or phrase is a keyword or a key idea. Ordinarily, an incomplete sentence has only one blank, which corresponds to one missing keyword. You may remove two keywords as long as doing so does not distort the key idea of the incomplete sentence, which should guide the students in figuring out the missing words. Never use more than two blanks. One important reason why we need to ensure that students supply only brief responses is that brief responses are easy to check objectively. We encounter a major problem related to scoring if students' responses are lengthy: with long responses, it is difficult to give accurate scores. Of course, we already know, as discussed in Chapter 3, that inaccurate scoring of students' responses undermines the reliability of our measures and reduces the validity of the inferences we make about our students' learning outcomes.

Provide the blanks in appropriate places

Blanks are spaces in the items where students supply their answers by writing a word, a symbol, a number, a phrase, or a sentence. If your items are all in a direct question format where each question begins with an item number, place the blanks on the left side of the item number. When you type the item, begin with the blank space, followed by the item number, then the question. This rule also applies to items using explicit descriptions. If you are using the incomplete sentence format for your items, place the blank near the end of the sentence. This means that you take out a keyword found near the end of the sentence so that it becomes
an incomplete sentence. Never take out a keyword from the beginning of a sentence. The reason for this is that you need to first establish the key idea of the sentence so that students immediately know what is missing right after one reading. If the blank space is near the beginning of the sentence, students will find it hard to grasp the key idea and will therefore have to read the sentence more than once in order to figure out the missing word. In all item formats, always maintain the same length of blanks in all your short-answer items.

The good thing about short-answer items is that students really have to produce a correct answer rather than merely select one from a set of given alternatives. Partial knowledge of the subject matter, which often works for selected-response items, will not be enough for students to give a correct response to every short-answer item. Although we generally recognize that these types of items are appropriate for measuring simple kinds of learning outcomes, they are capable of measuring various types of challenging outcomes if the items are carefully developed. However, it is not advisable to force yourself to use short-answer items to measure more complex and deeper levels of cognitive processing. It is always helpful to know other methods of assessment so that you have a wide range of options to draw from depending on your assessment purposes.
A. Think of a specific subject matter in your field of specialization, one that you are very familiar with. B. Write a learning intent that can be measured using short-answer test. C. Formulate at least 5 items using one of the suggested formats. D. Check the quality of your output based on the guidelines discussed above. As you do this, monitor your learning as well as your confusions, doubts or questions. E. Raise questions in class.
Essay Items
Relative to our learning intents, there are times when it is necessary that our students supply lengthy responses so that they can exhibit more complex cognitive processes. For some learning targets, a single word, a phrase, or a sentence is not enough to measure students' learning outcomes. For these targets, we need a constructed-response type of test that will allow students to adequately exhibit their learning through sufficient writing; hence, essay items work for these purposes.
Just like short-answer items, essay items call for students to produce rather than select answers from given alternatives. But unlike short-answer items, essay items call for more substantial, usually lengthier responses from students. Because the length and complexity of the response may vary, essay items are appropriate measures of higher-level cognitive skills. Following are some guidelines that will help you formulate good essay items.
Communicate the extensiveness of expected response

By reading the essay item, your students must know exactly how brief or extensive their responses should be. This is made possible by making your item clearly convey the degree of extensiveness you expect from their response. Extensiveness depends on the degree of complexity of your item. To determine the degree of complexity you desire to assess, you may design an essay item according to either of two types, depending, of course, on your assessment objective: the restricted-response and the extended-response item. If you wish to measure students' ability to understand, analyze, or apply certain concepts to new contexts while dealing with relatively simple dimensions of knowledge, and if the task requires only a relatively short time, the restricted-response type may be preferred. If, however, you wish to assess students' capability to evaluate or synthesize various aspects of knowledge, which will naturally require a longer time for their responses to be completed, the extended-response type is preferable. Notice that even at this phase of determining the degree of complexity of your essay item, it is vital that you clearly make a decision based on your learning intent. This phase is crucial because if you design an essay item that is of the extended type but give it to your students as if it were of the restricted type, your students' failure to meet the assessment standards set for the item may not be due to their level of learning, but rather because they needed more time to gather and process information before they could come up with responses that are relevant to your assessment standards. Your inference about students' learning becomes problematically unreliable and invalid. Your inference becomes equally problematic if you construct a truly restricted type of essay item but give it as if it were an extended-type essay item.
Prime and prompt students through the item

Unlike other types of pen-and-paper tests, an essay item already includes the context, the assessment task, and the assessment standards altogether. The statement of context provides a background of the subject matter in focus and primes the students' thinking about that subject matter. The prime helps students to be selectively attentive to the subject matter that is relevant to the assessment task of the essay item. Without it, students tend to grapple with understanding the subject matter that is embedded in the statement of the assessment task, and may find it difficult to stay in focus. The assessment task is what the students directly respond to in order to write an essay. Both the statement of context (or the prime) and the assessment task (or the prompt) are important in setting the students' attention to the subject matter and in making them think of a response that meets the assessment standards. Notice, for example, that if the item is phrased as "Compare and contrast the governance of Estrada and Arroyo," students first struggle to generate some ideas related to these two names, then think of the governance or political administrations of the two Philippine presidents in a general sense. This is because the item does not
have a prime. In this case, the item does not help the student stay clearly focused on what the item really intends to assess. It will be different if the item begins with a prime, phrased something like, "Our country has been run by a number of presidents already, and along with the change in political administration come changes in the agenda of reforms. Compare and contrast the economic reform agenda of the presidential administrations of Estrada and Arroyo." In this item, students are primed to think of the reform agenda of the two presidents, which makes it very probable that they stay focused on the context as they respond to the assessment task. This latter example is not yet a complete essay item, as it lacks other necessary elements, but it clearly shows how effectively you can prime and prompt students to respond appropriately to your test item. The item may be improved to become a full-blown essay item if you add other elements, such as the guide to the extensiveness of the desired response as well as the assessment standards.
Provide clear assessment standards

You might think that, if it has both the prime and the prompt, your item can already stand. This is not true. For an essay item to stand as a good one, it must also indicate a clear guide to the value of the item. The assessment standards inform the students about which specific aspects of their responses you will give merit to, and which aspects will earn more credit than others. If, for example, you give credit to their argument when they can provide evidence, then you need to categorically ask for it in your essay item. Similarly, if you give two or three essay items and you wish to give more credit to one item based on its complexity, you also need to indicate the item's value. This way, students know where to devote most of their time and effort, and can decide how much of these resources to invest in each item. One simple way of guiding students in terms of the item's value is to indicate the assessment weight you assign to the item in parentheses at the end of the item.
Do away with optional items

While reading this part you may be recalling a common experience of taking an essay test in which the teacher asked you to choose some, but not all, of the essay items to answer, and you tended to choose those items that were more convenient to your understanding and readiness. This practice of providing optionality in essay items, where students are made to answer fewer items than are presented, should be stopped. From what you may recall of your experience, it is obvious that, when students are free to choose only a few items to answer, they choose the items that are easy for them. As a consequence, each student will be choosing items that are "easy" for them, which leads to flawed inferences about students' learning because students' responses are marked under different standards and levels of complexity, depending on the items they chose to answer. One of the basic questions you will need to answer if you plan to do this is: What is the assurance that all your items have an equal level of complexity and that they measure exactly the same knowledge domains and cognitive processes? This question is extremely difficult to answer. This guideline, therefore, says that if, for example, you have 3 essay items for the test you are about to administer, have each of your students answer all 3 items.
Prepare a scoring rubric

Because an essay item calls for a relatively extensive response from the students, it is always necessary that you prepare a scoring rubric or guide prior to giving the test. The scoring scheme will help you pre-assess the validity and reliability of your item because it will allow you to identify the criteria as well as the key ideas you expect your students to give in response to the item. Your scoring rubric indicates the descriptions for scoring the quality of your students' responses to the essay item. It includes a set of standards that define what is expected in a learning situation, and important indicators of how students' responses to the task will be scored. Having said this, we ask you to choose the scoring approach that will best fit your assessment context. You have two options for this purpose: one is the holistic approach, the other is the analytic approach. The holistic approach allows you to focus on your students' overall response to an essay item. As you assess the response as a whole, this approach will guide you in terms of what dimensions of the learning outcome you pay attention to. For example, if your essay item intends to let students manifest their ability to argue with appropriate evidence and to explain in good, clear, and coherent language, you need to identify the dimensions that can capture those abilities in your assessment; hence you may have the following dimensions indicated in your holistic rubric: logic of the argument, relevance of evidence, communicative clarity, lexical choice, and mechanics (spelling, punctuation, etc.). These dimensions serve as your criteria for assessing students' responses. It is always appropriate that you indicate the dimensions to assess because these dimensions keep you in focus as you assign a score to each of your students' responses. And for you to be guided further in terms of how much score to give, each dimension must be assigned a corresponding point or set of points. If, for example, you wish to give a maximum of 6 points for the logic of the argument, indicate it in your holistic rubric so that it looks like the one in the box below.
Assessment Criteria                               Points
• Logic of the argument                             6
• Relevance of evidence                             4
• Communicative clarity                             3
• Lexical choice                                    2
• Mechanics (spelling, punctuation, etc.)           2
Another way of setting a guide for scoring is by assigning the same points to each criterion while also indicating the weight of each criterion based on its importance or value. The box below gives you a view of how the contents may look.
Assessment Criteria                               Points     Weight
• Logic of the argument                             5          40%
• Relevance of evidence                             5          35%
• Communicative clarity                             5          15%
• Lexical choice                                    5           5%
• Mechanics (spelling, punctuation, etc.)           5           5%
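As an illustration of how the points and weights in a box like the one above can be combined into a single mark, here is a minimal sketch; the earned scores in it are made up for the example.

```python
criteria = {
    # criterion: (points earned out of 5, weight)
    "Logic of the argument":   (4, 0.40),
    "Relevance of evidence":   (5, 0.35),
    "Communicative clarity":   (3, 0.15),
    "Lexical choice":          (5, 0.05),
    "Mechanics":               (4, 0.05),
}
max_points = 5

# Convert each criterion to a proportion of its maximum, then apply the weights
weighted_score = sum((earned / max_points) * weight for earned, weight in criteria.values())
print(f"Weighted essay score: {weighted_score:.0%}")  # 85% of the possible credit in this example
```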
When employing a holistic approach to scoring students' responses to an essay item, your decision as to how much score to give for each dimension is not guided by clear descriptions of the quality of the response; it usually rests on the teacher's judgment of the student's response in terms of each criterion. Because this approach does not require specific descriptions of the quality of response, it is easy and efficient to use. Its major weakness, however, is that it does not specify graded levels of performance quality, which invites the teacher's subjective judgment of students' responses. Acknowledging this major weakness, we recommend that you use the holistic approach only for restricted-response items where students are tested only on less complex skills requiring only a small amount of time. In contrast, the analytic approach allows for a more detailed and specific assessment scheme in that it indicates not only the dimensions or criteria but also specific descriptions of the different levels of performance quality per criterion. Supposing we take the sample criteria in the boxes above and use them as the criteria for our analytic rubric, we proceed by determining the levels of performance quality for each criterion. For the logic of the argument criterion, we set a scale of varying performance quality, perhaps ranging from Excellent to Poor, with other levels of quality in between. A simple way to do this is exemplified in the box below.
                                               Scale Indicators
Assessment Criteria                 Excellent   Satisfactory   Fair      Poor
                                    (8 pts)     (6 pts)        (4 pts)   (2 pts)
• Logic of the argument (40%)        ____        ____           ____      ____
• Relevance of evidence (35%)        ____        ____           ____      ____
• Communicative clarity (15%)        ____        ____           ____      ____
• Lexical choice (5%)                ____        ____           ____      ____
• Mechanics (5%)                     ____        ____           ____      ____
As indicated in the box above, there are four scale indicators, each representing a level of performance quality. In this example, the teacher puts a check on the space below the scale indicator that matches the quality of a student's response on every criterion. Scores are obtained by assigning points to every scale indicator. You may also specify the weight of each criterion depending on its degree of importance or value. A more calibrated analytic rubric not only indicates the scale levels for the teacher to check against the quality of students' responses to an essay item, but also describes the performance quality that falls under each level of the scale. Such a rubric describes what quality of performance will qualify as "excellent" and what type of performance is "poor." In this case, the analytic rubric should include descriptive statements for each scale level of each criterion. The table below shows an example of these descriptive statements applied to one of the criteria used in the above example, just to illustrate the point.
Criterion: Logic of the argument

Scale Indicators:
Excellent (7-8 points): Argument is clearly premised on valid assumptions and is logically sequenced.
Satisfactory (5-6 points): Argument is premised on valid assumptions with logical sequence only in some parts.
Fair (3-4 points): Some assumptions are weak and the argument is not completely logical.
Poor (1-2 points): Assumptions are generally too weak and the argument is problematic.
The good thing about using the analytic approach in scoring essay responses is that it helps you identify the specific level of students' performance, and it makes your assessment of students' learning outcomes more objective. It therefore increases the reliability of your measure and facilitates more valid and reliable inferences. It is also beneficial for the students because, through the analytic rubric, they can pinpoint the specific level of their performance and judge its quality by matching it against the descriptions. This type of rubric is best for essay items that measure more complex cognitive skills and more sophisticated knowledge dimensions. Whichever approach you wish to use for scoring your students' responses to your essay items, your decision will work if you are already clear on the following questions:
• What do you want your students to know and be able to do in the essay?
• How well do you want your students to know and be able to do it in the essay?
• How will you know when your students know it and do it well in the essay?

As you clarify your practice with reference to these questions, work your way toward constructing your scoring scheme using either approach, following the simple steps indicated below:
• Set an appropriate assessment target.
• Decide on the type of rubric to use.
• Identify the dimensions of performance that reflect the learning outcomes.
• Weigh the dimensions in proportion to their importance or value.
• Determine the points (or range of points) to be allocated to each level of performance.
• Show the rubric to colleagues and/or students before using it.
Some teachers are excited to use essay items because these items provide more opportunities to assess various types of learning outcomes, particularly those that involve higher-level cognitive processing. If carefully constructed, essay items can test students' ability to logically arrange concepts and analyze relationships between them; state assumptions or compare positions, evaluate them, and draw conclusions; formulate hypotheses and argue about the causal relationships of concepts; organize information or bring in evidence to support findings; and propose solutions to certain problems and evaluate the solutions in light of certain criteria. These and many more competencies can be measured using good essay items.
References
Airasian, P. W. (2000). Assessment in the classroom: A concise approach (2nd ed.). USA: McGraw-Hill.
Popham, W. J. (2005). Classroom assessment: What teachers need to know (4th ed.). Boston, MA: Allyn and Bacon.
Chapter 5
Constructing Non-Cognitive Measures

Objectives
1. Follow the procedures in constructing an affective measure.
2. Determine techniques in writing items for non-cognitive measures.
3. Use the appropriate response format for a scale constructed.
4. Give the uses of non-cognitive tests.

Lessons
1. What are non-cognitive constructs?
2. Steps in constructing non-cognitive measures
3. Response formats
Lesson 1 What are Non-Cognitive Constructs?
Human behavior is composed of multiple dimensions. Behaviors are the characteristic ways in which one thinks, feels, and acts when interacting with the environment. The previous sections emphasized techniques in assessing the cognitive domain as applied in creating teacher-made tests and analyzing the tests using either the Classical Test Theory or the Item Response Theory approach. This chapter guides you in the construction of measures in the affective domain. Anderson (1981) explained affective characteristics as "qualities which presents people's typical ways of feeling, or expressing their emotions" (p. 3). Sta. Maria and Magno (2007) found that affective characteristics run on two dimensions: intensity and direction. Intensity refers to the strength of the characteristic expressed. Intensity is reflected in high or low scores on affective measures such as aggression scales and motivation scales. Direction refers to the source of the affect, ranging from external (object) factors to person factors: for aggression, whether it comes from an external person or the self; for motivation, whether the cause is internal (ability) or material factors such as rewards.
Figure 1 presents the dimensions of affect, with intensity (high to low) on one axis and direction (person to object) on the other.
Affective characteristics are further classified into specific variables such as attitudes, beliefs, interests, values, and dispositions.

Attitude. Attitudes are learned predispositions to respond in a consistently favorable or unfavorable manner with respect to a given object (Meece et al., 1982). According to Meece et al. (1982), attitude is related to academic achievement since attitudes are learned over time by being in contact with the subject area. Information about the subject area is received through instruction, and consequently an attitude is developed. Moreover, if a person is favorably predisposed toward an academic course, that favorable disposition should lead to favorable behaviors such as achievement. According to Bandura (1977), attitude is often used in conjunction with the motivation to achieve, that is, how capable people judge themselves to be in performing a task successfully. Extensive evidence and documentation support the conclusion that attitude is a key factor in the extent to which people can bring about significant outcomes in their lives.

According to Overmier and Lawry (1979), one potential source of the drive to perform is the incentive value of the performance. Incentive theories of motivation suggest that people will perform an act when its performance is likely to result in some outcome they desire or that is important to them. For example, in anticipation of a situation in which a person is required to perform, that person may expend considerable effort in preparation because of the mediation provided by the desire to achieve success or avoid failure. That desire would be said to provide incentive motivation for the person to expend the effort. Accordingly, a test, as a stimulus situation, may be theorized to provoke students to study as a response, because of the mediation of the desire to achieve success or avoid failure on that test. Studying for the test, therefore, would be the result of incentive motivation.

In more objective terms, attitude may be said to connote response consistency with regard to certain categories of stimuli (Anastasi, 1990). In actual practice, attitude has been most frequently associated with social stimuli and with emotionally toned responses (Anastasi, 1990). Zimbardo and Leippe (1991) defined attitude as favorable or unfavorable evaluative reactions, whether exhibited in beliefs, feelings, or inclinations to act toward something. According to Myers (1996), attitude commonly refers to beliefs and feelings related to a person or event and the resulting behavior. Attitudes are an efficient way to size up the world: when individuals have to respond quickly to something, the feeling can guide the way they react. Psychologists agree that knowing people's attitudes helps predict their actions. Attitudes involve evaluations. An attitude is an association between an object and our evaluation of it. When this association is strong, the attitude becomes accessible, and encountering the object calls up the associated evaluation. Attitudes are acquired in ways that make them sometimes potent, sometimes not. An extensive series of experiments shows that when attitudes arise from experience, they are far more likely to endure and to guide actions. Attitudes predict actions when other influences are minimized, when the attitude is specific to the action, and when it is potent.

An example of an attitude scale is the "Attitude Towards Church Scale" by Thurstone and Chave (1929).
The scale measures the respondent's position on a continuum ranging from strong depreciation to strong appreciation of the church. It is composed of 45 items. A split-half reliability of .848 was obtained, which became .92 when corrected with the Spearman-Brown formula. Discriminant validity was examined by classifying participants according to their
religion; the Catholic group had the highest mean score. In another discriminant validity analysis, the participants who frequently attended church had the highest mean. Examples of items are:
1. I think the teaching of the church is altogether too superficial to have much social significance.
2. I feel the church services give me inspiration and help me to live up to my best during the following week.
3. I think the church keeps business and politics up to a higher standard than they would otherwise tend to maintain.
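As a quick illustration of the reliability figures reported above, the Spearman-Brown step-up formula can be applied directly to the split-half coefficient. A minimal sketch in Python; the .848 value is the one reported for the Thurstone and Chave scale, and the function is a straightforward rendering of the standard formula.

    def spearman_brown(split_half_r, factor=2):
        """Step up a split-half correlation to estimate full-test reliability.
        With factor=2 this is the classic correction for halving the test."""
        return (factor * split_half_r) / (1 + (factor - 1) * split_half_r)

    print(round(spearman_brown(0.848), 2))  # 0.92, matching the corrected value reported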
Beliefs. Beliefs are judgments and evaluations that we make about ourselves, about others, and about the world around us (Dilts, 1999). Beliefs are generalizations about things such as causality or the meaning of specific actions (Pajares, 1992). Examples of belief statements made in the educational environment are "A quiet classroom is conducive to learning," "Studying longer will improve a student's score on the test," and "Grades encourage students to work harder." Beliefs play an important part in how teachers organize knowledge and information and are essential in helping teachers adapt, understand, and make sense of themselves and their world (Schommer, 1990; Taylor, 2003; Taylor & Caldarelli, 2004). How and what teachers believe have a tremendous impact on their behavior in the classroom (Pajares, 1992; Richardson, 1996).

An example of a measure of belief is the Schommer Epistemological Questionnaire. Schommer (1990) developed this questionnaire to assess beliefs about knowledge and learning. A 21-item questionnaire was later developed to measure the epistemological beliefs of Asian students, adapted from Schommer's 63-item epistemological beliefs questionnaire. This Asian version of the Schommer Epistemological Questionnaire has been validated with a sample of 285 Filipino college students. The questionnaire was revised to have fewer items and simpler expressions of ideas to be more appropriate for Asian learners, and the number of statements was reduced to ensure that the participants would not be placed under any stress while completing it. Students are asked to rate their degree of agreement with each item on a 5-point Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree). The wording of items varied in voice from first person (I) to third person (students) in an effort to illustrate how the same belief could be queried from somewhat different perspectives. Items assessed four epistemological belief factors: beliefs about the ability to learn (ranging from fixed at birth to improvable), the structure of knowledge (ranging from isolated pieces to integrated concepts), the speed of learning (ranging from quick learning to gradual learning), and the stability of knowledge (ranging from certain knowledge to changing knowledge). Schommer (1990) has reported reliability and validity testing for the Epistemological Questionnaire; the instrument reliably measures adolescents' and adults' epistemological beliefs and yields a four-factor model of epistemology. Schommer (1993) reported a test-retest reliability of .74. Factor analyses were conducted on the mean for each subset, rather than at the item level.

Interests. Interest generally refers to an individual's strengths, needs, and preferences. Knowledge of one's interests strengthens understanding of career decision making and overall development. Strong (1955) defined interests as "a liking/disliking state of mind accompanying
the doing of an activity" (p. 138). Interests may be referred to as instrumental means to an end independent of perceived importance (Savickas, 1999). According to Holland's theory, there are six vocational interest types. Each type and its accompanying definition are presented below:

Realistic. People with Realistic interests like work activities that include practical, hands-on problems and solutions. They enjoy dealing with plants, animals, and real-world materials like wood, tools, and machinery. They enjoy outside work. Often people with Realistic interests do not like occupations that mainly involve doing paperwork or working closely with others.

Investigative. People with Investigative interests like work activities that have to do with ideas and thinking more than with physical activity. They like to search for facts and figure out problems mentally rather than to persuade or lead people.

Artistic. People with Artistic interests like work activities that deal with the artistic side of things, such as forms, designs, and patterns. They like self-expression in their work. They prefer settings where work can be done without following a clear set of rules.

Social. People with Social interests like work activities that assist others and promote learning and personal development. They prefer to communicate more than to work with objects, machines, or data. They like to teach, to give advice, to help, or otherwise be of service to people.

Enterprising. People with Enterprising interests like work activities that have to do with starting up and carrying out projects, especially business ventures. They like persuading and leading people and making decisions. They like taking risks for profit. These people prefer action rather than thought.

Conventional. People with Conventional interests like work activities that follow set procedures and routines. They prefer working with data and detail rather than with ideas. They prefer work in which there are precise standards rather than work in which you have to judge things by yourself. These people like working where the lines of authority are clear.
Examples of affective measures of interest are the Strong-Campbell Interest Inventory, the Strong Interest Inventory (SII), the Jackson Vocational Interest Inventory, the Guilford-Zimmerman Interest Inventory, and the Kuder Occupational Interest Survey. For a list of vocational interest tests, visit the site: http://www.yorku.ca/psycentr/tests/voc.html.

Values. Values refer to "the principles and fundamental convictions which act as general guides to behavior, the standards by which particular actions are judged to be good or desirable" (Halstead & Taylor, 2000, p. 169). Values are used as guiding principles to act and to justify actions accordingly (Knafo & Schwartz, 2003). Values are internalized and learned at an early stage in life. The school setting is one major avenue where people show how values are learned, respected, and upheld. A student who values education is provided with opportunities to behave in ways that allow him or her to do well in school and thus attain the values of hard work, perseverance, and diligence in academic-related tasks. Examples of values are diligence, respect for authority, emotional restraint, filial piety, and humility.
An example of a measure of values is the Asian Values Scale-Revised (AVS-R). The AVS-R is a 25-item instrument designed to measure an individual's adherence to Asian cultural values, the enculturation process, and the maintenance of one's native cultural values and beliefs (Kim & Hong, 2004). In particular, the AVS-R assesses dimensions of Asian cultural values that include "collectivism, conformity to norms, respect for authority figures, emotional restraint, filial piety, hierarchical family structure, and humility." The instrument uses a 4-point Likert scale, with 1 being strongly disagree and 4 being strongly agree. A high score indicates a high level of adherence to Asian values, while a low score indicates otherwise. The factors included are high expectations for achievement (e.g., "One need not minimize or depreciate one's own achievement" – reverse worded), hierarchical family structure (e.g., "One need not follow the role expectations of one's family" – reverse worded), respect for education (e.g., "Educational and career achievements do not need to be one's top priority" – reverse worded), perseverance and hard work (e.g., "One need not focus all energies on one's studies" – reverse worded), filial piety (e.g., "One should avoid bringing displeasure to one's ancestors"), respect for authority (e.g., "Younger persons should be able to confront their elders" – reverse worded), emotional restraint (e.g., "One should have sufficient inner resources to resolve emotional problems"), and, finally, collectivism (e.g., "One should think of one's group before oneself"). The AVS-R has a reliability of .80 and internal consistency coefficients of .81 and .82; a 2-week test-retest reliability of .83 was also obtained. Construct validity was established by identifying the values through a nationwide survey and focus group discussions, whereas concurrent validity was obtained through confirmatory factor analyses in which a factor structure comprising the AVS, the Individualism-Collectivism Scale (Triandis, 1995), and the Suinn-Lew Asian Self-Identity Acculturation Scale (SL-ASIA; Suinn, Rickard-Figueroa, Lew, & Vigil, 1987) was confirmed. Discriminant validity was evidenced in a low correlation between the AVS scores, which reflect values enculturation, and the SL-ASIA scores, which reflect predominantly behavioral acculturation.

Dispositions. The National Council for Accreditation of Teacher Education (2001) stated that dispositions are the values, commitments, and professional ethics that influence behaviors toward students, families, colleagues, and communities and affect student learning, motivation, and development, as well as the educator's own professional growth. Dispositions are guided by beliefs and attitudes related to values such as caring, fairness, honesty, responsibility, and social justice. Examples of dispositions include fairness, being democratic, empathy, enthusiasm, thoughtfulness, and respectfulness. Disposition measures have also been created for metacognition, self-regulation, self-efficacy, approaches to learning, and critical thinking.
Activity
Use the internet and give examples of affective scales under each of the following areas.
Attitudes
Beliefs
Interest
Values
Dispositions
Lesson 2 Steps in Constructing Non-Cognitive Measures
The steps involved in constructing an affective measure follow an organized sequence of procedures that, when properly done, results in a good scale. Constructing a scale is a research process in which the test developer poses research questions, hypothesizes about these questions, and gathers data to provide evidence on the reliability and validity of the scale.

Steps in Constructing a Scale

Decide what information should be sought
The construction of a scale begins with clearly identifying what construct needs to be measured. A scale is constructed when (1) no scales are available to measure the construct, (2) existing scales are foreign and not suitable for the stakeholders or the sample that will take the measure, (3) existing measures are not appropriate for the purpose of assessment, or (4) the test developer intends to explore the underlying factors of a construct and eventually confirm them. Once the purpose of developing a scale is clear, the test developer decides what type of questionnaire will be used: an attitude, belief, interest, value, or disposition measure. When the specific construct is clearly framed, it is very important that the test developer search for relevant literature from different studies involving the construct intended to be measured. What is needed in the literature review is the definition that the test developer wants to adopt and whether the construct has underlying factors. The definition and its underlying factors are the major basis for the test developer to write the items later on. A thorough literature review helps the test developer provide a conceptual framework as the basis for the construct being measured. The framework can come in the form of theories, principles, models, or a taxonomy that the test developer can use as the basis for hypothesizing the factors of the construct intended to be measured. Thorough knowledge of the literature about a construct also helps the researcher identify different perspectives on how the factors were arrived at and possible problems with the application of these factors across different groups. This will help the test developer justify the purpose of constructing the scale. When the construct and its underlying factors or subscales are established through a thorough literature review, a plan to make the scale needs to be designed. The plan starts with creating a Table of Specifications, which indicates the number of items for each subscale, the items phrased as positive and negative statements, and the response format.

Write the first draft of items
The test developer uses the definitions provided in the framework to write the preliminary items of the scale. Items are created for each subscale as guided by the conceptual definition. The number of items planned in the Table of Specifications is also considered. As much as possible, a large number of items are written to represent well the behavior being measured. To help the test developer write the items, a well-represented set of behaviors manifesting the construct should be covered. Qualitative studies reporting specific responses are very helpful in writing the items. An open-ended survey, a focus group discussion, or
interviews can be conducted in order to come up with statements that can be used to write items. When these methods are employed at the start of item writing, the questions generally seek specific behavioral manifestations of the subscales intended to be measured. An example is the study of Magno and Mamauag (2008), who created the "Best Engineering Traits" (BET) scale, which measures dispositions of engineering students in the areas of assertiveness, intellectual independence, practical inclination, and analytical interest. The items in this scale were based on an open-ended survey conducted among engineering students. The survey asked the following questions:
1. How do you show your expertise in different situations as an Engineering student?
2. How do you apply engineering theories in your everyday life?
3. What are the instances that an Engineer needs to be assertive?
4. In what ways can an Engineer be independent in his intellectual thinking?
5. What do you think are other personality traits or characteristics that would make you an effective engineer?
Examples of item statements generated from the survey responses are as follows:
1. I like watching repairmen when they are fixing something.
2. I gather necessary information before making decisions.
3. I hate to buy things in hardware stores.
4. I do not rely on mathematical solutions in arriving at conclusions.
Notice that the item statements begin with the pronoun "I." This indicates self-referencing for the respondents when they answer the items. Items 1 and 2 in the example are stated positively, while items 3 and 4 are stated negatively. Mixing positive and negative statements helps check that respondents are consistent in their answers within a subscale. For negative items, reverse scoring is done so that the responses are consistent with the positive items (a short scoring sketch is given after the guidelines below). The following are guidelines for writing good items. Good questionnaire items should:
1. Include vocabulary that is simple, direct, and familiar to all respondents
2. Be clear and specific
3. Not involve leading, loaded, or double-barreled questions
4. Be as short as possible
5. Include all conditional information prior to the key ideas
6. Be edited for readability
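As noted above, the reverse scoring of negatively worded items is a simple transformation: the reversed response equals the sum of the scale's minimum and maximum points minus the raw response. Here is a minimal sketch in Python, assuming a 4-point agreement scale like the one used for the BET; the set of negative item numbers is hypothetical and would come from the Table of Specifications.

    # Reverse-scoring negatively worded items on a 4-point scale
    # (1 = Strongly Disagree ... 4 = Strongly Agree).
    SCALE_MIN, SCALE_MAX = 1, 4
    NEGATIVE_ITEMS = {3, 4}   # e.g., "I hate to buy things in hardware stores."

    def score_item(item_number, raw_response):
        """Return the scored response, reversing it if the item is negatively worded."""
        if item_number in NEGATIVE_ITEMS:
            return SCALE_MIN + SCALE_MAX - raw_response
        return raw_response

    # A respondent who strongly agrees (4) with a negative item gets a score of 1.
    print(score_item(3, 4))  # 1
    print(score_item(1, 4))  # 4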
Examples of bad items:
I am satisfied with my wages and hours at the place where I work. (Double-barreled)
I am not in favor of Congress passing a law not allowing any employer to force any employee to retire at any age. (Double negative)
Most people favor the death penalty. What do you think? (Leading question)

Select a scaling technique
After writing the items, the test developer decides on the appropriate response format to be used in the scale. The most common response formats used in scales are the Likert scale (a measure of position on an opinion), the verbal frequency scale (a measure of a habit), the ordinal scale (an ordering of responses), and the linear numeric scale (judging a single dimension in an array). A detailed description of each scaling technique is presented in the next lesson.

Develop directions for responding
It is important that directions or instructions for the target respondents be created as early as when the items are written. Instructions must be clear and concise, and respondents should be informed how to answer. When you intend to have a separate answer sheet, make sure to inform the respondents about it in the instructions. Instructions should also cover how to answer (encircle, check, or shade) and how to change answers. Tell the respondents specifically what they need to do. The following are the instructions formulated for the BET:

This is an inventory to find out your suitability to further study Engineering. This can help guide you in your pursuit of an academic life. The inventory attempts to assess what interests and strategies you have learned or acquired over the years as a result of your study. In the inventory, you will find statements describing various interests and strategies one acquires through years of schooling and other learning experiences. Indicate the extent of your agreement or disagreement with each of these statements by using the following scale:

4 STRONGLY AGREE (SA)
3 AGREE (A)
2 DISAGREE (D)
1 STRONGLY DISAGREE (SD)
There are no right or wrong answers here. You either AGREE or DISAGREE with the statement. It is best if you do not think about each item too long --- just answer this test as quickly as you can, BUT please DO NOT OMIT answering any item. DO NOT WRITE OR MAKE ANY MARKS ON THE TEST BOOKLET. All answers are to be written on your answer sheet. Ensure that you have filled out your answer sheet properly and legibly for your name, school, date of birth, age, and gender.
Be sure also that you have correctly copied your test booklet number on the space provided in your answer sheet. Do not turn the page until you are told to do so. You have a total of 40 minutes to finish this whole test. Do not spend a lot of time on any one item. Answer all items as truthfully and honestly as you can.
Notice that the instructions start with the purpose of the test. This is done to dispel any misconceptions the respondents may have about the test. The instructions then describe the kind of items expected in the test, tell the respondent how to answer the items, and present the scaling technique. The respondents are reminded that there are no right or wrong answers to discourage faking good or bad on the test. They are also reminded not to make any marks on the test booklet, to use the answer sheet, to answer all items, and to observe the time allotment. As much as possible, detailed instructions are provided to avoid any problems.

Conduct a judgmental review of items
For achievement tests and teacher-made tests, this procedure is called content validation. For affective measures, however, it is difficult to conduct content validation because there is no defined content area for an affective variable; the definition and the behavioral manifestations from empirical reports stand in for the areas measured. Instead, the items are reviewed against the definition or framework provided: whether they are relevant, whether they stay within the confines of the theory or measure something else, whether they are applicable to the target respondents, and whether they need revision for clarity. Item review is conducted among experts in the content being measured. In the process of item review, the conceptual definitions of the constructs are provided together with the constructed items to guide the reviewer and ensure that the items stay within the framework. It is also necessary to arrange the items according to the subscale where each belongs so that the reviewer can easily evaluate the appropriateness of the items for that subscale. A suggested format for item review is shown below.

Practical Inclination – finding meaning about concepts and adapting to, shaping, and selecting environments covering a wide range of applications (Sternberg, 2004); application, putting into practice, using knowledge, implementing, proposing something new.

For each item, the reviewer marks Accept, Reject, or Revise and writes a suggested revision:
1. I like fixing broken things in the house.
2. I help out my father fix broken things in the house.
3. I help do the manual computation if there is no available calculator.
4. I help my friends organize their schedule if they do not know what to consider.
When giving items for review, the test developer writes a formal letter to the reviewer and indicates specifically how the review should be done. Indicate specifically whether you also want the grammar of the statements reviewed, because most reviewers will focus only on the content and how it is framed by the definition.
Reexamine and revise the questionnaire
After the items have been reviewed, expect several corrections and comments. Many comments indicate that the items will be better because they have been thoroughly studied and critiqued. In fact, many comments should be appreciated more than few, because it means the reviewers are offering better ways to fix and reconstruct your items. At this stage, it is necessary to consider the suggestions and comments provided by the reviewers. If there are things that are not clear to you, do not hesitate to go back and ask the reviewer once more. This will ensure that the items will be better when the final form of the scale is assembled.

Prepare a draft and gather preliminary pilot data
Preparing the items for pilot testing requires laying out the test for the respondents. The general format of the scale should make it as easy as possible to use. Each item can be identified with a number or a letter to facilitate scoring of responses later. The items should be structured for readability and for recording responses. Whenever possible, items with the same response format are placed together. In designing self-administered scales, it is suggested to make them visually appealing to increase the response rate. The items should be self-explanatory, and the respondents should be able to complete them in a short time. In ordering the items, remember that the first few questions set the tone for the rest of the items and determine how willingly and conscientiously respondents will work on subsequent questions. Before the actual pilot test, the items can be administered first to at least three respondents who belong to the target sample; observe which parts take them long to answer and whether the instructions are clearly followed. A verbal report can be gathered while the participants are answering the scale to clarify any difficulties that might arise in answering the items. In the actual pilot testing, the scale is administered to a large sample (e.g., N = 320). The ideal sample size is about three times the total number of items; if there are 100 items in the scale, the ideal sample size would be 300 or more. Having a large number of respondents makes the responses more representative of the characteristic being measured, and large samples tend to make the distribution of scores approach normality. In administering a scale, proper testing conditions should be maintained, such as the absence of distractions, appropriate room temperature, proper lighting, and other aspects that can otherwise cause large measurement errors.

Analyze pilot data
The responses to the scale should be recorded in a spreadsheet. The numerical responses are then analyzed. The analysis consists of determining whether the test is reliable and valid. Techniques for establishing validity and reliability are explained in Chapter 3. If the test developer intends to use parallel forms or test-retest reliability, then two time frames are set in the design of the testing. The analysis indicates whether the test as a whole and the individual items are valid and reliable. If principal components analysis is conducted, each item will have a corresponding factor loading, and the items that do not load highly on any factor are removed from
the item pool. Items whose removal would increase the Cronbach's alpha reliability of the test are also candidates for deletion. These techniques suggest removing certain items to improve the reliability and validity indices of the test, which means that a new form is produced based on the results of the item analysis. This is why a large pool of items is needed to begin with: not all items will be accepted in the final form of the test.

Revise the instrument
The instrument is then revised: items with low factor loadings are removed, and items whose removal increases Cronbach's alpha are also considered for deletion. In principal components analysis, even though the test developer has proposed a set of factors, these factors may not hold because the items may group differently. The test developer then thinks of new factor labels for the new grouping of items. These cases require the test developer to revise the items and come up with another revised form. This revised form is again administered to another large sample to collect further evidence that the scale is valid and reliable.

Gather final pilot data
For the final pilot data gathering, a large sample is again selected, about three times the number of items. The sample should have the same characteristics as the first pilot sample. The data gathered serve to establish the final estimates of the test's validity and reliability.

Conduct additional validity and reliability analyses
The validity and reliability are analyzed again using the new pilot data. The test developer wants to determine whether the same factors will still be formed and whether the test will still show the same level of reliability.

Edit the questionnaire and specify the procedures for its use
Items with low factor loadings are again removed, resulting in fewer items. A new form of the test with the reduced set of items is assembled; the remaining items have evidence of good factor loadings. The final form of the test can now be prepared.

Prepare the test manual
The test manual indicates the purpose of the test, the instructions for administering it, the procedure for scoring, and how to interpret the scores, including the norms. Establishing norms is fully discussed in the next chapter.

Activity
Think of a construct that you want to study for a research or for your thesis in the future. Follow the steps in test construction in developing the scale.
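To make the item analysis described in the pilot-testing steps above concrete, here is a minimal sketch in Python of computing Cronbach's alpha and alpha-if-item-deleted. The response data are invented purely for illustration; in practice the analysis would run on the full pilot sample, and the same logic extends to flagging items for removal.

    from statistics import variance

    def cronbach_alpha(responses):
        """responses: list of respondents, each a list of item scores (same length)."""
        k = len(responses[0])                 # number of items
        item_scores = list(zip(*responses))   # one tuple of scores per item
        item_vars = sum(variance(item) for item in item_scores)
        totals = [sum(person) for person in responses]
        return (k / (k - 1)) * (1 - item_vars / variance(totals))

    def alpha_if_deleted(responses, item_index):
        """Alpha recomputed with one item removed, to flag items worth dropping."""
        reduced = [[s for j, s in enumerate(person) if j != item_index]
                   for person in responses]
        return cronbach_alpha(reduced)

    # Invented pilot responses: 6 respondents x 4 items on a 4-point scale.
    # Item 4 is deliberately inconsistent with the rest, so alpha is low
    # (roughly 0.1) for the full set and rises above 0.9 when item 4 is deleted.
    pilot = [
        [4, 3, 4, 2],
        [3, 3, 3, 1],
        [2, 2, 2, 4],
        [4, 4, 4, 1],
        [1, 2, 1, 3],
        [3, 3, 3, 2],
    ]

    print(round(cronbach_alpha(pilot), 2))
    for i in range(4):
        print("alpha without item", i + 1, "=", round(alpha_if_deleted(pilot, i), 2))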
Lesson 3 Response Formats
This lesson presents the different scaling techniques used in tests, questionnaires, and inventories. The important assumption behind putting scales on tests and questionnaires is that they provide quantities that can be analyzed and interpreted statistically. One characteristic of research is that its variables should be measurable; through scales we are able to measure and quantify the concepts under study. Scales also enable the results to be analyzed mathematically to arrive at quantitative results. The scaling techniques discussed here can be categorized according to the levels of measurement: nominal, ordinal, interval, and ratio. In some references, the scaling techniques are discussed in conjunction with the levels of measurement; the levels are mentioned here only to show how they relate to scaling techniques. According to Bailey (1996), scaling is a process of assigning numbers or symbols to various levels of a particular concept that we wish to measure. Scales can be used with either open-ended or close-ended questions. For open-ended questions, scales refer to the criteria set in order to assess the information presented effectively and objectively. For close-ended questions, scales refer to response formats for certain concepts and statements. Varieties of these scales serving as response formats on tests and questionnaires are presented in this lesson. Before presenting the varieties of scaling techniques, the following questions should be kept in mind as a framework for the discussion: (1) What kind of question is this scale used for? (2) What general behavior does this scale measure? (3) What is the unique feature of this scaling technique? (4) What are the advantages and disadvantages of using this scale?

Classification and Types of Scales
The scaling techniques are classified into categories based on the type of question for which they are used: scaling techniques for multiple choice questions, conventional scale types used for measuring behavior on questionnaires, scale combinations, nonverbal scales for questions requiring illustrations, and social scaling for obtaining the profile of a group (Alreck & Settle, 1995).

MULTIPLE CHOICE QUESTIONS
Multiple choice questions are common and known for being simple and versatile. They can be used to measure mental ability and a variety of behavioral patterns. They are ideal for responses that fall into discrete categories. When the answers can be expressed as numbers, a direct question should be used and the number of units recorded.
1. Multiple Response Item
With this scaling technique the respondents can indicate one or more alternatives, and they are instructed to check any that apply within the question itself. In this case each alternative becomes a variable to be analyzed.

Please check any type of food that you regularly eat in the cafeteria.
___ Hamburger ___ Pasta ___ Soup ___ Fried chicken ___ French fries
2. Single-Response Item In this scaling technique one alternative is singled out from among several by the respondent. The item is still multiple choice but only one response is required. Single response items can be used only when (1) the choice criterion is clearly stated and (2) the criterion actually defines a single category. What kind of food do you most often eat in the cafeteria? (Check only one) ___ Hamburger ___ Pasta ___ Soup ___ Fried chicken ___ French fries
CONVENTIONAL SCALE TYPES
These types of scales are commonly used for surveys. Every information need or survey question can be scaled effectively with the use of one or more of these scales; the decision is largely a matter of choice among the conventional types.

3. Likert Scale
The Likert scale is used to obtain people's positions on certain issues or conclusions. It is a form of opinion or attitude measurement. In this scale the position on the issue or opinion is obtained from the respondents' degree of agreement or disagreement. The advantages of this scale include flexibility, economy, and ease of composition. The procedure is flexible because items can be only a few words long or can consist of several lines. The method is economical because one set of instructions and one scale can serve many items, and respondents can quickly and easily complete the items. The Likert scale also yields a summated value: besides the results of each item, a total score can be obtained from a set of items. The total value is an index of attitude toward the major issue as a whole.

Please pick a number from the scale to show how much you agree or disagree with each statement and jot it in the space to the left of the item.
Scale
1 Strongly agree
2 Agree
3 Neutral
4 Disagree
5 Strongly disagree
____ 1. I can get a good job even if my grades are bad.
____ 2. School is one of the most important things in my life.
____ 3. Many of the things we learn in class are useful.
____ 4. Most of what I learn in school will be useful when I get a job.
____ 5. School is not a waste of time.
____ 6. Dropping out of school would be a huge mistake for me.
____ 7. School is more important than most people think.
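The summated value mentioned above is simply the total (or mean) of a respondent's item scores. A minimal sketch in Python follows; the responses are hypothetical, and for simplicity all seven items are treated as keyed in the same direction. Because the example scale above codes 1 as "Strongly agree," the sketch reverses the coding first so that a higher total indicates stronger agreement.

    # Summating Likert responses for one respondent on the 7 items above.
    SCALE_MIN, SCALE_MAX = 1, 5

    responses = [2, 1, 2, 3, 1, 2, 2]   # hypothetical answers (1 = Strongly agree)

    # Recode so that a higher number means stronger agreement (1 -> 5, 5 -> 1).
    agreement_scores = [SCALE_MIN + SCALE_MAX - r for r in responses]

    total = sum(agreement_scores)
    mean = total / len(agreement_scores)
    print(total, round(mean, 2))   # summated value and average per item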
4. Verbal Frequency Scale
The verbal frequency scale uses five words to indicate how often an action has been taken. This scale is used to learn how frequently respondents perform some action or behavior. A straightforward question is recommended when the absolute number of times is appropriate and required; the verbal frequency scale is used when the researcher wants to know the proportion or percentage of activity, given the opportunity to perform it. One advantage of this scale is that respondents are not forced to recall precisely how many times they have behaved in a certain way. Others are the ease of assessment and response for those being surveyed, the ability to array activity levels across a five-category spectrum for data description, and the ease of making comparisons among subsamples or among different actions for the same sample of respondents. A disadvantage is that it provides only a gross measure of proportion.

Please pick a number from the scale to show how often you do each of the things listed below and jot it in the space at the left.
Scale
1 Always
2 Often
3 Sometimes
4 Rarely
5 Never
___ 1. I take brunch between breakfast and lunch.
___ 2. I take a light snack 4 hours after lunch.
___ 3. I take a midnight snack.
5. Ordinal Scale
The ordinal scale is also a multiple choice item, but the response alternatives do not stand at fixed distances from one another; they define an ordered sequence. The responses are ordinal because each category listed comes before the next one. The principal advantage of the ordinal scale is the ability to obtain a measure relative to some benchmark: the order is the major focus, not simply the chronology.
Ordinarily, when would you or someone in your family read a pocket book at home on a weekday? (Please check only one)
___ The first thing in the morning
___ A little while after awakening
___ Mid-morning
___ Just before lunch
___ Right after lunch
___ Mid-afternoon
___ Early evening before dinner
___ Right after dinner
___ Late evening
___ Usually don't read pocket books
6. Forced Ranking Scale
The forced ranking scale produces ordinal values, and the items are each ranked relative to one another. This scaling technique obtains not only the most preferred item but also the sequence of the remaining items. One of its main advantages is that the relativity or relationship measured is among the items themselves: the forced ranking scale indicates what the choices are likely to be from an unlimited number of alternatives. Its limitations are its failure to measure absolute standing and the intervals between items, and the number of entities or items that can practically be ranked, since respondents must first go through the entire list to identify their first choice.

Please rank the books listed below in the order of your preference. Jot the number 1 next to the one you prefer most, number 2 by your second choice, and so forth.
___ Harry Potter series
___ Lord of the Rings series
___ Twilight
___ The Lion, the Witch, and the Wardrobe
7. Paired Comparison Scale
This scale is used to measure simple, dichotomous choices between alternatives. The focus is almost exclusively on the evaluation of one entity relative to another, and only two items are ranked at a time. One major problem is the possible lack of transitivity when several pairs are ranked, and when the data are summated there are cases of "ties." These limitations are avoided by using ratings, rather than rankings, of items taken two at a time.
For each pair of study skills listed below, please put a check mark by the one you most prefer, if you had to choose between the two.
___ Note taking        ___ Memorizing
___ Memorizing         ___ Graphic organizer
___ Note taking        ___ Graphic organizer
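One simple way to turn paired-comparison data into an overall ordering, keeping in mind the transitivity and tie problems noted above, is to count how many times each alternative is preferred across all pairs. A minimal sketch in Python with hypothetical choices from one respondent:

    from collections import Counter

    # Each tuple is (pair presented, option chosen) for one respondent.
    choices = [
        (("Note taking", "Memorizing"), "Note taking"),
        (("Memorizing", "Graphic organizer"), "Graphic organizer"),
        (("Note taking", "Graphic organizer"), "Note taking"),
    ]

    wins = Counter(chosen for _, chosen in choices)
    # Rank options by number of wins; ties in win counts remain possible.
    options = {opt for pair, _ in choices for opt in pair}
    ranking = sorted(options, key=lambda opt: wins[opt], reverse=True)
    print(ranking)   # ['Note taking', 'Graphic organizer', 'Memorizing'] for the data above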
8. Comparative Scale
The comparative scale is appropriate when making comparisons between one object and one or more others. With this type of scale, one entity can be used as the standard or benchmark against which several others are judged. The advantage of this scale is that no absolute standard is presented or required; all evaluations are made on a comparative basis, relative to the standard or benchmark used. When no absolute standard exists, the comparative scale approach is applicable. Another advantage is its flexibility: the same two entities can be compared on several dimensions or criteria, and several different entities can be compared with the standard. The comparative scale is used when the research interest is in comparing one's own store, brand, institution, organization, candidate, or individual with competing ones. According to Alreck and Settle (1995), comparative scales are more powerful in several ways: they present an easy, simple task to the respondent, ensuring cooperation and accuracy; they provide interval data, rather than only the ordinal values that rankings do; they permit several things that have been compared to the same standard to be compared with one another; and they are economical in space and time.

Compared to the previous teacher, the new one is… (Check one space)
Very Superior          About the same          Very Inferior
      1          2           3           4           5
9. Linear, Numeric Scale
The linear, numeric scale is used when judging a single dimension arrayed on a scale with equal intervals. The scale is characterized by a simple, linear, numeric scale with its extremes labeled appropriately. This scaling technique is economical, since a single question, one set of instructions, and one rating scale apply to many individual items. It provides absolute measures of importance as well as relative measures, or rankings, if the responses to the various items are compared. The linear, numeric scale is less appropriate for measuring approximate frequency and is not applicable when direct comparison with a particular standard is required.
How important to you is each of the people in the school listed below? If you feel that a person is extremely important, pick a number from the far right side of the scale and jot it in the space beside the item. If you feel the person is extremely unimportant, pick a number from the far left, and if you feel the importance is between these extremes, pick a number from the middle of the scale to show your opinion.

Scale
Extremely Unimportant   1   2   3   4   5   Extremely Important

___ 1. Directress
___ 2. Principal
___ 3. Teachers
___ 4. Academic Coordinator
___ 5. Discipline officer
___ 6. Cashier
___ 7. Registrar
___ 8. Librarian
___ 9. Janitor
10. Semantic Differential Scale
With this scaling device, the image of a brand, store, political candidate, company, organization, institution, or idea can be measured, assessed, and compared with that of a similar topic. The areas investigated are called entities. In using this scale the researcher must first select a series of adjectives that might be used to describe the topic or object; the attributes should be relevant in the minds of the respondents. Once the adjectives have been identified, a polar opposite of each adjective must be determined. The advantage of this scale is its ability to portray images clearly and effectively: because several pairs of bipolar adjectives are used, the results provide a profile of the image of the topic being rated, and entire image profiles can be compared with one another. Another advantage is its ability to measure ideal images or attribute levels. The disadvantage lies in the difficulty of arriving at antonyms of the concepts for each item.

Please put a check mark in the space on the line below to show your opinion about the school guidance counselor.

Empathic       1   2   3   4   5   6   7   Apathetic
Approachable   1   2   3   4   5   6   7   Aloof
Understanding  1   2   3   4   5   6   7   Defensive
Unconditional  1   2   3   4   5   6   7   Conditional
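Because several bipolar pairs are rated, the semantic differential yields an image profile that can be averaged across respondents and compared between entities. A minimal sketch in Python; the two counselors, the pair labels, and the ratings are invented purely for illustration.

    # Invented ratings on two of the bipolar pairs above (1 = positive pole, 7 = negative pole).
    ratings = {
        "Counselor A": {"Empathic-Apathetic": [2, 1, 3], "Approachable-Aloof": [2, 2, 1]},
        "Counselor B": {"Empathic-Apathetic": [5, 4, 6], "Approachable-Aloof": [3, 4, 4]},
    }

    def profile(entity):
        """Mean rating per bipolar pair -- the entity's image profile."""
        return {pair: round(sum(vals) / len(vals), 2)
                for pair, vals in ratings[entity].items()}

    for name in ratings:
        print(name, profile(name))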
11. Adjective Checklist
This scale is used to see which descriptive adjectives or phrases apply to the topic or object of study. Compared with the semantic differential scale, the adjective checklist is a very straightforward method of obtaining information about how a topic is described and viewed. Its advantages are simplicity, directness, and economy. The adjectives listed can be varied, and short descriptive phrases can even be used. This is useful in exploratory research work. The disadvantage of this scale is the dichotomous data it yields: there is no indication of how much each item describes the topic.

Please put a check mark on the space in front of any word that describes your school.
___ Easy ___ Technical ___ Boring ___ Interesting
___ Safe ___ Exhausting ___ Difficult ___ Rewarding
12. Semantic Distance Scale
The semantic distance scale includes a linear, numeric scale below the instructions and above the descriptive adjectives or phrases. It requires the respondents to provide a rating of how much each item describes the topic. The data generated by the scale are interval: the distance from the item to the topic. This scale is also used to portray an image. Its advantage is that the adjectives or images can be specified without comparing them to their opposites, and the interval data it produces can be manipulated and statistically processed. Its disadvantage is its greater complexity; the respondents' task is more difficult to explain.

Please pick a number from the scale to show how well each word or phrase below describes your teacher and jot it in the space in front of each item.

Scale
Not at all   1   2   3   4   5   6   7   Perfectly

___ Intelligent          ___ Approachable
___ Strict               ___ Good in teaching
___ Respected            ___ Can control the class
13. Fixed Sum Scale
The fixed sum scale is used to determine what proportion of some resource or activity has been devoted to each of several possible choices or alternatives. The scale is most effective when used to measure actual behavior or action in the recent past. Ordinarily, about 10 categories are the maximum, but as few as 2 or 3 can be used. The number to which the data must total has to be stated very clearly. The major advantage of this scale is its simplicity and clarity: the instructions are easily understood and the respondent's task is ordinarily easy to complete. It is also important to add an inclusive alternative for "others."

Of the last 10 times you went to the library, how many times did you visit each of the following library sections?
___ Reference
___ Periodical
___ Circulation
___ Filipiniana
___ Other (What? __________________)
SCALE COMBINATIONS
Scale combinations list items together in the same format so that they share a common scale. This saves valuable questionnaire space, reduces the response task, and facilitates recording. The respondents mentally carry the same frame of reference and judgment criteria from one item to the next, so the data are closely comparable.

14. Multiple Rating List
This is a commonly used variation of the linear, numeric scale. The difference is that the multiple rating list has the labels of the scale extremes at the top, and the scale itself is listed beside each item. The advantage is that all the respondent has to do is circle a number, which is easier than writing it, and the responses form a visual pattern: the juxtaposition of the responses on a horizontal spectrum is a closer mapping of the way people actually think about the evaluations they are making.

Several colleges and universities are listed below. Please indicate how safe or risky their location is by circling a number beside each one. If you feel it's very safe, circle a number toward the left. If you feel it's very risky, circle one toward the right, and if you think it's some place in between, circle a number from the middle range that indicates your opinion.

                                   Extremely Safe           Extremely Risky
University of the Philippines        1   2   3   4   5   6   7
De La Salle University-Manila        1   2   3   4   5   6   7
Ateneo de Manila University          1   2   3   4   5   6   7
Mapua Institute of Technology        1   2   3   4   5   6   7
University of Santo Tomas            1   2   3   4   5   6   7
15. Multiple Rating Matrix
This is a condensed format using a combination of linear, numeric scale items. The difference lies in the way the items are listed in a matrix of rows with multiple columns. This scaling technique has two advantages. First, it saves questionnaire space while capturing many data points. Second, the objects and the characteristics being rated are all very close to one another, so the respondents can readily compare their evaluations from one rating object to another. The disadvantage is its complexity: the instructions are complex and the task is a bit more difficult.
The table below lists 3 universities, and several characteristics of universities along the left side. Please take one university at a time. Working down the column, pick a number from the scale indicating your evaluation of each characteristic and jot it in the space in the column below the university and to the right of the characteristic. Please fill in every space, giving your rating for each university on each characteristic.

Scale
Very Poor   1   2   3   4   5   6   Excellent

              University of the Philippines   De La Salle University-Manila   Ateneo de Manila University
Faculty       __________                      __________                      __________
Research      __________                      __________                      __________
Facilities    __________                      __________                      __________
Services      __________                      __________                      __________
16. Diagram Scale
The diagram scale is useful for measuring configurations of several things, where the spatial relationships convey part of the meaning.

Please list the ages of all those in your class in the spaces below. Jot the ages of the boys in the top circles and the ages of the girls in the bottom circles.

Boys:  ♂  ♂  ♂  ♂  ♂  ♂
Girls: ♀  ♀  ♀  ♀  ♀  ♀
NONVERBAL SCALES
Nonverbal scales take the form of pictures and graphs to obtain the data. They are useful for respondents who have limited ability to read or to understand numeric scales.

17. Picture Scale
The picture scale helps respondents who have difficulty recognizing letters, numbers, and other symbols by using familiar facial expressions and other illustrations. Some points to consider in creating a picture scale are: (1) It must be very easy for respondents to understand. (2) It should show something respondents have often seen. (3) It should represent the thing that's being measured. (4) It should be easy to draw or create.

18. Graphic Scale
The graphic scale shows, in ascending or descending order, the amount of whatever is being quantified. The graphic scale provides more useful measurement data because the
extremes visually represent none and all, or the total. Picture and graphic scales are most often used only for personal interview surveys because they are designed for a special need.

[Example picture scale: a row of faces for the question "Which of the faces indicates your feeling about your math course?" Example graphic scale: bars numbered 5 down to 1 for the questions "How much have you learned in your math course?" and "What is the level of your math proficiency?"]
SOCIAL SCALING
Social scaling is defined by Lazarsfeld (1958) as "properties of collectives which are obtained by performing some operation on data about the relations of each member to some or all of the other."

19. Sociometric Scaling
Sociometric measures are generally constructed by administering to all members of a group a questionnaire asking each about his or her relations with the other members of the group (Bailey, 1995). One way of analyzing sociometric data is the sociometric matrix, which lists the persons' names in both the rows and the columns and uses some code to indicate which person is chosen by the subject in response to the question.

20. Sociogram
The sociogram is a graphic representation of sociometric data. In a sociogram each individual is represented by a symbol; the symbols are then connected by arrows that describe the relationships among the individuals involved. Those chosen most often are referred to as stars, those not chosen by others are called isolates, and the small groups made up of individuals who choose one another are called cliques (Best & Kahn, 1990).

Scale Selection Criteria
Some scales are easily identified as potentially useful for a given information need or question, and others are clearly inappropriate. How do you create effective scales?
1. Keep it simple. Use the less complex scale. Even after identifying a scale, consider an easier and simpler one.
2. Respect the respondent. Select scales that make the task as quick and easy as possible for the respondents; this reduces non-response bias and improves accuracy.
3. Dimension the response. The dimensions respondents think in are not always common to everyone, so some commonality must be discovered. The dimensions must not be obscure or difficult, and they should parallel respondents' thinking.
4. Pick the denominations. Always use the denominations that are best for respondents. The data can later be converted to the denominations sought by information users.
5. Choose the range. Categories or scale increments should be about the same breadth as those ordinarily used by respondents.
6. Group only when required. Never put things into categories when they can easily be expressed in numeric terms.
7. Handle neutrality carefully. If respondents genuinely have no preference, they will resent the forced choice inherent in a scale with an even number of alternatives. If feelings are not especially strong, an odd number of scale points may result in fence-riding or piling up at the midpoint, even when some preference exists.
8. State instructions clearly. Even the least capable respondents must be able to understand. Use language that is typical of the respondents. Explain exactly what the respondent should do and the task sequence to follow. List the criteria by which they should judge, and use an example or a practice item if there is any doubt.
9. Always be flexible. The scaling techniques can be modified to fit the task and the respondents.
10. Pilot test the scales. Individual scales can be checked with a few typical respondents.
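A sociometric matrix of the kind described above can be tallied to find the stars, isolates, and mutual choices in a group. Here is a minimal sketch in Python; the pupils and their choices are hypothetical.

    # Hypothetical sociometric data: each pupil lists the classmates he or she chooses.
    choices = {
        "Ana":  ["Ben", "Cara"],
        "Ben":  ["Ana"],
        "Cara": ["Ana", "Ben"],
        "Dino": ["Ana"],
        "Ella": [],
    }

    # Count how many times each pupil is chosen by others.
    received = {name: 0 for name in choices}
    for chooser, chosen_list in choices.items():
        for chosen in chosen_list:
            received[chosen] += 1

    stars    = [n for n, c in received.items() if c == max(received.values())]
    isolates = [n for n, c in received.items() if c == 0]
    mutual   = [(a, b) for a in choices for b in choices[a]
                if a in choices.get(b, []) and a < b]

    print("Stars:", stars)        # most frequently chosen, e.g. ['Ana']
    print("Isolates:", isolates)  # chosen by no one, e.g. ['Dino', 'Ella']
    print("Mutual choices:", mutual)

The same tallies are what a sociogram displays graphically: the stars sit at the center of many arrows, the isolates stand apart, and mutual choices form the links within cliques.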
References
Anastasi, A. (1990). Psychological testing. New York: Macmillan.
Anderson, L. W. (1981). Assessing affective characteristics in the schools. Boston: Allyn and Bacon.
Alreck, P. L., & Settle, R. B. (1995). The survey research handbook (2nd ed.). Chicago: Irwin Professional Books.
Bailey, K. D. (1995). Methods of social research (4th ed.). New York: Macmillan.
Bandura, A. (1977). Self-efficacy: Toward a unifying theory of behavioral change. Psychological Review, 84, 191-215.
Best, J. W., & Kahn, J. V. (1995). Research in education (6th ed.). New Jersey: Prentice Hall.
Dilts, R. B. (1999). Sleight of mouth: The magic of conversational belief change. Capitola, CA: Meta Publications.
Halstead, J. M., & Taylor, M. J. (2000). Learning and teaching about values: A review of recent research. Cambridge Journal of Education, 30, 169-203.
Knafo, A., & Schwartz, S. H. (2003). Parenting and adolescents' accuracy in perceiving parental values. Child Development, 74(2), 595-611.
Lazarsfeld, P. F. (1958). Evidence and inference in social research. Daedalus, 8, 99-130.
National Council for Accreditation of Teacher Education. (2001). Professional standards for the accreditation of schools, colleges, and departments of education. Washington, DC: Author.
Meece, J., Parsons, J., Kaczala, C., Goff, S., & Futterman, R. (1982). Sex differences in math achievement: Toward a model of academic choice. Psychological Bulletin, 91, 324-348.
Overmier, J. B., & Lawry, J. A. (1979). Conditioning and the mediation of behavior. In G. H. Bower (Ed.), The psychology of learning and motivation (pp. 1-55). New York: Academic Press.
Pajares, M. F. (1992). Teachers' beliefs and educational research: Cleaning up a messy construct. Review of Educational Research, 62, 307-332.
Richardson, V. (1996). The role of attitude and beliefs in learning to teach. In J. Sikula, T. Buttery, & E. Guyton (Eds.), Handbook of research on teacher education (pp. 102-119). New York: Macmillan.
Savickas, M. L. (1999). The psychology of interests. In M. L. Savickas & A. R. Spokane (Eds.), Vocational interests: Meaning, measurement and counseling use (pp. 19-56). Palo Alto, CA: Davies-Black.
Schommer, M. (1990). Effects of beliefs about the nature of knowledge on comprehension. Journal of Educational Psychology, 82, 498-504.
Sta. Maria, M., & Magno, C. (2007). Dimensions of Filipino negative emotions. Paper presented at the 7th Conference of the Asian Association of Social Psychology, July 25-28, 2007, Kota Kinabalu, Sabah, Malaysia.
Strong, E. K. (1955). Vocational interests 18 years after college. Minneapolis: University of Minnesota Press.
Taylor, E. (2003). Making meaning of non-formal education in state and local parks: A park educator's perspective. In T. R. Ferro (Ed.), Proceedings of the 6th Pennsylvania Association of Adult Education Research Conference (pp. 125-131). Harrisburg, PA: Temple University.
Taylor, E., & Caldarelli, M. (2004). Teaching beliefs of non-formal environmental educators: A perspective from state and local parks in the United States. Environmental Education Research, 10, 451-469.
Zimbardo, P. G., & Leippe, M. R. (1991). The psychology of attitude change and social influence. New York: McGraw-Hill.
Chapter 6 Art of Questioning Chapter Objectives:
1. Develop a deep understanding of the functional role of questioning in enhancing students' learning;
2. Critically assess the circumstances under which certain types of questions may be more useful;
3. Frame questions that are appropriate for the target skills to be developed in the students.

Lessons:
1. Functions of Questioning
2. Types of Questions
3. Taxonomic Questioning
4. Practical Considerations in Questioning
Lesson 1 Functions of Questioning
Every time we get inside our classrooms and deal with our students in various teaching and learning circumstances, our ability to ask questions is brought to the fore. Being intricately embedded in our pedagogies and assessments, questioning is one of the most basic processes we deal with. But to make sure our questions are appropriate, we need to discuss the art of questioning. To begin with, we ask ourselves this fundamental question: "Why do we ask questions?" From our teaching methods and strategies to our assessments, questioning is inevitable. From transmissive to more constructivist approaches to teaching, asking questions is always a "mainstay" process. To answer this fundamental question, we need to first look into the function of questioning as it works within ourselves, and then in terms of how it works in the learning process in general. As you read this chapter, or even the previous chapters of this book, you effortlessly ask questions. Why is that? How important is that process in our understanding of the concepts we are trying to learn about? Whenever you ask a question, regardless of whether you just keep it in mind or express it verbally, you activate your senses and drive your attention to what you are currently processing. As you engage with a reading material, for example, and ask questions about what you are reading, you bring yourself to a deeper level of the learning experience where you become more "into the experience." Questioning brings you to a level of focused and engaged learning as you become particularly attentive to everything that takes place within and around you.
In the classrooms, we ask our students many questions without always being aware that the kinds of questions we ask make or break students' deep academic engagement. At this juncture, therefore, we emphasize the point that, as teachers, just asking questions is not enough to bring our students to the level of engagement we desire. What matters is the quality of the questions we ask them. The effects of questioning on our students differ depending on how "good" or "bad" our questioning is. From various studies, we now know that good questioning positively affects students' learning. Teachers' good questioning boosts students' classroom engagement because an atmosphere where good questions are tossed encourages them to push themselves further into a state of inquiry. If students feel that questions are interesting, sensible, and important, they are driven not only to "know more" but also to "think more." Good questioning encourages deep thinking and higher levels of cognitive processing, which can result in better learning of the subject matter in focus. One distinct mark of a classroom that employs good questioning is that students generally participate in a scholarly conversation. This happens because teachers' good questioning encourages the same good questioning from the students as they discuss with their teachers and with each other. On the contrary, bad questioning distorts the climate of inquiry and investigation. It undermines the students' motivation to "know more" and "think more" about the subject matter in focus. If, for example, a teacher's question makes a student feel stupid and incapable of answering, the whole process of questioning leads to a breakdown of the student's academic engagement. Indeed, it is important for a teacher to always think of his or her intentions for tossing questions in the class. Certainly, questions encouraged by a sound motive will work better than ill-motivated ones.
Think about questioning as a tool for increasing students' academic engagement. Explore what kinds of motives may encourage learning and what kinds may undermine students' learning. Write your thoughts down or bring them up in class for purposes of academic discussion.
Lesson 2 Types of Questions
Now that you have explored the kinds of motives that may encourage or undermine students' learning, it is helpful to focus on those motives that establish an atmosphere of inquiry in your classrooms. Focus on the intentions that allow questioning to be used as a tool for deep learning rather than those that embarrass students and discourage them from engaging with your lessons. However, because teaching is not a trial-and-error endeavor, motives alone might not be enough to guide our questioning so that it produces desirable effects on our students' learning. With a sound motive as the undercurrent of our questioning, we also need to know what types of questions to ask to engage our students.
Interpretive Question
This type of question calls for students' interpretation of the subject matter in focus. It usually asks students to provide missing information or ideas so that the whole concept is understood. An interpretive question assumes that, as students engage with the question, they monitor their understanding of the consequences of the information or ideas. In a class of primary graders, the teacher narrated a story about a boy in a dark-blue shirt who was lost in a crowd of people at a carnival one evening, and whose mother roved around for hours to find him. After narrating the story, one of the questions the teacher asked her pupils was, "If the boy wore a bright-colored shirt, what could change in the mother's effort in looking for him?" Questions that call for interpretation of a situation are a powerful tool for activating your students' analytical ability.
Inference Question
If the question you ask intends for students to go beyond the available facts or information and focus on identifying and examining the suggestive clues embedded in the complex network of facts or information, you may toss up an inference question. After a series of discussions on the Katipunan revolution in a Philippine history class, the teacher presented a picture that appeared to capture one perspective of the revolution. As he showed the picture, he asked, "What do you know by looking at this picture?" Having learned about the Katipunan revolution from its different angles, students were prompted to explore clues that may suggest certain perspectives of the event and to focus on a more salient clue that represented one perspective, such as, for instance, the common people's struggles during the revolution, the bravery of those who fought for the country, or the heroism of its leaders. Inference questions encourage students to engage in higher-order thinking and organize their knowledge rather than just randomly fire out bits and pieces of information.
Transfer Question
Questioning is one of the processes that affect transfer (Mayer, 2002). Transfer questions are tools for a specific type of inference in which students are asked to take their knowledge to new contexts, or to bring what they already know in one domain to another domain. Questions of this type bring students to a level of thinking that goes beyond using their knowledge only where it is used by default. For example, after a lesson on the literary works of Edgar Allan Poe, students were already familiar with Poe's literary style or approach. So that the teacher could assess the students' familiarity with and understanding of Poe's rhetorical "trademark," the teacher thought of a literary work from a different source, say, one from the long list of fairy tales. Then the teacher asked a transfer question: "Imagine that Edgar Allan Poe wrote his own version of the fairy tale 'Jack and the Beanstalk.' If you were making a critical review of his version of the story, what would you expect to see in its rhetorical quality?" This question prompts the students to bring their knowledge of Poe's rhetorical style to a new domain, that is, a different literary piece with a different rhetorical quality. It further encourages the students to draw out only the relevant knowledge that must be transferred and, therefore, helps them account for their learning of the subject matter.
Predictive Question
Asking predictive questions allows students to think in the context of a hypothesis. Through questions of this type, students infer what is likely to happen given the circumstances at hand. In other words, students are compelled to think about the "what if" of the phenomenon under study, mindful of the circumstances in focus. This type of question has long been used in the natural sciences, but it is certainly not for their exclusive use. In any subject area, we can let our students think scientifically. One way to do so is to let them engage with our predictive questions or to drive them to raise the same type of question in class. Predictive questions prompt the students to go beyond the default condition and infer what is likely to happen if some circumstances change. Here, students make use of higher levels of cognitive processing as they estimate probabilities.
Metacognitive Question
The types of questions discussed above all focus on students' cognitive processes. To bring students to the level of regulation over their own learning, we also need to ask metacognitive questions. Questions of this type allow students to think about how they are thinking and to learn about how they are learning your course lessons. Successful learners tend to show higher levels of awareness of how they think and learn. They show a clear understanding of how they struggle with academic tasks, comprehend written texts, solve problems, or make decisions. A metacognitive question invites students to know how they know and, thus, to become more aware of the processes that take place within them while they are thinking and learning. In a math class, for instance, the teacher not only asks the student to solve a word problem but also asks the student to describe how he or she was able to solve it.
A. Think of a subject matter within your area of specialization.
B. Make a rough plan as to how you might present the subject matter to an appropriate class (considering grade/year level).
C. Formulate questions that will likely encourage your students to engage with the subject matter. Try as much as possible to formulate questions of all the types discussed above.
D. Justify why those questions fall under their respective types.
Lesson 3 Taxonomic Questioning
After trying your best to formulate questions for every type discussed above, we now bring you to a discussion of planning your questioning in terms of taxonomic structure. Questions differ not only in type but also in the cognitive processes they involve, based on the taxonomy of learning targets you are using. For our students to benefit more from our questioning, it is necessary to plan our questioning taxonomically. In Chapter 2 of this book we learned about the different taxonomic tools for setting your learning intents or targets. These tools also serve as frameworks for planning and constructing your questions. Because questioning influences the quality of students' reasoning, the questions we ask our students to respond to must be pegged to certain levels of cognitive processes (Chinn, O'Donnell, & Jinks, 2000). For example, Bloom's taxonomy provides a way of formulating questions at various levels of thinking, as in the following:

Knowledge. Questions intended for knowledge should encourage recall of information, such as What is the capital city of…? or What facts does… tell?
Comprehension. Questions should call for understanding of concepts, such as What is the main idea of…? or Compare the…
Application. Questions at this level must encourage the use of information or a concept in a new context, like How would you use…? or Apply… to solve…
Analysis. If analysis is desired, where students are driven to think critically, the questions must focus on relationships of concepts and the logic of arguments, such as What is the difference between…? or How are… and… analogous?
Synthesis. To encourage synthesis, questioning must focus on students' original thinking and emergent knowledge, like Based on the information, what could be a good name for…? or What would… be like if…?
Evaluation. In questioning at the level of evaluation, students are prompted to judge ideas or concepts based on certain criteria. Questions may be like Why would you choose…? or What is the best strategy for…?
If you are to use the revised taxonomy, where you need to consider both the knowledge and the cognitive process dimensions, it is important that you first identify the knowledge dimension you wish to focus on and ask yourself, "What questions will be appropriate for every knowledge dimension?"
A. Think about your understanding of the definitional meaning of each type of knowledge (factual, conceptual, procedural, metacognitive). B. Based on your understanding of their definitional meanings, discuss what kinds of questions to ask for each of the types of knowledge. C. Discuss your ideas with your teacher and/or classmates. D. Synthesize your understanding after sharing your thoughts and listening to those of others.
Your clear understanding of the kinds of questions to ask based on the types of knowledge in focus helps you concentrate on any of those types of knowledge, depending on what is relevant to your teaching and assessment at any given time. After anchoring your questions in a particular type of knowledge, the next step is to frame your question so that it conveys the cognitive process needed for successful learning of the subject matter. If your focus is factual knowledge, you can toss up different questions that vary according to the cognitive processes. You can raise a question on factual knowledge that necessitates recall (remember) or synthesis (create), depending on your learning intents. You can navigate in the same way across the different levels of cognitive processing while anchoring on any other type of knowledge.
A. Based on a subject matter of your choice, formulate at least one factual knowledge question for each of the six levels of cognitive processes in the revised Bloom’s taxonomy (remember, understand, apply, analyze, evaluate, & create). B. Based on the same (or a different) subject matter, do the same for conceptual knowledge, then for procedural knowledge, then for metacognitive knowledge. C. Share your output in the class for discussion.
You may also try out the alternative taxonomic tools discussed in Chapter 2 and see how you can brush up on your art of questioning while staying on track toward your learning intents. When you wish to verify the validity of your questions, always go back to the conceptual description of the taxonomy. This is an important process as you build on your art of questioning so that, aside from its artistic sense, your questioning also becomes scientific insofar as the teaching-and-learning process is concerned.
Lesson 4 Practical Considerations in Questioning
We now give you some tips in questioning. These tips are add-on elements to the items that have already been discussed in the preceding section of this chapter.
Consider your students' interests
Before you ask your students a question, think about whether the question you are about to ask can arouse their interest in the subject matter. Think of a context that might interest your students and use that context as the backdrop of your question. Here, it is important that you know the "language" of your students. You should be able to anticipate their needs based on their developmental characteristics. You should also have an idea of their interests, such as the kinds or genres of music they enjoy listening to, the computer games they play, the kinds of sports they engage in, and so on. If you contextualize your question in these aspects, your students are likely to engage with it.
Hold on to your targets
As you did the "On-Task" exercises in the previous section of this chapter, you realized that the questions we toss up in our classrooms must be anchored on our learning intents. Airasian (2000) contends that the questions we ask communicate to our students which topics and processes are important. To stay on track, classroom questioning should be aligned with relevant instructional targets. Always remember to ask questions that not only allow students to orchestrate their cognitive prowess but also scaffold them to think at the level you desire. Make sure your questions are sensible as far as your learning targets are concerned. As much as possible, ask questions that are both relevant to your learning intents and interesting to your students.
Expect answers
Perhaps common sense tells you that whenever we ask a question, we expect a good answer. But in reality, some teachers pose questions that are so obscure that relevant answers can hardly be drawn from the students. If, with conscious effort, we expect relevant answers, we will always make sure that our questions are clear and that our students can answer them given their capacities. To do this, we first need to understand our students' developmental characteristics and their actual capacities and aptitudes. Knowing all these, we can toss up questions that match their capabilities. Finally, do not ask a question that is only part of your script and merely serves as a cue for what you will say next. Do not ask a question if you are not actually expecting your students to respond or if you are not really interested in picking up on their answers. Ask a question only if you truly intend to let your students respond to it.
Push your students farther
While it is important to ask questions that match students' capacities, it is also vital that the questions we ask challenge them to exhaust their cognitive resources, so that when they realize they are at the edge of their available knowledge, the question encourages them to think of possibilities. This gives students the opportunity to discover new realms of knowledge to be explored. It also builds their scientific habit of problematizing existing knowledge by treating it as tentative, so that more argument, more exploration, and more thinking become necessary resources for learning.
What do you now know about the art of questioning? Account for your understanding of the why's and how's of questioning. How important is the art of questioning in the assessment process? Reason out some benefits of developing the art of questioning for our assessment practices.
References:
Airasian, P. W. (2000). Assessment in the classroom: A concise approach (2nd ed.). USA: McGraw-Hill.
Chinn, C. A., O'Donnell, A. M., & Jinks, T. S. (2000). The structure of discourse in collaborative learning. Journal of Experimental Education, 69, 77-97.
Mayer, R. E. (2002). The promise of educational psychology, Volume II: Teaching for meaningful learning. NJ: Merrill Prentice Hall.
Chapter 7 Grading Students Chapter Objectives
1. Define grading in the educational setting of the Philippines.
2. Explain grading as a process.
3. Identify the different purposes of grading.
4. Explain the rationales for grading.
5. Reflect on the advantages and disadvantages of each grading rationale.
6. Reflect on when a rationale for grading is appropriate or not.
Lessons
1. Defining Grading 2. The Purposes of Grading a. Feedback b. Administrative Purposes c. Discovering Exceptionalities d. Motivation 3. Rationalizing Grades a. Absolute/ Fixed Standards b. Norms c. Individual Growth d. Achievement Relative to Ability e. Achievement Relative to Effort
Lesson 1 Defining Grading
An effective and efficient way of recording and reporting evaluation results is very important and useful to the persons concerned in the school setting. Hence, it is very important that students' progress is recorded and reported to the students themselves, their parents and teachers, school administrators, counselors, and employers as well, because this information is used to guide and motivate students to learn, to establish cooperation and collaboration between the home and the school, and to certify the students' qualifications for higher educational levels and for employment. In the educational setting, grades are used to record and report students' progress. Grades are essential in education because it is through them that students' learning can be assessed, quantified, and communicated. Every teacher needs to assign grades, which are based on assessment tools such as tests, quizzes, projects, and so on. Through these grades, the achievement of learning goals can be communicated to students, parents, teachers, administrators, and counselors. However, it should be remembered that grades are just one part of communicating student achievement; therefore, they must be used with additional feedback methods. According to Hogan (2007), grading implies (a) combining several assessments, (b) translating the result into some type of scale that has evaluative meaning, and (c) reporting the result in a formal way. From this definition, we can clearly say that grading is more than the quantitative values many may see it as; rather, it is a process. Grades are frequently misunderstood as scores. However, it must be clarified that scores make up the grades. Grades are what is written in students' report cards, which compile students' progress and achievement throughout a quarter, a trimester, a semester, or a school year. Grades are symbols used to convey the overall performance or achievement of a student, and they are frequently used for summative assessments of students. Take, for instance, two long exams, five quizzes, and ten homework assignments as requirements for a quarter in a particular subject area. To arrive at grades, a teacher must combine scores from the different sets of requirements and compute or translate them according to the assigned weights or percentages. Then, he or she should also design effective ways of communicating the grades to students, parents, administrators, and others who are concerned. Another term, though not as commonly used, that refers to this process is marking. Figure 1 shows a graphical summary of the grading process.
Figure 1. Summary of the grading process. Separate assessments (tests, quizzes, exams, projects, seatwork, worksheets, etc.) are COMBINED, depending on the assigned weights or percentages for each set of requirements; the combined scores are TRANSLATED into scales with evaluative meaning; and the grades are then REPORTED, that is, communicated to teachers, students, parents, administrators, and others.
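As a simple illustration of the combine-translate-report sequence summarized in Figure 1, the following Python sketch combines hypothetical scores using assumed weights and translates the result onto an illustrative descriptive scale. The requirement names, weights, and cutoffs are invented for demonstration; actual values depend on the school or teacher.

# Hypothetical percentage scores per set of requirements for one student.
scores = {"long_exams": 84.0, "quizzes": 78.5, "homework": 92.0}

# Assumed weights (they must sum to 1.0); actual weights depend on the school or teacher.
weights = {"long_exams": 0.40, "quizzes": 0.35, "homework": 0.25}

# COMBINE: weighted average of the separate assessments.
combined = sum(scores[k] * weights[k] for k in scores)

# TRANSLATE: map the combined score onto an illustrative descriptive scale.
def translate(score):
    if score >= 90: return "Outstanding"
    if score >= 85: return "Very good"
    if score >= 80: return "Good"
    if score >= 75: return "Satisfactory"
    return "Needs improvement"

# REPORT: communicate the grade.
print(f"Combined grade: {combined:.2f} ({translate(combined)})")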
Review Questions:
1. Why is grading considered a process?
2. Explain the different steps that make up grading.
3. Differentiate grades from scores.
4. How are grades essential in the educational context?
5. How can you use grades in different contexts?
Lesson 2 The Purposes of Grading
Grading is very important because it serves many purposes. In the educational setting, the primary purpose of grades is to communicate to parents and students the students' progress and performance. For teachers, students' grades can serve as an aid in assessing and reflecting on whether they were effective in implementing their instructional plans, whether their instructional goals and objectives were met, and so on. Administrators, on the other hand, can use students' grades for more general purposes than teachers: they can use grades to evaluate programs, to identify and assess areas that need to be improved, and to determine whether or not the curriculum goals and objectives of the school and the state have been met by the students through their institution. From these purposes, the uses of grading can be sorted into four major categories in the educational setting.

Feedback

Feedback plays an important role in the field of education because it provides information about the students' progress or lack of it. Feedback can be addressed to three distinct groups concerned in the teaching and learning process: parents, students, and teachers.

Feedback to Parents. Grades, especially as conveyed through report cards, provide critical feedback to parents about their children's progress in school. Aside from grades in the report cards, however, feedback can also be obtained from standardized tests and teachers' comments. Grades also help parents identify the strengths and weaknesses of their child. Depending on the format of the report card, parents may also receive feedback about their children's behavior, conduct, social skills, and other variables that might be included in the report card. From a general point of view, grades basically tell parents whether their child was able to perform satisfactorily. However, parents are not fully aware of the several separate assessments that students have taken and that make up their grades. Some of these assessments can be seen by parents, but not all. Therefore, students' grades, communicated formally to parents, give parents some assurance that they are seeing an overall summary of their children's performance in school.

Feedback to Students. Grades are one way of providing feedback to students, for it is through grades that students can recognize their strengths and weaknesses. Upon knowing these strengths and weaknesses, students can further develop their competencies and improve on their deficiencies. Grades also help students keep track of their progress and identify changes in their performance. Personally, I feel that the weight of this feedback is directly proportional to the age and year level of the students, in that grades are given more importance and meaning by a high school student than by a grade one student; however, I believe that the motivation grades can give is equal across different ages and year levels, in that grade one students (the young ones) are motivated to get high grades because of external rewards, while high school students (the older ones) are also motivated internally to improve their competencies and performance.
Feedback to Teachers. Grades serve as relevant information to teachers. It is through students' grades that they can (a) assess whether students were able to acquire the competencies they are supposed to have after instruction; (b) assess whether their instructional plan and its implementation were effective for the students; (c) reflect on their teaching strategies and methods; (d) reflect on possible positive and negative factors that might have affected the grades of students before, during, and after instruction; and (e) evaluate whether the program was indeed effective or not. Given these beneficial purposes of grades for teachers, we can say that teaching and learning is a two-way, interrelated process: it is not only the students who learn from the teacher, but the teacher who also learns from the students. Through students' grades, a teacher can undergo self-assessment and self-reflection in order to improve, recognize the relative effectiveness of varying instructional approaches across the different student groups observed, and be flexible and effective across different situations.

Administrative Purposes

Promotion and Retention. Grades can serve as one factor in determining whether a student will be promoted to the next level or not. Through a student's grades, it can be judged whether he or she has acquired the skills and competencies required for a certain level and has achieved the curriculum goals and objectives of the school and/or the state. In some schools, students' grades are a factor taken into consideration for eligibility to join extracurricular activities (performing and theater arts, varsity teams, cheering squads, etc.). Grades are also used, in some cases, to qualify a student to enter high school or college. Other policies may arise depending on a school's internal regulations. At times, failing marks may prohibit a student from being part of the varsity team, running for office, joining school organizations, and enjoying other privileges that students with passing grades get. In some colleges and universities, students who get passing grades are given priority in enrolling for the succeeding term over students who get failing grades.

Placement of Students and Awards. Through students' grades, placement can be done. Grades are factors considered in placing students according to their competencies and deficiencies, through which teaching can be more focused on developing the strengths and improving the weaknesses of students. For example, students with consistently high, average, or failing grades are grouped into their respective sections, wherein teachers can focus more on and emphasize the students' needs and demands to ensure a more productive teaching-learning process. Another, more domain-specific example would be grouping together students having the same competency in a certain subject. Through this strategy, students who have high ability in Science can further improve their knowledge and skills by receiving more complex and advanced topics and activities at a faster pace, while students having low ability in Science can receive simpler and more specific topics at a slower pace (while making sure they acquire the minimum competencies required for that level as prescribed by the state curriculum). Aside from the placement of students, grades are frequently used as a basis for academic awards.
Many, if not almost all, schools, colleges, and universities have honor rolls and dean's lists to recognize student achievement and performance. Grades also determine graduation awards for the overall achievement or excellence a student has garnered throughout his or her education, whether in a single subject or in the whole program taken.
Program Evaluation and Improvement. Through the grades of students taking a certain program, program effectiveness can be evaluated to some extent. Students' grades can be one factor used in determining whether the program was effective or not. Through the evaluation process, some factors that might have affected the program's effectiveness can be identified and minimized to further improve the program for future implementations.

Admission and Selection. Organizations external to the school also use grades as a reference for admission. When students transfer from one school to another, their grades play a crucial role in their admission. Most colleges and universities also use students' grades in their senior year of high school together with the grade they acquire on the entrance exam. However, grades from academic records and high-stakes tests are not the sole basis for admission; some colleges and universities also require recommendations from the school, teachers, and/or counselors about students' behavior and conduct. The use of grades is not limited to the educational context; grades are also used in employment for job selection purposes, and at times even by insurance companies that use grades as a basis for giving discounts on insurance rates.

Discovering Exceptionalities

Diagnosing Exceptionalities. Exceptionalities, disorders, and other conditions can also be detected with the help of grades. Although the term exceptionality is often stereotyped as something negative, it has its positive side, such as giftedness. Grades play an essential role in identifying these exceptionalities in that they are one factor considered in diagnosing a person. Through grades, intelligence, ability, achievement, aptitude, and other attributes that are quite difficult to measure can be interpreted, and proper interventions and treatments can be given when performance falls outside the established norms.

Counseling Purposes. It is partly through students' grades that teachers can decide to seek the assistance of a counselor. For instance, if a student who normally performs well in class suddenly incurs consecutive failing marks, then the teacher who observes this should think and reflect about the probable reasons that caused the student's performance to deteriorate and consult with the counselor about procedures she can follow to help the student. If the situation requires skills that are beyond the capacity of the teacher, then a referral should be made. Grades are also used in counseling when personality, ability, achievement, intelligence, and other standardized tests are administered.

Motivation

Motivation can be provided through grades; most students study hard in order to acquire good grades, and once they get good grades, they are motivated to study harder to get higher grades. Some students are motivated to get good grades because of their enthusiasm to join extracurricular activities, since some schools do not allow students with failing grades to join such activities. There are numerous ways in which grades serve as motivators for students across different contexts (family, social, personal, etc.). Thus, grades may serve as one of the many motivators for students.
Review Questions:
1. What are the different purposes of grades in the educational context? Explain each.
2. How do grades motivate you as a student?
3. How does feedback affect your performance in school?

Activity
1. Ask 10-15 grade 1 students on how grades motivate them. 2. Ask 10-15 high school or college students on how grades motivate them. 3. Tabulate the data you were able to gather and compare how grades motivate students at different levels. 4. Report your findings in class.
Lesson 3 Rationalizing Grades
Attainment of educational goals can be made easier if grades are accurate enough to convey a clear view of a student's performance and behavior. But the question is, what basis shall we use in assigning grades? Should we grade students in relation to (a) an absolute standard, (b) norms or the student's peer group, (c) the individual growth of each student, (d) the ability of each student, or (e) the effort of the students? Each of these approaches has its own advantages and disadvantages depending on the situation, the test takers, and the test being used. Teachers are expected to be skillful in determining when to use a certain approach and when not to.

Absolute Standards. Using absolute standards as the basis for grades means that students' achievement is related to a well-defined body of content or a set of skills. This basis is strongly associated with criterion-referenced measurement. An example of a well-defined body of content would be: "Students will be able to enumerate all the presidents of the Philippines and the corresponding years they were in service." An example of a set of skills would be something like: "Students will be able to assemble and disassemble the M16 in 5 minutes." However, this type of grading system becomes questionable when different teachers make and use their own standards for grading students' performance, since not all teachers have the same set of standards. Teachers' standards may vary across situations and are subjective, shaped by their own philosophies, competencies, and internal beliefs about assessing students and about education in general. Hence, this type of grading system is more appropriate when it is used in a standardized manner, such that a school administration or the state provides the standards and makes them uniform for all. An example of a test for which this type of grading is appropriate would be a standardized test, in which scales come from established norms and grades are obtained objectively.

Norms. In this type of grading system, a student's grade is related to the performance of all the others who took the same test; the grade one acquires is not based on a set of standards but on the performance of all the other individuals who took the same test. This means that students are evaluated based on what is reasonably expected of a representative group. To further explain this grading system, take for instance a group of 20 students: the student who got the most correct answers, regardless of whether he got 60% or 90% of the items correct, gets a high grade; and the student who got the fewest correct answers, regardless of whether he got 10% or 90% of the items correct, gets a low grade. It can be observed in this example that (a) 60% would warrant a high grade if it were the highest among all the scores of the participants who took the test; and (b) 90% could possibly be graded as low if it were the lowest among all the scores of the participants who took the test. Therefore, this grading system is not advisable when the test is to be administered to a heterogeneous group, because results would be extremely high or extremely low. Another problem with this approach is the lack of teacher competency in creating a norm for a certain test, which leads teachers to settle for absolute standards as the basis for grading students. Also, this approach requires a lot of time and effort in order to create a norm for a sample. This approach is also known as "grading on the curve."
Individual Growth. In this type of grading system, the level of improvement is regarded as more relevant than the level of achievement. However, this approach is somewhat difficult to implement because growth can only be observed when grades of students prior to instruction are related to grades after instruction; hence, pretests and posttests are used in this type of grading system. Another issue with this type of grading system is that it is very difficult to obtain gain or growth scores even with highly refined instruments. This system of grading disregards standards and the grades of others who took the test; rather, it uses the amount of progress a student was able to make to determine whether he or she will receive a high grade or a low grade. Notice that the initial status of students is required in this type of grading system.

Achievement Relative to Ability. Ability in this context refers to mental ability, intelligence, aptitude, or similar constructs. This type of grading is quite simple to understand: a student with high potential in a certain domain is expected to achieve at a superior level, while a student with limited ability should be rewarded with high grades if the student exceeds expectations.

Achievement Relative to Effort. Similarly, this type of grading system is relative to the effort that students exert: a student who works really diligently and responsibly, complying with all assignments and activities, doing extra-credit projects, and so on, should receive a high grade regardless of the quality of work he was able to produce. On the contrary, a student who produces good work will not be given a high grade if he was not able to exert enough effort. Notice that grades are based merely on effort and not on standards.

As mentioned earlier, each of these approaches to arriving at grades has its own strengths and limitations. Using absolute standards, one can focus on the achievement of students. However, this approach can fail to state reasonable standards of performance and can therefore be subjective. Another drawback of this approach is the difficulty of specifying clear definitions; although this difficulty can be reduced, it can never be eliminated. The second approach is appealing in that it ensures a realism that is at times lacking in the first approach. It avoids the problem of setting standards that are too high or too low. Also, a situation wherein everyone fails can be prevented. However, the individual grade of a student depends on the others, which is quite unfair. A second drawback of this kind of approach is the question of how the teacher will choose the relevant group: will it be the students in one class, the students in the school, the students in the state, or the students of the past ten years? A teacher must answer these questions to have a rationale for judging achievement in relation to other students. Another difficulty with this approach is the tendency to encourage unhealthy competition; if this happens, students become competitors with one another, which is not a good environment for teaching and learning. The last three approaches can be clustered together because they have similar strengths and weaknesses. Their strength is that they focus more on the individual, making the individual define a standard for himself. However, these three approaches have two drawbacks: one is that the conclusions can seem awkward, if not objectionable.
For example, a student who performs poorly but exerts effort gets a high grade, while a student who performs well but exerts less effort gets a lower grade. Another example would be: Ken, with an IQ of 150, gets a lower grade than Tom, with an IQ of 80, because Ken should have performed better, while we were pleasantly amazed by Tom's performance. Or: Kyle, starting with little knowledge of statistics, learned and progressed a lot, while Lyra, who was already proficient and knowledgeable in statistics, gained less progress. After the term, Kyle got a higher grade since he progressed more, although it can be clearly seen that Lyra is better than him. Conclusions of this type make people feel uncomfortable. The second drawback is reliability. Reliability is hard to obtain when we use differences as the basis for students' grades. In the case of effort, it is quite hard to measure and quantify, and so it is based on subjective judgments and informal observations. Hence, the grades resulting from these three approaches, when combined with achievement, are somewhat unreliable. Table 1 presents a summary of the advantages and disadvantages of the different rationales in grading.
Table 1. Advantages and Disadvantages of Different Rationales in Grading.

Rationale: Absolute Standards
  Advantages: Focuses exclusively on achievement
  Disadvantages: Standards are opinionated; difficulty in getting clear definitions

Rationale: Norms
  Advantages: Ensures realism; always clear to determine
  Disadvantages: Individual grades depend on others; difficulty in choosing the relevant group

Rationale: Improvement, Ability, Effort
  Advantages: Concentration on the individual
  Disadvantages: Awkward conclusions; low reliability
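The contrast between an absolute standard and a norm-referenced rationale can be illustrated with a short Python sketch. The class scores, the 75% mastery cutoff, and the percentile bands below are invented for illustration only.

# Hypothetical class scores (percent correct).
scores = [48, 52, 55, 57, 58, 60, 61, 63, 65, 70]

def absolute_grade(score):
    # Assumed fixed mastery cutoff of 75%.
    return "Pass" if score >= 75 else "Fail"

def percentile_rank(score, group):
    # Percentage of the group scoring below this score.
    below = sum(1 for s in group if s < score)
    return 100.0 * below / len(group)

for s in scores:
    pr = percentile_rank(s, scores)
    curve = "High" if pr >= 80 else ("Average" if pr >= 20 else "Low")
    print(s, absolute_grade(s), f"percentile rank = {pr:.0f}", curve)

# Under the absolute standard every student fails (no score reaches 75%), yet under
# the norm-referenced rationale the highest scorers still receive high grades.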
Review Questions:
1. What rationale for grading do you feel is most effective?
2. What rationale for grading is used in your school? Is it uniform across different subjects?
3. When is each of the rationales effective to apply?
4. When is each of the rationales ineffective to apply?

Activity

Conduct a survey of 20 teachers about the following:
1. What for them is the most effective rationale for grading, and why?
2. What for them is the most ineffective rationale for grading, and why?
3. When is each of the rationales most effective to use?
4. Present the results in table form.
5. Make a reflection paper about the results you gathered from the twenty teachers.
References
Brookhart, S. M. (2004). Grading. Upper Saddle River, NJ: Pearson Education.
Hogan, T. P. (2007). Educational assessment: A practical introduction. USA: John Wiley & Sons.
Oriondo, L. L., & Dallo-Antonio, E. M. (1984). Evaluating educational outcomes. Quezon City: Rex Printing Company.
Popham, J. W. (1998). Classroom assessment: What teachers need to know (2nd ed.). Needham Heights, MA: Allyn & Bacon.
Chapter 8 Standardized Tests
Objectives
1. Characterize standardized tests.
2. Determine the classification of tests.
3. Follow procedures in constructing norms.
4. Follow standards in test administration and preparation.
Lessons
1. What are standardized tests?
2. Interpreting Test Scores Through Norm and Criterion Reference
3. Standards in Educational and Psychological Testing
Lesson 1 What are Standardized Tests?
A test is a tool used to measure a sample of behavior. Why do we say "a sample" and not the entire behavior? A test can only measure part of a behavior. A test CANNOT measure the entire behavior of a person, or the entire characteristic being measured. For example, in a personality test, you cannot test the entire personality. In the case of the NEO-PI, the subscale on extraversion can only measure part of extraversion. As an implication, during pre-employment testing, a series or battery of tests is administered before an applicant is accepted, so as to better represent the behavior that needs to be uncovered. In school admission, the university or college requires student applicants' grades, entrance exam, essay, recommendation letter, and bioprofile to decide on the suitability of the student. A test can never measure everything.

There are proper uses of tests. What do you need to consider in a test? As discussed in Chapter 3, a test should be valid, reliable, and able to discriminate ability before one uses it. Validity refers to whether the test measures what it is supposed to measure. Reliability refers to whether the test scores are consistent, whether across administrations of the same test or between the test and another form. Discrimination is the ability of the test to determine who has learned and who has not.

What is the purpose of standardization? The primary purpose of standardization is (1) to facilitate the development of tools, and (2) to ensure that results from a test are indeed reliable and can therefore be used to assign values or qualities to the attributes being measured (through the established norms of the test).

What makes tests standardized? The characteristics of a standardized test that differentiate it from other tests are (1) uniform procedures in test administration and scoring, and (2) the establishment of norms.

Uses of Tests
1) Screening applicants for jobs and educational/training programs
2) Classification and placement of people in different contexts
3) Educational, vocational, and personal counseling and guidance
4) Retention/dismissal/promotion/rotation of students/employees in programs/jobs
5) Diagnosing and prescribing treatments in clinics/hospitals
6) Evaluating cognitive, intrapersonal, and interpersonal changes due to educational and psychotherapeutic programs
7) Conducting research on individual development over time and on the effectiveness of a new program

Classifications of Tests

Standardized vs. Non-Standardized. Standardized tests have fixed directions for administration and scoring. They can be purchased with test manuals, booklets, and answer sheets, and they have been administered to a sample that constitutes the norm. A non-standardized or teacher-made test is intended for classroom assessment and is used for classroom purposes. It intends to measure behavior in line with the objectives of the course. Examples are quizzes, long tests, exams, and the like. Can a teacher-made test become a standardized test? Yes, as long as it is valid and reliable and has a norm.

Individual Tests vs. Group Tests. Individual tests are administered to one examinee at a time. They are used for special populations such as children and people with mental disorders. Examples are the Stanford-Binet and the WISC. Group tests are administered to many examinees at a time. Examples are classroom tests.

Speed vs. Power. A speed test consists of easy items, but the time allowed is limited. A power test consists of fewer but more difficult items; although a time limit may still be imposed, difficulty rather than speed is the main factor.

Objective vs. Non-Objective/Subjective. Objective tests have fixed, objective scoring standards and commonly have right and wrong answers. Non-objective or subjective tests allow variation in responses and have no fixed answers. Examples are essays and personality tests.

Verbal vs. Non-Verbal Tests. Verbal tests consist of vocabulary and sentences; an example is a mathematics test presented in words. Non-verbal tests consist of puzzles and diagrams; examples are abstract reasoning and projective tests. A performance test requires the examinee to manipulate objects.

Cognitive vs. Affective. Cognitive tests measure the processes and products of mental ability; examples are tests of intelligence, aptitude, memory, and problem solving. An achievement test assesses what has been learned in the past. An aptitude test focuses on the future and on what the person is capable of learning; examples are the Mechanical Aptitude Test and tests of structural visualization. Affective tests assess interests, personality, and attitudes, that is, the non-cognitive aspects.
Lesson 2 Interpreting Test Scores Through Norm and Criterion Reference
The process of test standardization involves uniformity of procedure and an established norm. Uniformity of procedure means that the testing conditions must be the same for all. Directions are formulated, a time limit is considered, and a preliminary demonstration of administering the test is prepared. In administering the test, consider the rate of speaking, tone of voice, inflection, pauses, and facial expression. (Inflection is a change in the form of a word that reflects a change in grammatical function.) Test administration should be uniform to maintain constancy across testing groups and to minimize measurement error.

Establishing a norm for a test means obtaining the normal or average performance in the distribution of scores. A normal distribution is approached by increasing the sample size. A norm is a standard, and it is based on a very large group of samples. Norms are reported in the manual of a standardized test. Aside from the norm, the test manual includes a description of the test, how to administer it, reminders before testing, the script to be read by the person administering the test, and how to interpret the test. A normal distribution found in the manual takes the shape of a bell curve. It shows the number of people within a range of scores, and it also reports the percentage of people obtaining particular scores. The norm is used to convert a raw score into standard scores for interpretability.

What is the use of a norm? (1) A norm is the basis for interpreting a test score; (2) you use a norm to interpret a particular score.

There are two ways of interpreting scores: norm-referenced and criterion-referenced. Criterion referencing uses a given set of standards, and the scores are compared against the given criterion. For example, in a 20-item test: 16-20 is high, 11-15 is average, 6-10 is poor, and 0-5 is low. In criterion referencing, the score is interpreted against particular cutoff scores. Most commonly, the grading systems in schools are criterion-referenced, where 95-100 is outstanding, 90-94 is very good, 85-89 is good, 80-84 is satisfactory, 75-79 needs improvement, and 74 and below is poor. The interpretation in norm referencing depends on the distribution of scores of the sample. The mean and standard deviation are computed, and they approximate the middle area of the distribution. The standing of every individual under norm referencing is based on the mean and standard deviation of the sample. Standardized tests commonly interpret scores using norm referencing, since they have standardization samples.

The Normal Curve and Norms

Creating norms is usually done by test developers, psychometricians, and other practitioners in testing. When a test is created, it is administered to a large group of individuals. This group of individuals is the target sample for whom the test is intended. If the test can be used for a wide range of individuals, then a norm for each specific group possessing a given characteristic needs to be constructed. This means that a separate norm is created for males and females, for ages 11-12, 13-14, 15-16, 17-18, and so on. There should be a norm for every kind of user of the test in order to interpret his or her position in a given distribution. A variety of norms is needed because one cannot take a norm that was made for 12-year-olds and use it for 18-year-olds; the ability of an 18-year-old is different from that of a 12-year-old. If a 21-year-old needs to take a test but you DO NOT have a norm for 21-year-olds, then you have to create one. There is a need to create norms for particular groups because the groups involved differ from one another in terms of curriculum, ability, and so on. For example, the majority of standardized tests used in the Philippine setting are from the West, which means that their content and norms are based in that setting; thus, there is a need to create norms specifically for Filipinos. Another concern in developing norms is that they expire over time. Norms created in the 1960s cannot be used to interpret the scores of test takers in 2008; thus, norms need to be updated regularly.

In creating a norm, the goal is to come up with a distribution of scores that is typical of a normal curve. A normal distribution is asymptotic and symmetrical. Asymptotic means that the two tails of the normal curve do not touch the base, which extends to infinity. The two sides of the normal distribution are symmetrical. The normal curve is a theoretical distribution of cases in which the mean, median, and mode are the same and in which distances from the mean can be measured in standardized units such as standard deviation units or z-scores. A z-score is a standardized value obtained by transforming a raw score using the mean and standard deviation of the distribution; such transformations can also be applied to distributions that are not normally distributed. The baseline of the normal curve is commonly divided into six standard score units: z-scores range from -3 to +3, with a mean of 0 and a standard deviation of 1.
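The two ways of interpreting a score can be sketched briefly in Python. The criterion-referenced branch uses the 20-item cutoffs given above, while the norm-referenced branch converts a raw score into a z-score; the norm mean and standard deviation used here are assumed values for illustration only.

# Criterion-referenced interpretation using the 20-item cutoffs from the lesson.
def criterion_label(score):
    if score >= 16:
        return "High"
    if score >= 11:
        return "Average"
    if score >= 6:
        return "Poor"
    return "Low"

# Norm-referenced interpretation: relative standing expressed as a z-score.
def z_score(raw, norm_mean, norm_sd):
    return (raw - norm_mean) / norm_sd

raw = 14
norm_mean, norm_sd = 12.0, 3.0  # assumed norm-group values

print("Criterion interpretation:", criterion_label(raw))
print("z-score:", round(z_score(raw, norm_mean, norm_sd), 2))
# A z of about +0.67 places the examinee roughly two-thirds of a standard
# deviation above the norm-group mean.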
Steps in creating a norm

Suppose that a general ability test with 100 items was constructed and pilot tested on 25 participants. The goal is to construct a norm to interpret the scores of future test takers. (Generally, 25 respondents are not enough to create a norm; the small sample is used here only to illustrate the procedure.)

The 25 scores obtained were:

96  83  59  64  73
74  80  68  87  67
64  92  76  71  68
50  85  75  81  70
76  91  69  83  75

1. Compute the range:
   R = (highest score - lowest score) + 1
   R = (96 - 50) + 1
   R = 47

2. Compute the interval size (i):
   i = R / 10
   i = 47 / 10
   i = 4.7 (5 will be used as the interval size)

3. Start the class intervals with a score that is divisible by your interval size. The lowest score, which is 50, is divisible by 5 (the interval size), so the lowest class interval can start at 50.
4. Create the Frequency Distribution Table (FDT).

Class interval (ci)   Tally   Frequency (f)   Relative frequency (rf)   Cumulative frequency (cf)   Cumulative percentage (cP)
95-99                 |       1               4%                        25                          100
90-94                 ||      2               8%                        24                          96
85-89                 ||      2               8%                        22                          88
80-84                 ||||    4               16%                       20                          80
75-79                 ||||    4               16%                       16                          64
70-74                 ||||    4               16%                       12                          48
65-69                 ||||    4               16%                       8                           32
60-64                 ||      2               8%                        4                           16
55-59                 |       1               4%                        2                           8
50-54                 |       1               4%                        1                           4
                              Σf = 25

Notes on constructing the table: the class intervals start at values divisible by 5 (the interval size); the frequency (f) is obtained by counting the scores that belong to each class interval, and the frequencies should total N = 25; the relative frequency is rf = f / N (expressed as a percentage); the cumulative frequency (cf) is obtained by copying the lowest f and then adding each succeeding f going up; and the cumulative percentage is cP = (cf / N) × 100.
The frequency (f) and relative frequency (rf) indicate how many participants scored within each class interval. The cumulative percentage (cP) indicates the point in the distribution that has a given percentage of the cases below it. In the example, an examinee who scored 87 has 88% of the participants scoring at or below his score and 12% of the cases above it.
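As a check on the hand tally, the following Python sketch (an added illustration, not part of the original text) rebuilds the frequency distribution table from the 25 pilot scores with an interval size of 5.

```python
# Rebuild the FDT: frequency, relative frequency, cumulative frequency, cumulative percentage.
scores = [96, 83, 59, 64, 73, 74, 80, 68, 87, 67, 64, 92, 76, 71, 68,
          50, 85, 75, 81, 70, 76, 91, 69, 83, 75]

N = len(scores)
i = 5        # interval size from step 2
lower = 50   # lowest score, divisible by the interval size (step 3)

rows = []
cum = 0
for start in range(lower, max(scores) + 1, i):
    end = start + i - 1
    f = sum(start <= s <= end for s in scores)   # count scores in this class interval
    cum += f
    rows.append((f"{start}-{end}", f, 100 * f / N, cum, 100 * cum / N))

# Print from the highest interval down, as in the table above.
print("ci      f   rf     cf   cP")
for ci, f, rf, cf, cP in reversed(rows):
    print(f"{ci:7} {f:2}  {rf:4.0f}%  {cf:3}  {cP:4.0f}")
```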
(Figure: histogram of the frequencies plotted against the class-interval midpoints.)

When a histogram is created for the data set, it typifies a normal distribution. To determine whether a distribution of scores approximates a normal curve, there are indices to be assessed:
1. The mean and median should have approximately equal values.
2. The computed skewness (sk) is close to zero.
3. The computed kurtosis (K) is close to 0.263 (the percentile coefficient of kurtosis of a normal curve).

Computation of the mean and the median:

Mean:    X̄ = ΣX / N = 1877 / 25 = 75.08

Median:  C50 = cb + ((N(.50) − cf) / f) × i = 74.5 + ((12.5 − 12) / 4) × 5 = 75.13

Fifty percent of N = 25 is 12.5. Given this value, select from the cumulative frequency (cf) column of the frequency distribution table the entry that is closest to 12.5 without exceeding it; this is 12, which is the cf used in the formula. The f used is 4 because the median falls in the class interval 75-79, whose frequency is 4. The i value is the interval size, which is 5. To determine cb, the class boundary, take the lower boundary of the median class 75-79: the interval below it ends at 74, so the boundary between 74 and 75 is 74.5, and 74.5 is used as cb. The values of the mean (75.08) and the median (75.13) are close, so it can be assumed that the distribution is approximately normal.
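The same two statistics can be checked with a short Python sketch (an added illustration, not part of the original text): the mean is taken from the raw scores, and the median by interpolation within the grouped table, with cb, cf, f, and i read off the FDT above.

```python
# A sketch (not from the text): the mean from the raw scores and the grouped median,
# mirroring the C50 formula above.
scores = [96, 83, 59, 64, 73, 74, 80, 68, 87, 67, 64, 92, 76, 71, 68,
          50, 85, 75, 81, 70, 76, 91, 69, 83, 75]
N = len(scores)

mean = sum(scores) / N                       # 1877 / 25 = 75.08

# Values read off the FDT: lower boundary, cf below, and f of the median class 75-79.
cb, cf_below, f_median, i = 74.5, 12, 4, 5
median = cb + ((N * 0.50 - cf_below) / f_median) * i

print(round(mean, 2), median)                # 75.08 75.125 (≈ 75.13)
```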
Estimating Skewness. Skewness refers to the tail of a distribution. If the two tails are symmetrical, the distribution is normally distributed with a skewness of 0. A distribution that is not normal is said to be skewed. If the longer tail goes to the right, the skewness is positive; if the longer tail is on the left, the distribution is negatively skewed.

Notice that in a skewed distribution the mean and median are not equal. In a positively skewed distribution, the mean is pulled by the extreme scores on the right and has a higher value than the median (X̄ > C50). In a negatively skewed distribution, the mean is pulled by the extreme scores on the left, so the median has the higher value (X̄ < C50).

Formula for skewness:

sk = 3(X̄ − C50) / sd
where sd is the standard deviation, X̄ is the mean, and C50 is the median. The mean and median were computed in the previous section as 75.08 and 75.13, respectively. To determine the standard deviation, the formula below is used:

sd = √[ (ΣX² − (ΣX)² / N) / (N − 1) ]

sd = √[ (143713 − (1877)² / 25) / (25 − 1) ]

sd = 10.78

where ΣX is the sum of all scores, ΣX² is the sum of squares, and N is the sample size. ΣX² is obtained by squaring each score and then summing the squares, which gives 143713 for the given data. Substituting the values in the skewness formula:

sk = 3(X̄ − C50) / sd = 3(75.08 − 75.13) / 10.78 = -0.014

The value of sk is almost 0, which indicates that the distribution is approximately normal.
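The standard deviation and skewness computations can likewise be verified with a short sketch (again an added illustration, not from the original text); the values and formulas are the ones shown above.

```python
# A sketch (not from the text) of the sd and skewness computations shown above.
import math

scores = [96, 83, 59, 64, 73, 74, 80, 68, 87, 67, 64, 92, 76, 71, 68,
          50, 85, 75, 81, 70, 76, 91, 69, 83, 75]
N = len(scores)

sum_x = sum(scores)                  # ΣX  = 1877
sum_x2 = sum(s * s for s in scores)  # ΣX² = 143713

sd = math.sqrt((sum_x2 - sum_x ** 2 / N) / (N - 1))   # ≈ 10.78

mean, median = 75.08, 75.13          # from the previous computations
sk = 3 * (mean - median) / sd        # ≈ -0.014, close to 0
print(round(sd, 2), round(sk, 3))
```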
Estimating Kurtosis. Kurtosis refers to the peakedness of the curve. If a curve is peaked and its tails are elevated, it is leptokurtic; if the curve is flattened, it is platykurtic. A normal distribution is mesokurtic.

Formula for kurtosis:

Kurtosis = QD / (P90 − P10)

where QD is the quartile deviation, QD = (Q3 − Q1) / 2, P90 is the 90th percentile, and P10 is the 10th percentile. The formula used earlier for the median can be used to determine any percentile P. Q3 is equivalent to P75 and Q1 is equivalent to P25. Four percentile estimates are needed to determine kurtosis: P75, P25, P90, and P10.
P75 = cb + ((N(.75) − cf) / f) × i = 79.5 + ((18.75 − 16) / 4) × 5 = 82.94

P25 = cb + ((N(.25) − cf) / f) × i = 64.5 + ((6.25 − 4) / 4) × 5 = 67.31

P10 = cb + ((N(.10) − cf) / f) × i = 59.5 + ((2.5 − 2) / 2) × 5 = 60.75

P90 = cb + ((N(.90) − cf) / f) × i = 89.5 + ((22.5 − 22) / 2) × 5 = 90.75

The quartile deviation (QD) can now be computed:

QD = (Q3 − Q1) / 2 = (P75 − P25) / 2 = (82.94 − 67.31) / 2 = 7.82
The kurtosis can now be computed:

Kurtosis = QD / (P90 − P10) = 7.82 / (90.75 − 60.75) = 0.26

The distribution approximates a normal curve since the computed kurtosis (0.26) is close to 0.263.
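The percentile-based kurtosis check can be reproduced with the sketch below (an added illustration, not part of the original text); grouped_percentile is our own helper, and the cb, cf, and f values are the ones read off the FDT above.

```python
# A sketch (not from the text) of the percentile-based kurtosis check, using the
# grouped-percentile formula P = cb + ((N*p - cf) / f) * i and values from the FDT.
def grouped_percentile(p, cb, cf_below, f, i=5, N=25):
    return cb + ((N * p - cf_below) / f) * i

P75 = grouped_percentile(0.75, 79.5, 16, 4)   # ≈ 82.94
P25 = grouped_percentile(0.25, 64.5, 4, 4)    # ≈ 67.31
P90 = grouped_percentile(0.90, 89.5, 22, 2)   # ≈ 90.75
P10 = grouped_percentile(0.10, 59.5, 2, 2)    # ≈ 60.75

QD = (P75 - P25) / 2                # quartile deviation ≈ 7.81
kurtosis = QD / (P90 - P10)         # ≈ 0.26, close to the 0.263 of a normal curve
print(round(QD, 2), round(kurtosis, 2))
```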
Interpreting areas in the norm

How many participants scored below 94 on the test? A score of 94 corresponds approximately to a percentile rank of 96. Taking 96% of the total N of 25 gives the number of participants: 25(.96) = 24. This means that there are 24 cases below a score of 94.
What is the standard score corresponding to a raw score of 94, and where is it located on the normal curve? To convert a raw score to a standard z-score, the formula z = (X − X̄) / sd is used, where X is the given score. Using the data set, where X̄ = 75.08 and sd = 10.78:

z = (94 − 75.08) / 10.78 = 1.76

A raw score of 94 is therefore located at a z-score of 1.76 in the distribution.
Other Standard Scales in a Normal Distribution
Notice that the z-score has a mean of 0 and a standard deviation of 1. A T score has a mean of 50 and a standard deviation of 10. For the other scales:

Scale         Mean   Standard Deviation
CEEB score    500    100
ACT           15     5
Stanine       5      2
To convert a raw score of 94 into a T score, CEEB, ACT, and stanine score, take the z value of 1.76 for the raw score of 94, multiply it by the standard deviation of the target scale, and then add that scale's mean:

T score = z(10) + 50  = 1.76(10) + 50  = 67.6
CEEB    = z(100) + 500 = 1.76(100) + 500 = 676
ACT     = z(5) + 15   = 1.76(5) + 15   = 23.8
Stanine = z(2) + 5    = 1.76(2) + 5    = 8.52
A raw score of 94 is thus equivalent to a T score of 67.6, a CEEB score of 676, an ACT score of 23.8, and a stanine of 8.52 (stanines are reported as whole numbers from 1 to 9, so this would be reported as 9). Once a raw score is converted into a standard score, it can be interpreted based on its position in the normal curve. For example, a raw score of 94 is said to be above average because its location is above the mean.
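The conversions above follow a single pattern (multiply z by the target scale's standard deviation and add its mean), which the following sketch (not part of the original text) wraps in a small helper of our own naming; the mean of 75.08 and sd of 10.78 are the sample values computed earlier.

```python
# A sketch (not from the text): converting a raw score into the standard scales above.
def to_standard(raw, mean=75.08, sd=10.78):
    z = round((raw - mean) / sd, 2)        # 1.76 for a raw score of 94
    return {
        "z": z,
        "T": round(z * 10 + 50, 2),        # mean 50, sd 10
        "CEEB": round(z * 100 + 500),      # mean 500, sd 100
        "ACT": round(z * 5 + 15, 2),       # mean 15, sd 5
        "stanine": round(z * 2 + 5, 2),    # mean 5, sd 2
    }

print(to_standard(94))
# {'z': 1.76, 'T': 67.6, 'CEEB': 676, 'ACT': 23.8, 'stanine': 8.52}
```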
Areas of the Normal Curve

Because the normal distribution is symmetrical, it has constant areas. When cutoffs are made using z-scores, the areas are as follows: from the mean to a z-score of 1, the area covered is 34.13%, which corresponds to 1 standard deviation away from the mean. From a standard score of -1 to +1, a total area of 68.26% (34.13% + 34.13%) is covered. From -2 to +2, a total area of 95.44% is covered, and from -3 to +3, a total of 99.72%. The remaining area of the normal curve is 0.13% on each side. The approximate areas of the normal curve for every z-score are found in Appendix C of the book. For example, for a raw score of 94, what is the area away from the mean? Given the z-score of 1.76 for a raw score of 94, looking up 1.76 in Appendix C (first column, z-score) gives a value of .4608, which is the area away from the mean. This means that the area from the mean (z = 0) to a z-score of 1.76 occupies 46.08% of the normal distribution.
(Figure: the shaded area from the mean to z = 1.76 occupies 46.08% of the distribution; the remaining 3.92% lies in the upper tail.)
How many cases fall within this 46.08% of the distribution? Multiplying the area by N (.4608 × 25) gives about 12 participants.

What is the area above a z-score of 1.76? One way to determine it is to look at Appendix C under the column for the area in the smaller proportion. Locating the z value of 1.76, the corresponding area in the smaller proportion is .0392, which means that the remaining area in the right tail of the normal curve is 3.92%. Another solution is to subtract the shaded area from .5, which is half of the distribution (.5 − .4608 = .0392). A third solution is to subtract the shaded area from the entire area of the curve (1 − .4608 = .5392) and then subtract the half of the curve that is not involved (.5392 − .5), which again gives an area of .039.

1) How many cases fall within 68.26% of the norm distribution? Multiply N = 25 by .6826: 25 × .6826 gives about 17 cases.
(Figure: about 17 of the 25 cases fall within the shaded 68.26% region between z = -1 and z = +1, leaving 25 − 17 = 8 cases outside it.)
2) Given a score of 87 and another score of 73, how many people are between the two scores? Convert 87 and 73 into z-scores (X̄ = 75.08, sd = 10.78). A score of 87 corresponds to a z-score of 1.11, and a score of 73 corresponds to a z-score of -0.19. The z-score of 1.11 is located on the right side of the curve, above the mean, while the z-score of -0.19 is on the left side, below the mean, because of its negative sign. The area away from the mean is located for each z-score, the two areas are added, and the resulting proportion is multiplied by N = 25 to determine the number of cases between the two scores.
(Figure: the areas away from the mean are .3643 for z = 1.11 and .0753 for z = -0.19. Adding them gives .4396, and .4396 × 25 ≈ 11 cases lie between the two scores.)
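Instead of a printed table such as Appendix C, these areas can also be obtained from the standard normal distribution in Python's standard library. The sketch below (not part of the original text) reproduces the results above; small differences from the table-based answers are due to rounding of the z-scores.

```python
# A sketch (not from the text): normal-curve areas via the standard library.
from statistics import NormalDist

std = NormalDist(mu=0, sigma=1)
N = 25

area_mean_to_176 = std.cdf(1.76) - std.cdf(0)   # ≈ .4608, the Appendix C value
tail_above_176 = 1 - std.cdf(1.76)              # ≈ .0392

# Problem 2: cases between raw scores 87 (z ≈ 1.11) and 73 (z ≈ -0.19)
between = std.cdf(1.11) - std.cdf(-0.19)        # ≈ .4418
print(round(area_mean_to_176, 4), round(tail_above_176, 4), round(between * N))  # about 11 cases
```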
Summary of the Distinction between Criterion-Referenced and Norm-Referenced Tests

Purpose
Criterion-referenced tests: to determine whether each student has achieved specific skills or concepts, and to find out how much students know before instruction begins and after it has finished.
Norm-referenced tests: to rank each student with respect to the achievement of others in broad areas of knowledge, and to discriminate between high and low achievers.

Content
Criterion-referenced tests: measure specific skills which make up a designated curriculum. These skills are identified by teachers and curriculum experts, and each skill is expressed as an instructional objective.
Norm-referenced tests: measure broad skill areas sampled from a variety of textbooks, syllabi, and the judgments of curriculum experts.

Item Characteristics
Criterion-referenced tests: each skill is tested by at least four items in order to obtain an adequate sample of student performance and to minimize the effect of guessing; the items that test any given skill are parallel in difficulty.
Norm-referenced tests: each skill is usually tested by fewer than four items; items vary in difficulty and are selected to discriminate between high and low achievers.

Score Interpretation
Criterion-referenced tests: each individual is compared with a preset standard for acceptable achievement, and the performance of other examinees is irrelevant. A student's score is usually expressed as a percentage, and achievement is reported for individual skills.
Norm-referenced tests: each individual is compared with other examinees and assigned a score, usually expressed as a percentile, a grade-equivalent score, or a stanine. Achievement is reported for broad skill areas, although some norm-referenced tests do report achievement for individual skills.
Exercise

(True  False) 1. The mean of a distribution is equivalent to zero when scores are expressed as standard z-scores.
(True  False) 2. The mean and the median are equivalent to 0 in a normal curve.
(True  False) 3. 68% of the normal distribution lies within 2 standard deviations of the mean.
(True  False) 4. The entire area of the normal distribution is 100%.
(True  False) 5. The area in percentage from -3 to -2 of the normal distribution is 86.26%.
(True  False) 6. The extreme area at each end of the normal distribution is 0.13%.
(True  False) 7. The area of the normal curve from +2 to -1 is 95.44%.
(True  False) 8. The area from -2 to +1 is equivalent to the area from +2 to +1.
(True  False) 9. The mode is found at zero (z = 0) in a normal distribution.
Lesson 3: Standards in Educational and Psychological Testing
Controlling the Use of Tests

There is a need to control the use of tests because of the problem of leakage; when test content leaks, it becomes difficult to measure abilities accurately. To control the use of tests, several considerations must be ensured.

The first is a qualified examiner and proper procedure in test administration. A person becomes a qualified examiner by undergoing training in administering a particular test. The psychometrician is responsible for the psychometric properties and the selection of tests, and also trains the staff on how to administer standardized tests properly. A qualified examiner needs to follow the instructions precisely, which requires training or orientation to develop the skill of administering a test. The examiner must follow the test manual exactly; if the examiner deviates widely from the instructions, the purpose of standardization is defeated, since one of the distinct qualities of standardized measures is uniformity of administration. Moreover, a lack of precision in following the administration instructions can affect the results of the test. The examiner should be thoroughly familiar with the test's instructions and should memorize the script, even the part where they introduce themselves to the examinees. Careful control of the testing conditions is also important; this concerns the environment of the testing rooms. If several groups will take the exam, the conditions should be the same for all, including the lighting, temperature, noise, ventilation, and facilities, because the condition of the testing room can affect the test-taking process. Proper checking procedures should also be considered: it should be decided whether the test will be scored by computer scanning or checked manually, and a second round of checking should be done to verify that the scoring is accurate. Finally, there should be proper interpretation of results. Some trained examiners have the skills to prepare a psychological profile from a battery of tests, and the psychometrician is qualified to write a narrative integrating all test results. In some settings, staff are also trained to write psychological profiles, especially when there are occasional test takers.

Security of the Test Content

Test content should be restricted in order to forestall deliberate efforts to fake scores. The questionnaires can be accessed only by the psychometricians; staff, superiors, and anybody else are not allowed access to the tests. To avoid leakage and familiarity, the psychometrician can use different sets of standardized tests that measure the same characteristics for different groups of test takers. Test results are confidential: the examiner is not allowed to show the results to anybody other than the test taker and the persons who will use them for decision making, and results are kept where only the psychometrician and qualified personnel can access them. The nature of the test should be communicated effectively to the test takers. It is important to dispel any mystery and misconception regarding the test; test takers should be told what the test assesses and what decisions it will be used for. The procedures of the test can also be explained to test takers if they are concerned.
It is essential for them to know that the test is reliable and valid. Moreover, the examiner should dispel the anxiety of the test takers to ensure that they perform to the best of their ability. After the test, feedback on the results should be communicated to the test takers; it is their right to know the result of the test they took. The psychometrician is responsible for keeping all records of the results in case the test takers ask for them.

Test Administration

Before the test proper, the examiner should prepare for the administration. The preparation involves memorizing the script and becoming familiar with the instructions and procedures. The examiner should memorize the exact verbal instructions, especially the introduction. However, some standardized tests do not require the examiner to memorize the instructions and procedures; some permit the examiner to read them from the manual. In preparing the test materials, it is advisable that the examiner prepare a day before the testing day: counting the test booklets, answer sheets, and pencils, and preparing the sign boards, stopwatch, other materials, and the room itself. The room reservation should have been made a month before the testing date, the testing schedule prearranged, and the room condition fixed, including the ventilation, air-conditioning, and chairs. Thorough familiarity with the specific testing procedure is also important; this includes checking the names of the test takers and making sure the pictures on the test permits match the examinees' faces. The materials provided for administering the test, such as the stopwatch, should be checked to make sure they are working properly. Advance briefing of the proctors is also done through orientation and training on how to administer the test. During the test, the examiner is responsible for reading the instructions carefully, keeping time, and taking charge of the group taking the exam. The examiner should also prevent test takers from cheating, check whether the number of test takers corresponds with the number of test booklets after the session, and make sure test takers follow instructions, such as shading the circles when asked to do so. For questions that cannot be answered by the proctor, a testing manager should be nearby who can be consulted.

As for the testing conditions, the environment should not be noisy. The examiner should select good, suitable testing rooms that provide a good testing environment. The area where the test is administered should be restricted, noise should be regulated, and the temperature should be kept the same in all rooms. The room should be free of noise, the lights bright enough, and the seating adequate; other factors that can negatively affect the test takers while they are taking the exam should be controlled. Special steps should also be taken to prevent distractions, such as posting signs outside saying "examination going on," locking the door, or asking assistants outside the room to tell people that a test is going on in that area. Even subtle testing conditions, such as the tables, chairs, type of answer sheet, use of paper and pencil, or computer administration, may affect performance on ability and personality tests.
Introducing the Test to Test Takers

The test administrator should establish rapport with the test takers. Rapport refers to the examiner's efforts to arouse the test takers' interest in the test, elicit their cooperation, and encourage them to respond appropriately. For ability tests, encourage test takers to exert their best effort to perform well. For personality inventories, tell test takers to be frank and honest in responding to the questions. For projective tests, ask test takers to report fully the associations evoked by the stimuli without censoring or editing the content. In general, the examiner motivates test takers to follow the instructions carefully.

Testing Different Groups

For preschool children, the test administrator has to be friendly, cheerful, and relaxed, keep the testing time short, make the tasks interesting, and be flexible in scoring; examples of how to answer each test type are demonstrated to the children. For grade school students, the test administrator should appeal to their competitive side and their desire to do well. The educationally disadvantaged may not be motivated in the same way as typical test takers, so the examiner should adapt to their needs. Nonverbal tests are used for deaf examinees and for those who are not able to read and write, while oral tests should be given to examinees who have difficulty writing. For the emotionally disturbed, test administrators should be sensitive to difficulties the test takers might have when interpreting scores, and testing should occur when these examinees are in the proper condition. For adults, test administrators should explain the purpose of the test and convince the test takers that it is in their own interest. Examiner variables such as age, sex, ethnicity, professional and socio-economic status, training, experience, personality characteristics, and appearance affect the test takers. Situational variables such as an unfamiliar or stressful environment, activities before the test, emotional disturbance, and fatigue also affect the test takers.

Examples of Standardized Tests

Intelligence tests

IPAT Culture-Fair Test of "g"

The Culture-Fair Test of "g" measures individual intelligence in a manner designed to reduce the influence of verbal fluency, cultural climate, and educational level. It can be administered individually or in groups. It is a non-verbal test and requires only that examinees be able to perceive relationships among shapes and figures. Its subtests include series, classification, matrices, and conditions. There are three scales of the test: Scale 1 is intended for children aged 4-8 years and for older mentally handicapped persons, while Scales 2 and 3 are designed for group administration. Reliability coefficients are quite high and have been evaluated across large and widely diverse samples. The differences in reliability between the short form and the full test (Forms A and B) are sufficiently large to
warrant administration of the full test. Scale 2 reliability coefficients are .80-.87 for the full test and .67-.76 for the short form; Scale 3 has reliability coefficients of .82-.85 for the full test and .69-.74 for the short form. Construct and concurrent validity were examined. For Scale 2, construct validity was .85 for the full test and .81 for the short form, while concurrent validity was .77 for the full test and .70 for the short form. For Scale 3, construct validity was .92 for the full test and .85 for the short form, while concurrent validity was .65 for the full test and .66 for the short form. Standardization was done for both scales: for Scale 2, 4,328 males and females from varied regions of the US and Britain were included, and for Scale 3, 3,140 American high school students from first to fourth year, along with young adults, participated.

Otis-Lennon Mental Ability Test

This test was developed by Arthur Otis and Roger Lennon and was published by Harcourt, Brace and World, Inc. in New York in 1957. It was designed to provide a comprehensive assessment of the general mental ability of students in American schools and to measure the student's facility in reasoning and in dealing abstractly with verbal, symbolic, and figural test content. The content samples a broad range of mental ability; it is important to note that the test does not intend to measure the innate mental ability of students. There are six levels of the Otis-Lennon Mental Ability Test to ensure comprehensive and efficient measurement of the mental ability already developed among students in grades K-12: Primary Level I is intended for students in the last half of kindergarten, Primary Level II for the first half of grade 2, Elementary I for the last half of grade 2 through grade 3, Elementary II for grades 4-6, Intermediate for grades 7-9, and Advanced for grades 10-12. The norm was obtained from 200,000 students in 117 school systems across the 50 states who participated in the national standardization program; there were 12,000 pupils from grades 1-12, while 6,000 were from kindergarten. For reliability, split-half coefficients ranged from .93 (Elementary I) to .96 (Intermediate), and Kuder-Richardson (KR-20) coefficients were likewise .93 (Elementary I) to .96 (Intermediate); alternate-form reliability coefficients ranged from .89 (Elementary II) to .94 (Intermediate). For validity, correlations with school grades and achievement test scores were computed, as well as the relationships between the OLMAT and other accepted mental ability and aptitude tests.

Otis-Lennon School Ability Test

This test was developed by Arthur Otis and Roger Lennon and was published by Harcourt Brace Jovanovich, Inc. in New York in 1979. It was developed to give an accurate and efficient measure of the abilities needed to attain the desired cognitive outcomes of formal education. It intends to measure general mental ability, or Spearman's "g," as modified by Vernon, who postulated two major components of "g": the verbal-educational and the practical-mechanical. The test focuses on the verbal-educational factor through a variety of tasks that call for the application of several processes to verbal, quantitative, and pictorial content.
The OLSAT is organized in five levels: Primary Level I for grade 1 students, Primary Level II for grades 2 and 3, Elementary for grades 4 and 5, Intermediate for grades 6-8, and Advanced for grades 9 through 12. Each level is
designed to provide reliable and efficient measurement for the students for whom it is intended. For each level, two parallel forms of the test, Forms R and S, were developed; items in the two forms are balanced in terms of content, difficulty, and discriminating power, and the forms yield comparable results. A norm group of 130,000 students in 70 school systems, enrolled in grades 1-12 in American schools, was used for standardization. For reliability, Kuder-Richardson coefficients ranged from .91 to .95, and test-retest coefficients from .93 to .95. The standard error of measurement was also computed: about two-thirds of obtained scores fell within one standard error of measurement of the "true scores," and 95% fell within two standard errors. For validity, the OLSAT was correlated with teachers' grades, yielding coefficients of .40-.60 with a median of .49, and it was also correlated with achievement test scores.

Raven's Progressive Matrices

This test was originally developed by Dr. John C. Raven in 1936 and is distributed in the U.S. by The Psychological Corporation. It is a multiple-choice test of abstract reasoning designed to measure a person's ability to form perceptual relations and to reason by analogy independently of language and formal schooling; it is a measure of Spearman's g. It consists of 60 items arranged in five sets (A, B, C, D, and E) of 12 items each. Each item contains a figure with a missing piece, and there are either six (sets A and B) or eight (sets C through E) alternative pieces to complete the figure, only one of which is correct. Each set involves a different principle or "theme" for obtaining the missing piece, and within a set the items are roughly arranged in increasing order of difficulty. The raw score is converted to a percentile rank through the use of the appropriate norms. The test is intended for people ranging in age from 6 years to adulthood. The matrices are offered in three forms for participants of different ability: the Standard Progressive Matrices, the Colored Progressive Matrices, and the Advanced Progressive Matrices. The Standard Progressive Matrices, the original form, were first published in 1938. This form comprises five sets (A to E) of 12 items each, with items within a set becoming increasingly difficult, requiring ever greater cognitive capacity to encode and analyze information; all items are presented in black ink on a white background. The Colored Progressive Matrices are designed for younger children, the elderly, and people with moderate or severe learning difficulties. This form consists of sets A and B from the standard matrices, with a further set of 12 items inserted between the two as set Ab. Most items are presented on a colored background so that the test appears visually stimulating to participants, but the very last few items in set B are presented as black-on-white so that, if participants exceed the tester's expectations, the transition to sets C, D, and E of the standard matrices is eased. The Advanced Progressive Matrices contain 48 items, presented as one set of 12 (Set I) and another of 36 (Set II), also in black ink on a white background.
Items become increasingly difficult as progress is made through each set, and the items in this form are appropriate for adults and adolescents of above-average intelligence. The last two forms of the matrices were published in 1998. In terms of establishing norms, the standardization samples included British children between the ages of 6 and 16, Irish children between the ages of 6 and 12, and military and civilian subjects between the ages of 20 and 65; other samples came from Canada, the United States, and Germany. The two main factors of Raven's Progressive
Matrices correspond to the two main components of general intelligence originally identified by Spearman: eductive ability (the ability to think clearly and make sense of complexity) and reproductive ability (the ability to store and reproduce information). For reliability, split-half and KR-20 estimates range from .60 to .98, with a median of .90. Test-retest correlations were also obtained, with coefficients ranging from a low of .46 for an eleven-year interval to a high of .97 for a two-day interval; the median test-retest value is approximately .82. Raven provided test-retest coefficients for several age groups: .88 (13 years and above), .93 (under 30 years), .88 (30-39 years), .87 (40-49 years), and .83 (50 years and over). With regard to validity, Spearman considered the SPM to be the best measure of g. When evaluated with the factor-analytic methods originally used to define g, the SPM comes as close to measuring it as one might expect, and the majority of studies that have factor-analyzed the SPM along with other cognitive measures in Western cultures report loadings higher than .75 on a general factor. Moreover, concurrent validity coefficients between the SPM and the Stanford-Binet and Wechsler scales range between .54 and .88, with the majority in the .70s and .80s.

SRA Verbal

This is a general ability test which measures an individual's overall adaptability and flexibility in comprehending and following instructions and in adjusting to alternating types of problems. It is designed for use in both school and industry. It has two forms, A and B, which can be used at all educational levels from junior high school to college and at all employee levels from unskilled laborers to middle management. However, it is intended only for persons familiar with the English language; to determine the general ability of persons who speak a foreign language or who cannot read, a non-verbal or pictorial test should be used. The items are of two types: vocabulary (linguistic) and arithmetic reasoning (quantitative). The test is intended for ages 12 to 17. Reliability coefficients are in the high .70s for all the scores: linguistic, quantitative, and total. The means of the two forms were also found to be very similar. For validity, the SRA correlates with other tests, particularly the HS Placement Test (r = .60) and the Army General Classification Test (r = .82).

Watson-Glaser Critical Thinking Appraisal

This test was designed to measure a person's critical thinking. It is a series of exercises requiring the application of some of the important abilities involved in thinking critically, and it includes problems, statements, arguments, and interpretations of data similar to those a citizen in a democracy might encounter in daily life. It has two forms, Ym and Zm, each consisting of five subtests designed to measure different but interdependent aspects of critical thinking. There are 100 items, and it is a test of power rather than speed. The five subtests are inference, recognition of assumptions, deduction, interpretation, and evaluation of arguments. Inference consists of 10 items in which students must discriminate among degrees of truth or falsity of inferences drawn from given data. Recognition of assumptions (16 items) asks students to recognize unstated assumptions or presuppositions taken for granted in given statements or assertions.
Next, deduction (25 items) tests the ability to reason deductively from given statements or premises and to recognize the relation of implication between propositions. Fourth, interpretation measures the ability to weigh evidence and to distinguish between generalizations
that are not warranted by the given data and generalizations which, although not absolutely certain or necessary, do seem to be warranted beyond a reasonable doubt. Lastly, evaluation of arguments measures the ability to distinguish between arguments which are strong and relevant and those which are weak or irrelevant to a particular question or issue. For the standardization of the test, a norm was established using four grade levels, grades 9, 10, 11, and 12, with a total of 20,312 participating students. The high schools had to be regular public institutions in communities of 10,000-75,000 with a minimum of 100 students; this was done to avoid the biasing influences associated with extremely small schools and with the specialized high schools found in some very large systems. Reliability was determined using the split-half method. The computed reliability coefficients were .61, .74, .53, .67, and .62 for inference, recognition of assumptions, deduction, interpretation, and evaluation of arguments, respectively, in the Ym form, and .55, .54, .41, .52, and .40 for the same subtests in the Zm form. Validity was examined through content and construct validity. The indication of content validity was the extent to which the appraisal measures a sample of the specified objectives of instructional programs in critical thinking. For construct validity, intercorrelations among the various forms of the test ranged from .21 to .50, and correlations of the subtests with the appraisal as a whole ranged from .56 to .79.

Achievement Tests

Metropolitan Achievement Test

This test was designed to provide accurate and dependable data concerning student achievement in the important skill and content areas of the school curriculum. It is based on the view that an achievement test should assess what is being taught in classrooms, and it has been extended to include the first half of kindergarten and grades 10-12. It is a two-component system of achievement evaluation designed to obtain both norm-referenced and criterion-referenced information. The first is the instructional component, designed for classroom teachers and curriculum specialists; it is an instructional planning tool that provides prescriptive information on the educational performance of individual students in terms of specific instructional objectives. A separate instructional battery under this component covers reading, mathematics, and language, all available in JI and KI forms. The second is the survey component, which provides the classroom teacher with considerable information about the strengths and weaknesses of the students in the class in the important skill and content areas of the school curriculum; under it are eight overlapping batteries covering the range from K-12, also including reading, mathematics, and language. For the norm, participants were selected to represent the national population in terms of school system enrollment, public versus non-public school affiliation, geographic region, socio-economic status, and ethnic background. There were 550 students, with 10% of public schools drawn from the metropolitan population and 10% from the national population; for socio-economic status, 54% of the metropolitan sample and 52% of the national sample were adults who had graduated from high school.
Reliability was computed using KR-20 and yielded .93 for reading, .91 for mathematics, and .88 for language; the basic battery was .96. The standard error of measurement was also computed, yielding 2.8 for reading, 2.9 for mathematics, 3.4 for language, and 5.3 for the basic battery. Validity was established through content validity, on the belief that the objectives
and items should correspond to the school curriculum; with this in mind, a compendium of instructional objectives was made available.

Stanford Achievement Test

This test was designed by Gardner, Rudman, Karlson, and Merwin in 1981. It is a series of comprehensive tests developed to assess the outcomes of learning at different levels in the educational sequence, measuring the objectives of general education from kindergarten through the first year of college. Its series includes the SESAT (Stanford Early School Achievement Test) and the TASK (Stanford Test of Academic Skills). The SAT proper is intended for primary, intermediate, and junior high school students and assesses the essential learning outcomes of the school curriculum. It was first published in 1923 and underwent several revisions until 1982; these revisions were done to keep a close match between test content and instructional practices, to provide norms that accurately reflect the performance of students at different grade levels, and to apply improvements in measurement technology to the interpretation of scores. The SESAT is for children in kindergarten and grade 1; it measures the cognitive development of children upon admission and entry into school in order to establish a baseline from which learning experiences may best begin. The TASK, on the other hand, is intended for students in grades 8 to 13 (first year of college) and measures basic skills: Level I of the TASK is for grades 8-12 and measures the competencies and skills desired at the adult societal level, while Level II is for grades 9-13 and measures the skills that are requisite to continued academic training. The SAT contains subtests in reading comprehension, vocabulary, listening comprehension, spelling, language, concepts of number, mathematics computation, mathematics applications, and science. Reading comprehension measures understanding of textual material (the kind typically found in books), functional material (print encountered in daily life), and recreational material (reading for enjoyment, such as poetry and fiction). Vocabulary measures the pupil's language competence without requiring reading prior to the test. Listening comprehension evaluates the student's ability to process information that has been heard. Spelling tests the student's ability to identify the misspelled word in a group of four words. The language test has three parts: the proper use of capital letters, the use of punctuation marks, and the appropriate use of the parts of speech. Concepts of number covers the student's understanding of basic concepts about numbers. Mathematics computation includes multiplication and division of whole numbers and operations on fractions, decimals, and percents. Mathematics applications tests the student's ability to apply the concepts learned to problem solving. Lastly, science measures the student's understanding of the basic physical and biological sciences. One of the vocabulary items in the SAT is "When you have a disease, you are ____" (a. sick, b. rich, c. lazy, d. dirty). The reliability of the test was established through internal consistency (KR-20 coefficients of .85-.95), standard errors of measurement, and alternate-form reliability. For validity, the test content was compared with the instructional objectives of the curriculum.
Aptitude Tests
Differential Aptitude Test

This test was designed to meet the needs of guidance counselors and consulting psychologists, whose advice and ideas were sought in planning a battery that would meet exacting standards and be practical for daily use in schools, social agencies, and business organizations. The original forms (Forms A and B) were developed in 1947 with the aim of providing an integrated, scientific, and well-standardized procedure for measuring the abilities of boys and girls in grades 8-12 for purposes of educational and vocational guidance. It is intended primarily for junior and senior high school students, but it can also be used in the educational and vocational counseling of young adults out of school and in the selection of employees. The test was revised and restandardized in 1962 (Forms L and M) and in 1972 (Forms S and T). The DAT battery includes verbal reasoning, numerical ability, abstract reasoning, clerical speed and accuracy, mechanical ability, space relations, and spelling. Verbal reasoning measures the ability of the student to understand concepts framed in words. Numerical ability tests the student's understanding of numerical relationships and facility in handling numerical concepts, including arithmetic computations. Abstract reasoning is intended as a non-verbal measure of the student's reasoning ability. Clerical speed and accuracy measures speed of response in simple perceptual tasks involving simple number and letter combinations. The mechanical ability test is a reconstructed (but easier) version of the Mechanical Comprehension Test and measures mechanical intelligence. Space relations measures the ability to deal with concrete materials through visualization. Lastly, spelling measures the student's ability to detect errors in grammar, punctuation, and capitalization. Norms were obtained in the form of percentiles and stanines. Seventy-six school districts were included, testing students in grades 8-12, including schools in the District of Columbia; schools with 300 or more students each were included. In small school districts the entire grade 8-12 enrollment participated, while for large school districts representative schools were included, taking into consideration school achievement and racial composition. In all, there were 14,049 8th grade students, 14,793 9th grade students, 13,613 10th grade students, 11,573 11th grade students, and 10,764 12th grade students. Reliability coefficients were computed using the split-half method. The validity coefficients presented, together with the expectancy tables, demonstrate the utility of the Differential Aptitude Test for educational guidance and show that each of the tests is potentially useful.

Flanagan Industrial Test

This is a set of 18 short tests designed for use with adults in personnel selection programs for a wide variety of jobs; the tests are short and self-administering. The FIT battery measures 18 subscales: arithmetic, assembly, components, coordination, electronics, expression, ingenuity, inspection, judgment and comprehension, mathematics and reasoning, mechanics, memory, patterns, planning, precision, scales, tables, and vocabulary. Arithmetic measures accuracy in working with numbers. Assembly measures the ability to visualize the appearance of an object assembled from separate parts. Components is the ability to locate and identify the important parts of a whole.
Coordination tests the coordination of arms and hands. Electronics measures understanding of electronic principles and the ability to analyze diagrams of electrical circuits. Expression is the feeling for and knowledge of correct English and the ability to convey ideas in writing and speaking. Ingenuity refers to being creative and inventive
and having the ability to devise procedures, equipment, and presentations. Inspection is the ability to spot flaws and imperfections in a series of articles accurately and quickly. Judgment and comprehension is the ability to read with understanding and to use good judgment in interpreting materials. Mathematics and reasoning refers to understanding basic mathematical concepts and the ability to apply them in solving problems. Mechanics is the ability to understand mechanical principles and analyze mechanical movements. Memory tests the ability to learn and recall associations. Patterns refers to the ability to perceive and reproduce simple pattern outlines accurately. Planning is the ability to foresee problems that may arise and to anticipate the best order for carrying out steps. Precision refers to the ability to make precise finger movements with accuracy. Scales is the ability to read and understand what scales, graphs, and charts convey. Tables refers to the ability to read and understand tables accurately and quickly. Vocabulary refers to the ability to choose the right terms to convey one's ideas. The standardization sample consisted of 12th grade students. Reliability coefficients for the individual tests range from .50 to .90; when the FIT was correlated with the FACT, the correlations ranged from .28 (memory) to .79 (arithmetic). For validity, many of the short tests have fairly substantial validity coefficients, ranging from .20 to .50 based on stepwise multiple regression. Five of the tests, namely mathematics and reasoning, judgment and comprehension, planning, arithmetic, and expression, yield a multiple correlation of .5898 with fall semester GPA, and the first four of these, along with vocabulary and precision, provide a multiple correlation of .47 with spring semester GPA. In general, the multiple correlations vary from .40 to .57.

Personality Tests

Edwards Personal Preference Schedule

This test was created by Allen L. Edwards and published by The Psychological Corporation. It is an instrument for research and counseling purposes that provides convenient measures of independent personality variables, along with measures of test consistency and profile stability. It is a non-projective personality test derived from H. A. Murray's theory, and it measures the standing of individuals on fifteen normal needs or motives; these needs or motives from Murray's theory are the basis of the statements used in the Edwards Personal Preference Schedule. It consists of 15 scales: achievement, deference, order, exhibition, autonomy, affiliation, intraception, succorance, dominance, abasement, nurturance, change, endurance, heterosexuality, and aggression. Achievement is the desire to exert one's best effort. Deference is the tendency to seek suggestions from other people, to do what is expected, to praise others, to conform, and to accept others' leadership. Order refers to neatness and organization in doing one's work, arranging everything in proper order so that everything runs smoothly. Exhibition is the tendency to say smart and clever things to gain others' praise and to be the center of attention. Autonomy is the desire to do whatever one wishes, avoiding conformity and making independent decisions. Affiliation is having a lot of friends, the ability to form new acquaintances, and the building of intimate attachments with others.
Intraception is the tendency to put oneself in others' shoes and to analyze others' behaviors and motives. Succorance is the desire to be helped by others in times of trouble, to seek encouragement, and to want others to be sympathetic. Dominance is the tendency to argue for one's own view, to act as a leader in the group, thereby influencing others, and to make group decisions.
Abasement is the tendency to feel guilty when one commits a mistake, to accept blame, and to feel the need to confess after a mistake is made. Nurturance is the readiness to help friends who are in trouble, the desire to help the less fortunate, showing great affection to others, and being kind and sympathetic. Change is the tendency to explore new things and to do things outside of routine. Endurance is the ability to keep working at a task until it is finished and to stick with a problem until it is solved. Heterosexuality is the desire to go out with friends of the opposite sex, to be physically attracted to people of the opposite sex, and to be sexually excited. Lastly, aggression is the tendency to criticize others in public, to attack contrary points of view, and to make fun of others. This test is intended for college students and adults. To set the norm, 1,509 college students were included; the norm group included high school graduates and those with college training, consisting of 749 college females and 760 college males. Part of the normative sample also consisted of adults, male and female household heads who were members of a consumer purchase panel used for market surveys; they came from rural and urban areas of counties in the 48 states, and the consumer panel consisted of 5,105 households. For reliability, a split-half technique was used; the coefficients of internal consistency for the 1,509 students in the college normative group range from .60 to .87, with a median of .78. Test-retest stability coefficients with a one-week interval, based on a sample of 89 students, range from .55 to .87 with a median of .73, and other researchers have reported similar results over a three-week period, with correlations of .55 to .87 and a median of .73. For validity, the manual reports studies comparing the EPPS with the Guilford-Martin Personality Inventory and the Taylor Manifest Anxiety Scale. Other researchers have correlated the California Psychological Inventory, the Adjective Check List, the Thematic Apperception Test, the Strong Vocational Interest Blank, and the MMPI with the EPPS. In these studies there are often statistically significant correlations among the scales of these tests and the EPPS, but the relationships are usually low to moderate and often difficult for the researcher to explain.

Guilford-Zimmerman Temperament Survey

This inventory was developed for organizational psychologists, personnel professionals, clinical psychologists, and counseling professionals in mental health facilities, businesses, and educational settings. It was developed with the aim of measuring attributes related to personality and temperament that might help predict successful performance in various occupations; identifying students who may have trouble adjusting to school and the types of problems that may occur; assessing temperamental trends that may be the source of problems and conflicts in marriage or other relationships; and providing objective personality information to complement other data that may assist with personnel selection, placement, and development. The test provides a nonclinical description of an individual's personality characteristics that can be used in career planning, counseling, and research.
Its subscales are General Activity (G), Restraint (R), Ascendance (A), Sociability (S), Emotional Stability (E), Objectivity (O), Friendliness (F), Thoughtfulness (T), Personal Relations (P), and Masculinity (M). A high score in General Activity means having a strong drive and energy. A high score in Restraint means not being happy-go-lucky, carefree, and impulsive. A high score in Ascendance indicates a tendency to ride rough-shod over others, typical of the work of foremen and supervisors. A high score in Sociability means optimism and cheerfulness. A high score in Objectivity may mean being less egoistic and not hypersensitive. A high score in Friendliness means a lack of fighting tendencies and
a desire to be liked by others. A high score in Thoughtfulness may pertain to men who have an advantage in obtaining supervisory positions. A high score in Personal Relations means a high capability for getting along with other people. A high score in Masculinity may pertain to people who behave in ways that are more typical of men. Examples of items in this test are "You like to play practical jokes on others" and "Most people are out to get more than they give." Standardization was done by gathering 523 college men and 389 college women from one southern California university and two junior colleges for all traits except Thoughtfulness; the male sample included veterans aged 18-30. Reliability was calculated using KR-20 and ranged from .79 (General Activity) to .87 (Sociability). The intercorrelations among the ten traits are gratifyingly low; only two are high, between Sociability and Ascendance and between Emotional Stability and Objectivity. As for validity, it is believed that what each score measures is fairly well defined and that each score represents a confirmed dimension of personality and a dependable descriptive category. The most impressive validity data have come from the use of the inventory with supervisory and administrative personnel.

IPAT 16 Personality Factors

The 16 PF was originally developed by Raymond Cattell, Karen Cattell, and Heather Cattell to help identify personality factors. It can be administered to individuals 16 years and older. There are 16 bipolar dimensions of personality and 5 global factors. The bipolar dimensions of personality are Warmth (Reserved vs. Warm; Factor A), Reasoning (Concrete vs. Abstract; Factor B), Emotional Stability (Reactive vs. Emotionally Stable; Factor C), Dominance (Deferential vs. Dominant; Factor E), Liveliness (Serious vs. Lively; Factor F), Rule-Consciousness (Expedient vs. Rule-Conscious; Factor G), Social Boldness (Shy vs. Socially Bold; Factor H), Sensitivity (Utilitarian vs. Sensitive; Factor I), Vigilance (Trusting vs. Vigilant; Factor L), Abstractedness (Grounded vs. Abstracted; Factor M), Privateness (Forthright vs. Private; Factor N), Apprehension (Self-Assured vs. Apprehensive; Factor O), Openness to Change (Traditional vs. Open to Change; Factor Q1), Self-Reliance (Group-Oriented vs. Self-Reliant; Factor Q2), Perfectionism (Tolerates Disorder vs. Perfectionistic; Factor Q3), and Tension (Relaxed vs. Tense; Factor Q4). The global factors are Extraversion, Anxiety, Tough-Mindedness, Independence, and Self-Control. A stratified random sample reflecting the 2000 U.S. Census was used to create the normative sample, which consisted of 10,261 adults. Test-retest coefficients offer evidence of the stability over time of the different traits measured by the 16 PF; Pearson product-moment correlations were calculated for two-week and two-month test-retest intervals. Reliability coefficients for the primary factors ranged from .69 (Reasoning, Factor B) to .86 (Self-Reliance, Factor Q2), with a mean of .80, and test-retest coefficients for the global factors were higher, ranging from .84 to .90 with a mean of .87. Cronbach's alpha values ranged from .64 (Openness to Change, Factor Q1) to .85 (Social Boldness, Factor H), with an average of .74. Validity studies of the 16 PF (5th ed.) demonstrate its ability to predict various criterion measures such as the Coopersmith Self-Esteem Inventory, Bell's Adjustment Inventory, and social skills inventories.
Its scales also correlate well with the factors of the Myers-Briggs Type Indicator.

Myers-Briggs Type Indicator
The Myers-Briggs Type Indicator (MBTI) is a psychometric questionnaire designed to measure psychological preferences in how people perceive the world and make decisions. These preferences were extrapolated from the typological theories originated by Carl Gustav Jung, as published in his 1921 book Psychological Types. The original developers of the personality inventory were Katharine Cook Briggs and her daughter, Isabel Briggs Myers. The test is suited for individuals 14 years old and above and requires a 7th grade reading level. It has eight factors: Extraversion, Sensing, Thinking, Judging, Introversion, Intuition, Feeling, and Perceiving. People with a preference for Extraversion draw energy from action: they tend to act, then reflect, then act further, and if they are inactive, their level of energy and motivation tends to decline. Conversely, those whose preference is Introversion become less energized as they act: they prefer to reflect, then act, then reflect again, and they need time out to reflect in order to rebuild energy. Sensing and Intuition are the information-gathering (Perceiving) functions; they describe how new information is understood and interpreted. Individuals who prefer Sensing are more likely to trust information that is present, tangible, and concrete: that is, information that can be understood by the five senses. They tend to distrust hunches that seem to come out of nowhere and prefer to look for details and facts; for them, the meaning is in the data. On the other hand, those who prefer Intuition tend to trust information that is more abstract or theoretical and that can be associated with other information (either remembered or discovered by seeking a wider context or pattern). Thinking and Feeling are the decision-making (Judging) functions; both are used to make rational decisions based on the data received from the information-gathering functions (Sensing or Intuition). Those who prefer Thinking tend to decide things from a more detached standpoint, measuring the decision by what seems reasonable, logical, causal, consistent, and in accordance with a given set of rules. Those who prefer Feeling tend to come to decisions by associating or empathizing with the situation, looking at it "from the inside" and weighing the situation to achieve, on balance, the greatest harmony, consensus, and fit, considering the needs of the people involved. Myers and Briggs taught that types with a preference for Judging show the world their preferred Judging function (Thinking or Feeling), so TJ types tend to appear to the world as logical and FJ types as empathetic; according to Myers, Judging types prefer to "have matters settled." Types ending in P show the world their preferred Perceiving function (Sensing or Intuition), so SP types tend to appear to the world as concrete and NP types as abstract; according to Myers, Perceiving types prefer to "keep decisions open." Regarding validity, unlike other personality measures such as the Minnesota Multiphasic Personality Inventory or the Personality Assessment Inventory, the MBTI lacks validity scales to assess response styles such as exaggeration or impression management. The MBTI has not been validated by double-blind tests, in which participants accept reports written for other participants and are asked whether or not the report suits them, and thus it may not qualify as a scientific assessment.
With regard to factor analysis, one study of 1,291 college-aged students found six different factors instead of the four used in the MBTI. In other studies, researchers found that the JP and the SN scales correlate with one another. For reliability, some researchers have interpreted the reliability of the test as being low, with test takers who retake the test often being assigned a different type. According to some studies, 39-76% of those tested fall into different types upon retesting some weeks or years later. About 50% of people tested within nine months remain the same overall type, and 36% remain the same type after more than nine months. When people are asked to compare their preferred type to that assigned by the MBTI, only half pick the same profile. Critics also argue that the MBTI lacks falsifiability, which can cause confirmation bias in the interpretation of results.
The standardization sample was composed of high school, college, and graduate students; recently employed college graduates; and public school teachers.

Panukat ng Ugali at Pagkatao
This test, also called the PUP, was developed by Virgilio G. Enriquez and Ma. Angeles Guanzon-Lapeña. It was published by the Research Training House in 1975. The Panukat ng Ugali at Pagkatao is a psychological test that can be used for research, employment, and the screening of members and students in an institution. Its reliability is .90, and its test-retest reliability is .94 (p < .01). It has trait subscales corresponding to the dimensions of Extraversion or Surgency, Agreeableness, Conscientiousness, Emotional Stability, and Intellect or Openness to Experience, and each subscale has underlying personality traits.

Under extraversion are ambition (+), guts/daring (+), shyness or timidity (-), and conformity (-). Ambition is the tendency of a person to act toward the accomplishment of his/her goals. Guts/daring is courage, a very strong emotion from within the person; it can relate to things at risk or in danger, be it life, an aspect of life, or material things. Shyness or timidity is the trait of being timid, reserved, and unassertive; a shy person tends not to socialize with others, does not engage in eye contact, and loses trust in himself/herself, so he/she prefers to be alone. Conformity is the tendency of a person to take into consideration what other people are saying, especially if the other person holds a higher position; a conforming person tends to disregard his/her own opinion.

For agreeableness, the traits are respectfulness (+), generosity (+), humility (+), helpfulness (+), difficulty to deal with (-), criticalness (+), and belligerence (-). Respectfulness is the trait of giving value to the person one is talking to regardless of his/her position and age. Generosity is the ability to satisfy the needs of others by giving what they need or want even if it is not in accordance with one's personal desires. Humility is the trait of showing modesty and humbleness in dealing with other people and of not boasting of one's accomplishments and status in life. Helpfulness is the desire to attend to others' needs and fill their shortcomings. Difficulty to deal with is the tendency of a person to agree to something only after many requests. Criticalness is the tendency of a person to criticize every small detail of something, giving attention to things that are rarely noticed by others. Belligerence is the trait of being quarrelsome and hot-headed, easily angered, and frequently in trouble because of little or no patience.

For the conscientiousness dimension, the traits are thriftiness, perseverance, responsibleness, prudence, fickle-mindedness, and stubbornness. Thriftiness is the ability of a person to manage his/her resources wisely and to be conservative in spending money. Perseverance is the persistence of a person to achieve his/her goals and to stay with things already started until they are finished. Responsibleness is the capacity to do the task assigned and to be accountable for it. Prudence is the ability to make sound and careful decisions by weighing the available options. Fickle-mindedness is the tendency of a person to think twice before finally making up his/her mind and to change his/her mind every once in a while.
Finally, stubbornness is the determination to do things despite any prohibitions, hindrances, and objections, together with being hard to convince that he/she has committed a mistake. The fourth dimension, emotional stability, includes four traits: restraint (+/-), sensitiveness (-), low tolerance for joking/teasing (-), and mood (-). Restraint is the tendency of a person not to show his/her intense emotions, keeping one's feelings in as a self-control strategy. Sensitiveness is the tendency of a person to be easily hurt or affected by little things said or done that he/she does not like. Low tolerance for joking/teasing is the tendency of a person to react with intense emotion to the teasing or provocation of other people.
Mood is the tendency to show unusual attitudes or behaviors and changing emotions because of an unexpected event. The last dimension of this test, intellect or openness to experience, includes three personality traits: thoughtfulness, creativity, and inquisitiveness. Thoughtfulness is the tendency of a person to be concerned with the future, especially regarding problems or troubles. Creativity is the natural ability of a person to make or create something out of local materials or resources, together with the ability to express oneself, a wide imagination, and a high inclination toward music, arts, and culture. Last, inquisitiveness is the trait of being curious and sometimes intrusive.

To establish norms, 3,702 respondents from different ethnic groups were asked to participate: 412 Bicolano, 152 Chabacano, 642 Ilocano, 489 Cebuano, 170 Ilonggo, 190 Kapampangan, 513 Tagalog, 378 Waray, 29 Zambal, and 83 others. For the validity of the test, all items are said to have a positive direction. Two validity subscales were used: denial (items the respondents are expected to disagree with, such as "I never told a lie in my entire life") and tradition (items the respondents are expected to agree with, such as "I would take care of my parents when I get old").

Panukat ng Pagkataong Pilipino
This test was developed by Annadaisy J. Carlota of the psychology department of the University of the Philippines. It was published in Quezon City in 1989. The PPP is a three-form personality test designed to measure 19 personality dimensions. Each personality dimension corresponds to a subscale composed of a homogeneous subset of items. The three forms are Form K, Form S, and Form KS.

Form K corresponds to the salient traits for interpersonal relations. Under this form are eight personality traits: thoughtfulness, social curiosity, respectfulness, sensitiveness, obedience, helpfulness, capacity to be understanding, and sociability. Thoughtfulness is the tendency to be considerate of others; a thoughtful person tries not to inconvenience other people. Social curiosity is inquisitiveness about others' lives; a socially curious person tends to ask about everything and loves to know everything that is happening around him/her. Respectfulness is the tendency to recognize others' beliefs and privacy; a respectful person's behavior is concretized, for example, by knocking on the door before entering. Sensitiveness is the tendency of a person to be easily affected by any negative criticism, so a sensitive person does not want to hear negative criticisms from other people. Obedience is the tendency of a person to do what others demand of him/her; an obedient person tends to follow whatever is commanded by others. Helpfulness is the tendency of a person to offer service to others, extend help, and give resources; it is characterized by a person who is always willing to lend his/her things to others. The capacity to be understanding is a person's tolerance of other people's shortcomings; when hurt by others, he/she is always ready to listen to explanations. Lastly, sociability is the ability of a person to easily get along with and befriend others; in social gatherings or events, this person takes the first move to introduce himself/herself to others.
The second form of this test is Form S, which includes seven factors: orderliness, emotional stability, humility, cheerfulness, honesty, patience, and responsibility. Orderliness is neatness and organization in one's appearance and work; an orderly person puts his/her things in their proper places. Emotional stability is the ability of a person to control his/her emotions and to remain calm even when facing great trouble. Humility is the tendency to remain modest despite accomplishments and to readily accept one's own mistakes; a humble person does not boast about his/her successes.
Cheerfulness is the disposition to be cheerful and to see the happy and funny aspects of things that happen; a cheerful person always finds the funny side of situations. Honesty is the sincerity and truthfulness of a person; an honest person tends to tell the truth in every situation regardless of the feelings of others. Patience is the ability to cope with daily life's routines and repetitive activities; a patient person is one who responds to a child's repetitive questions without getting mad. Lastly, responsibility is the tendency of a person to do a particular task on his/her own initiative; a responsible person is characterized by not procrastinating in accomplishing an activity.

The last form of the PPP, Form KS, has four subscales: creativity, risk-taking, achievement orientation, and intelligence. Creativity is the ability to be innovative and to think of various strategies in solving a problem. Risk-taking is the tendency to take on new challenges despite unknown consequences; a risk-taker believes that one must take risks to be successful in life. Achievement orientation is the tendency of a person to strive for excellence and to emphasize quality over quantity in every task he/she does. Lastly, intelligence is the tendency of a person to perceive himself/herself as intelligent; it is also characterized by easily understanding the material being read.

The test can be taken by persons aged 13 and above. It is written in Filipino and has translations in English, Cebuano, Ilokano, and Ilonggo. During its pretest, 245 respondents aged 13 to 81 were included; there were more females than males. Reliability was examined through internal consistency. All personality dimensions except achievement orientation had high reliability. The internal consistency analysis was done several times: at first the top 10 were taken, then the top 12, then the top 14; on the fourth run, the top 8 were taken and included in the inventory. Form K has a mean reliability coefficient of .69, Form S .81, and Form KS .72. For validity, construct validity was examined: intercorrelations among the subscales of the original version of the PPP, before clustering into the three forms, were obtained. The test was judged valid because, first, more positive than negative intercorrelations were obtained; second, within the personality subscales there were also more positive than negative correlations, except for social curiosity and sensitiveness; and lastly, the magnitude of the correlations was small to moderate, although the majority of the subscale correlations were significant at the alpha level of p = .05. The predominance of positive intercorrelations means that all of the subscales measure the same construct, which is personality. The test was standardized through norms developed in two forms: percentiles and normalized standard scores with a mean of 50 and a standard deviation of 10.

Attitude Tests
Survey of Study Habits and Attitudes
This test was developed to help meet a challenge: some students with high scholastic aptitude were doing very poorly in school, while some who were mediocre on scholastic tests were doing well. The test is an easily administered survey of study methods, motivation for studying, and certain attitudes toward scholastic activities that are important in the classroom.
The purpose of developing the SSHA was to identify students whose study habits and attitudes are different from those of students who earn high grades, to aid in understanding students with academic difficulties, and to provide a basis for helping such students improve their study habits and attitudes and thus more fully realize their best potentialities. In addition, study habits are believed to be a strong predictor of achievement. The test consists of Form C for college students and Form H for high school students (grades 7-12).
The four basic subscales include delay avoidance, work methods, teacher approval, and educational acceptance. The test has 100 items and can be used as a screening instrument, a diagnostic tool, a teaching aid, and a teaching tool. There were separate norms for the two forms. For Form C, 3,054 first-semester freshmen enrolled at the following nine colleges were included: Antioch College, Bowling Green State University, Colorado College, Reed College, San Francisco State College, Southwest Texas State College, Stephen F. Austin State College, Swarthmore College, and Texas Lutheran College. For Form H, 11,218 students in 16 different towns and metropolitan areas in America participated: Atacosta, Texas (grades 10-12); Austin, Texas (10-12); Buda, Texas (7-12); Durango, Colorado (10-12); Glen Ellyn, Illinois (9); Gunnison, Colorado (10-12); Hagerstown, Maryland (7-12); Marion, Texas (7-12); Navarro, Texas (7-12); New Braunfels, Texas (7-9); Salt Lake City, Utah (7-12); San Marcos, Texas (7-12); Seguin, Texas (7-12); St. Louis, Missouri (7-12); and Waelder, Texas (7-12).

The computed reliability coefficients were based on Kuder-Richardson formula 8 and ranged from .87 to .89. Using the test-retest method over a four-week interval, the coefficients were .93, .91, .88, and .90 for delay avoidance, work methods, teacher approval, and educational acceptance, respectively. Over a 14-week interval, the reliability coefficients were .88, .86, .83, and .85. For validity, the criterion used was the one-semester grade point average (GPA). The SSHA correlated with GPA from .27 to .66 for men and from .26 to .65 for women. The average validity coefficients for 10 colleges were .42 for men and .45 for women. When the SSHA was correlated with the ACE (American Council on Education Psychological Examination), a scholastic aptitude test, the correlations were consistently low. Form C of the SSHA correlated with GPA from .25 to .45, with a weighted average of .36. Averaging the subscale correlations with GPA using Fisher's z-functions yielded .31 for delay avoidance, .32 for work methods, .25 for teacher approval, and .35 for educational acceptance.

Work Values Inventory
This test intends to meet the need for assessing the goals that motivate people to work. It measures the values which are extrinsic to, as well as those which are intrinsic in, work: the satisfactions which men and women seek in work and the satisfactions which may be the concomitants or outcomes of work. It seeks to measure these in boys and girls, and in men and women, at all age levels beginning with adolescence and at all educational levels beginning with entry into junior high school. It is, both in the variety of values tapped and in the ages for which it is appropriate, a wide-range values inventory. Its factors are altruism, esthetic, creativity, intellectual stimulation, achievement, independence, prestige, management, economic returns, security, surroundings, supervisory relations, associates, way of life, and variety.

Altruism refers to work which enables the person to contribute to the welfare of others. Esthetic refers to work which permits one to make beautiful things and to contribute beauty to the world. Creativity pertains to work which permits one to invent new things, design new products, or develop new ideas. Intellectual stimulation refers to work which provides opportunity for independent thinking and for learning how and why things work. Achievement refers to work which gives one a feeling of accomplishment in doing a job well. Independence pertains to work which permits one to work in his own way, as fast or as slowly as he wishes.
Prestige pertains to work which gives one standing in the eyes of others and evokes respect. Management refers to work which permits one to plan and lay out work for others to do. Economic returns pertains to work which pays well and enables one to have the things he wants. Security pertains to work which provides one with the certainty of having a job even in hard times. Surroundings pertains to work which is carried out under pleasant conditions: not too hot, not too cold, not noisy or dirty, etc.
Supervisory relations refers to work which is carried out under a supervisor who is fair and with whom one can get along. Associates refers to work which brings one into contact with fellow workers whom he likes. Way of life refers to work that permits one to live the kind of life he chooses and to be the type of person he wishes to be. Variety refers to work that provides an opportunity to do different types of jobs. One of the items in the inventory under creativity is "Create new ideas, programs or structures departing from those ideas already in existence."

To set the standards of this test, norms were obtained. The samples were grade 7 (902 females, 925 males), grade 8 (862 females, 949 males), grade 9 (844 females, 931 males), grade 10 (772 females, 859 males), grade 11 (824 females, 814 males), and grade 12 (724 females, 672 males). Reliability was obtained through the test-retest method, and the reliability coefficients reported for the subscales were .83, .82, .84, .81, .83, .83, .76, .84, .88, .87, .74, .82, and .80. Validity was determined through construct, content, concurrent, and predictive validity. Some of the construct validity evidence was obtained by correlating the Altruism subscale with the Social Service scale (r = .67) and with the Social scale of the AVL (r = .29), and the Esthetic subscale with the Artist key of the SVIB (r = .55), the Artistic scale of the Kuder (r = .48), and the Aesthetic scale of the AVL (r = .08).

Interest Test
Brainard Occupational Preference Inventory
This test was designed to allow a systematic study of a person's interests. It is a standardized questionnaire designed to bring to the fore the facts about a person's occupational interests so that he and his advisers can more intelligently and objectively discuss his educational and occupational plans. The test is intended for students in grades 8 to 12 and for adults. It requires relatively low reading skills as determined by readability formulas. It provides information concerning a vital phase in the complex matter of setting a person's vocational plans wisely and planning a program for attaining his goals. It yields scores in six broad occupational fields for each sex. Both females and males obtain scores in the fields identified as commercial, mechanical, professional, esthetic, and scientific; the agricultural score is only for boys, and the personal service score is only for girls. Each field has 20 questions divided among four occupational sections. A 5-point scale is used, from strongly dislike to strongly like.

The norm sample included 10,000 students, both male and female, from grades 8 to 12 in 14 school systems. Reliability was obtained through the test-retest method: boys obtained r = .73 for commercial scores and .88 for scientific scores, while girls obtained .71 for commercial scores and .84 for esthetic scores. Another reliability method used was split-half: boys obtained .88 for commercial scores and .95 for mechanical and scientific scores, while girls obtained .82 for commercial scores and .95 for scientific scores. For validity, the Brainard was correlated with the Kuder Preference Record, and it was found that the latter measures interests differently: it focuses on interests by forcing respondents to choose among three activities indicative of different types of interest.
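Several of the inventories reviewed above report test-retest and split-half reliability coefficients. As a rough illustration of how such coefficients are typically computed (this is a minimal sketch using hypothetical scores, not data from any of these manuals), test-retest reliability is the Pearson correlation between two administrations of the same test, and split-half reliability is the correlation between odd- and even-item half scores stepped up with the Spearman-Brown formula:

```python
# Illustrative sketch (hypothetical data): computing test-retest and
# split-half reliability coefficients.
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two score lists of equal length."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    varx = sum((a - mx) ** 2 for a in x)
    vary = sum((b - my) ** 2 for b in y)
    return cov / (varx * vary) ** 0.5

# Test-retest reliability: correlate total scores from two administrations
# of the same test given some weeks apart (hypothetical examinees).
first_admin = [23, 30, 18, 25, 28, 21, 27, 19]
second_admin = [25, 29, 17, 26, 27, 22, 28, 20]
print("Test-retest r =", round(pearson_r(first_admin, second_admin), 2))

# Split-half reliability: correlate odd- and even-item half scores for each
# examinee, then adjust the half-test correlation to full-test length with
# the Spearman-Brown formula.
item_scores = [                      # rows = examinees, columns = items (0/1)
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
]
odd_half = [sum(row[0::2]) for row in item_scores]
even_half = [sum(row[1::2]) for row in item_scores]
r_half = pearson_r(odd_half, even_half)
split_half = (2 * r_half) / (1 + r_half)   # Spearman-Brown correction
print("Split-half reliability =", round(split_half, 2))
```

In practice, test developers compute these coefficients on large standardization samples; the tiny score lists here only show the mechanics of the formulas.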
Activity
A. Look for other standardized tests and report their current validity and reliability.
B. Administer the test that you created in Lesson 2 of Chapter 5 to a large sample. Then create a norm (one possible approach is sketched below).
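For Activity B, one common way to build a norm like the one described for the PPP is to convert raw scores to percentile ranks and then to normalized standard scores with a mean of 50 and a standard deviation of 10. The sketch below is only an illustration with a small set of hypothetical raw scores; an actual norming study would use a large, representative sample.

```python
# Illustrative sketch (hypothetical data): a simple norm table with
# percentile ranks and normalized standard scores (mean 50, SD 10).
from statistics import NormalDist

raw_scores = [34, 41, 28, 45, 38, 30, 41, 36, 44, 39, 33, 41]  # hypothetical sample

def percentile_rank(score, sample):
    """Percent of the sample scoring below the given score, counting half of the ties."""
    below = sum(1 for s in sample if s < score)
    ties = sum(1 for s in sample if s == score)
    return 100 * (below + 0.5 * ties) / len(sample)

def normalized_standard_score(score, sample, mean=50, sd=10):
    """Convert the percentile rank to a z score through the inverse normal
    distribution, then rescale to the chosen mean and standard deviation."""
    pr = percentile_rank(score, sample) / 100
    z = NormalDist().inv_cdf(pr)
    return mean + sd * z

# Print a small norm table for every distinct raw score in the sample.
for score in sorted(set(raw_scores)):
    pr = percentile_rank(score, raw_scores)
    t = normalized_standard_score(score, raw_scores)
    print(f"raw {score:2d}  percentile rank {pr:5.1f}  normalized score {t:5.1f}")
```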
References
(1973). Measuring intelligence with the Culture Fair Test: Manual for scales 2 and 3. Philippines: Institute of Personality and Ability Testing.
Bennett, G. K., Seashore, H. G., & Wesman, A. G. (1973). Fifth edition manual for the Differential Aptitude Tests, Forms S and T. New York: The Psychological Corporation.
Brainard, P. P., & Brainard, R. T. (1991). Brainard Occupational Preference Inventory manual. San Jose, CA.
Briggs, K. C., & Myers, I. B. (1943). The Myers-Briggs Type Indicator manual. Consulting Psychologists Press.
Brown, W. F., & Holtzman, W. H. (1967). Survey of Study Habits and Attitudes: SSHA manual. New York: The Psychological Corporation.
Carlota, A. (1989). Panukat ng Pagkataong Pilipino (PPP) manual. Quezon City, Philippines.
Edwards, A. L. (1959). Edwards Personal Preference Schedule manual. New York: The Psychological Corporation.
Enriquez, V. G., & Guanzon, M. A. (1975). Panukat ng Ugali at Pagkatao manual. PPRTH-ASP Panukat na Sikolohikal.
Flanagan, J. C. (1965). Flanagan Industrial Tests manual. Chicago, IL: Science Research Associates.
Gardner, E. F., Rudman, H. C., Karlson, B., & Merwin, J. C. (1981). Manual: Directions for administering the Stanford Achievement Test. New York: Harcourt Brace Jovanovich.
Guilford, J. P., & Zimmerman, W. S. (1949). Guilford-Zimmerman Temperament Survey: Manual of instructions and interpretations. New York: Harcourt Brace Jovanovich.
Otis, A. S., & Lennon, R. T. (1957). Otis-Lennon Mental Ability Test: Manual for administration. New York: Harcourt Brace Jovanovich.
Otis, A. S., & Lennon, R. T. (1979). Otis-Lennon Mental Ability Test: Manual for administration and interpretation. New York: Harcourt Brace Jovanovich.
Prescott, G. A., Balow, I. H., Hogan, T. P., & Farr, R. C. (1978). Metropolitan Achievement Tests, Advanced 2: Forms JS and KS. New York: Harcourt Brace Jovanovich.
Raven, J., Raven, J. C., & Court, J. H. (2003). Manual for Raven's Progressive Matrices and Vocabulary Scales. Section 1: General overview. San Antonio, TX: Harcourt Assessment.
Super, D. E. (1970). Manual: Work Values Inventory. Houghton Mifflin.
Thurstone, L. L., & Thurstone, T. G. (1967). SRA Verbal: Examiner's manual. Chicago, IL: Science Research Associates.
Watson, G., & Glaser, E. M. (1964). Watson-Glaser Critical Thinking Appraisal: Manual for Forms Ym and Zm. New York: Harcourt Brace Jovanovich.
Chapter 9
The Status of Educational Assessment in the Philippines
Objectives
1. Realize the strong foundation of the field of educational assessment in the Philippines.
2. Describe the history of formal assessment in the Philippines.
3. Describe the pattern of assessment practices in the Philippines.

Lessons
1. Assessment in the Early Years
2. Assessment in the Contemporary Period and Future Directions
Lesson 1
Assessment in the Early Years
Monroe Survey (1925)
Formal assessment in the Philippines started as a mandate from the government to look into the educational status of the country (Elevazo, 1968). The first assessment was conducted through a survey authorized by the Philippine legislature in 1925. The legislature created the Board of Educational Survey, headed by Paul Monroe. The board later appointed an Educational Survey Commission, which was also headed by Paul Monroe. The commission visited schools around the Philippines and observed the different activities conducted in them. The results of the survey reported the following:
1. The public school system, which is highly centralized in administration, needs to be humanized and made less mechanical.
2. Textbooks and materials need to be adapted to Philippine life.
3. Secondary education did not prepare students for life; training in agriculture, commerce, and industry was recommended.
4. The standards of the University of the Philippines were high and should be maintained by freeing the university from political interference.
5. Higher education should be concentrated in Manila.
6. English as the medium of instruction was best; the use of the local dialect in teaching character education was suggested.
7. Almost all teachers (95%) were not professionally trained for teaching.
8. Private schools, except those under religious groups, were found to be unsatisfactory.

Research, Evaluation, and Guidance Division of the Bureau of Public Schools
This division started as the Measurement and Research Division in 1924, an offshoot of the Monroe Survey. It was intended to be the major agent of research in the Philippines. Its functions were:
1. To coordinate the work of teachers and supervisors in carrying out testing and research programs
2. To conduct educational surveys
3. To construct and standardize achievement tests

Economic Survey Committee
Through a legislative mandate in 1927, the director of education created the Economic Survey Committee, headed by Gilbert Perez of the Bureau of Education. The committee studied the economic condition of the Philippines and made recommendations as to the best means by which graduates of the public schools could be absorbed into the economic life of the country. The results of the survey pertaining to education include:
1. Vocational education is relevant to the economic and social status of the people.
2. It was recommended that the work of the schools should not be to develop a peasantry class but to train intelligent, civic-minded homemakers, skilled workers, and artisans.
3. Secondary education should be devoted to agriculture, trades, industry, commerce, and home economics.
The Prosser Survey
In 1930, C. A. Prosser made a follow-up study on vocational education in the Philippines. He observed various types of schools and schoolwork and interviewed school officials and businessmen. In the survey he recommended improving various phases of vocational education, such as 7th grade shopwork, provincial trade schools, practical arts training in the regular high schools, home economics, placement work, gardening, and agricultural education.

Other Government-Commissioned Surveys
After the Prosser survey, several surveys were conducted after the 1930s mostly to determine the quality of schools in the country. All of these surveys were government commissioned, such as the Quezon Educational Survey in 1935 headed by Dr. Jorge C. Bocobo. Another study, made in 1939 as a sequel to the Quezon Educational Survey, made a thorough study of existing educational methods, curricula, and facilities and recommended changes in the financing of public education in the country. This was followed by another congressional survey in 1948 by the Joint Congressional Committee on Education to look into the independence of the Philippines from America. This study employed several methodologies.

UNESCO Survey (1949)
UNESCO undertook a survey of Philippine education from March 30 to April 16, 1948, headed by Mary Trevelyan. The objective of the survey was to look at the educational situation of the Philippines to guide planners of subsequent educational missions to the country. The report of the survey was gathered from a conference with educators and laymen from private and public schools all over the country. The following were the results:
1. There is a language problem; a research program was proposed.
2. There is a need for more effective elementary education.
3. Lengthening of the elementary-secondary program from 10 to 12 years.
4. Need to give attention to adult education.
5. Greater emphasis on the community school.
6. Conduct thorough surveys to serve as a basis for long-range planning.
7. Further strengthening of the teacher education program.
8. Teachers' incomes have not kept pace with the national income or the cost of living.
9. Delegation of administrative authority to provinces and chartered cities.
10. Decrease of national expenditure on education.
11. More financial support to schools from various sources was advocated.
The UNESCO study was followed by further government studies. In 1951, the Senate Special Committee on Educational Standards of Private Schools undertook a study of private schools. This study was headed by Antonio Isidro, and its purpose was to investigate the standards of instruction in private institutions of learning and to provide certificates of recognition in accordance with their regulations. In 1967 came the Magsaysay Committee on General Education, which was financed by the University of the East Alumni Association. In 1960, the National Economic Council and the International Cooperation Administration surveyed public schools. The survey was headed by Vitaliano Bernardino, Pedro Guiang, and J. Chester Swanson. Three recommendations were provided to public schools: (1) to improve the quality of educational services, (2) to expand the educational services, and (3) to provide better financing for the schools.

The assessments conducted in the early years were mandated, commissioned, and initiated by the government. The private sector was not yet included as a proponent of these studies, which were usually headed by foreign counterparts, as in the UNESCO, Monroe, and Swanson surveys. The focus of the assessments was the overall education of the country, which is considered national research, given the government's need to determine the status of education in the country.
Lesson 2
Assessment in the Contemporary Period and Future Directions
EDCOM Report (1991)
The EDCOM report in 1991 indicated that high dropout rates, especially in rural areas, were significantly marked. Learning outcomes, as shown by achievement levels, indicate the students' mastery of important competencies. There were high levels of simple literacy among both 15-24 year olds and those 15 years old and above. "Repetition in Grade 1 was the highest among the six grades of primary education reflects the inadequacy of preparation among the young children. All told, the children with which the formal education system had to work with at the beginning of EFA were generally handicapped by serious deficiencies in their personal constitution and in the skills they needed to successfully go through the absorption of learning."

Philippine Education Sector Study (PESS, 1999)
The PESS was jointly conducted by the World Bank and the Asian Development Bank. It recommended:
1. A moratorium on the establishment of state colleges and universities.
2. Weaning tertiary education institutions from public funding sources.
3. A more targeted program of college and university scholarships.

Aside from funding and conducting surveys that apply assessment methodologies and processes, the government also practiced testing, starting in 1924 with the screening of government employees. Students from grade four to fourth year high school were tested at the national level in 1960 to 1961. Private organizations also spearheaded the enrichment of assessment practices in the Philippines. These private institutions are the Center for Educational Measurement (CEM) and the Asian Psychological Services and Assessment Corporation (APSA).

Fund for Assistance to Private Education (FAPE)
FAPE started with testing programs such as the guidance and testing program in 1969. It began with the College Entrance Test (CET), which was first administered in 1971 and again in 1972. The consultants who worked with the project were Dr. Richard Pearson from the Educational Testing Service (ETS), Dr. Angelina Ramirez, and Dr. Felipe. FAPE then worked with the Department of Education, Culture, and Sports (DECS) to design the first National College Entrance Examination (NCEE), which would serve to screen fourth year high school students eligible to take a formal four-year course. There was a need to administer a national test at the time because most universities and colleges did not have an entrance exam to screen students. Later, the NCEE was completely endorsed by FAPE to the National Educational Testing Center of the DECS. The testing program of FAPE continued with the development of a package of four tests: the Philippine Aptitude Classification Test (PACT), the Survey/Diagnostic Test (S/DT), the College Scholarship Qualifying Test (CSQT), and the College Scholastic Aptitude Test (CSAT).
In 1978, FAPE institutionalized an independent agency, the Center for Educational Measurement, to undertake testing and other measurement services.

Center for Educational Measurement
CEM started as an initiative of the Fund for Assistance to Private Education (FAPE). CEM was headed by Dr. Leticia M. Asuzano, who was the executive vice-president. Since then, several private schools have become members of CEM to continue its commitment and goals. Since 1960, CEM has developed up to 60 tests focused on education, such as the National Medical Admission Test (NMAT). The main advocacy of CEM is to improve the quality of formal education through its continuing advocacy and support of systematic research. CEM promotes the role of educational testing and assessment in improving the quality of formal education at the institutional and systems levels. Through test results, CEM helps improve the effectiveness of teaching and student guidance.

Asian Psychological Services and Assessment Corporation
Aside from the CEM, by 1982 there was a growing demand for testing not only in the educational setting but also in the industrial setting. Dr. Genevive Tan, who was a consultant to various industries, felt the need to measure the Filipino 'psyche' in a valid way because most industries used foreign tests. The Asian Psychological Services and Assessment Corporation was created from this need. In 2001, headed by Dr. Leticia Asuzano, former EVP of CEM, APSA extended its services to testing in the academic setting because of the growing demand from private schools for quality tests. The mission of APSA is a commitment to deliver excellent and focused assessment technologies and competence-development programs to the academe and the industry, to ensure the highest standards of scholastic achievement and work performance, and to ensure stakeholders' satisfaction in accordance with company goals and objectives. APSA envisions itself as the lead organization in assessment and a committed partner in the development of quality programs, competencies, and skills for the academe and the industry.

APSA has numerous tests that measure mental ability, clerical aptitude, and work habits, as well as a supervisory attitudinal survey. For the academe, it has tests for basic education, the Assessment of College Potential, and the Assessment of Nursing Potential. In the future, the first Assessment of Engineering Potential and Assessment of Teachers Potential will be available for use in higher education. APSA pioneered the use of new mathematical approaches, specifically the Rasch model of item response theory (IRT), in developing tests, which goes beyond the norm-referenced approach (a brief sketch of the model follows below). In 2002, it launched standards-based instruments in the Philippines that serve as benchmarks for local and international schools. Standards-based assessment (1) provides objective and relevant feedback to the school on the quality and effectiveness of its instruction measured against national norms and international standards; (2) identifies the areas of strength and the developmental areas of the institution's curriculum; (3) pinpoints the competencies of students and their learning gaps, which serve as a basis for learning reinforcement or remediation; and (4) provides good feedback to the student on how well he has learned and his readiness to move to a higher educational level.
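To clarify the Rasch model mentioned above: in this one-parameter item response theory model, the probability of a correct response depends only on the difference between an examinee's ability and an item's difficulty, which is why scores can be referenced to item difficulties rather than to a norm group. The sketch below is purely illustrative, with hypothetical ability and difficulty values; it is not APSA's implementation.

```python
# Minimal illustration of the Rasch (one-parameter IRT) model using
# hypothetical ability and item-difficulty values on the logit scale.
import math

def rasch_probability(theta, b):
    """Probability of a correct response given ability theta and item difficulty b."""
    return math.exp(theta - b) / (1 + math.exp(theta - b))

abilities = [-1.0, 0.0, 1.5]          # hypothetical examinee abilities
item_difficulties = [-0.5, 0.5, 1.0]  # hypothetical item difficulties

for theta in abilities:
    probs = [round(rasch_probability(theta, b), 2) for b in item_difficulties]
    print(f"ability {theta:+.1f} -> probabilities of success {probs}")
```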
Building Future Leaders and Scientific Experts in Assessment and Evaluation in the Philippines
Only a few universities in the Philippines offer graduate training in measurement and evaluation. The University of the Philippines offers a master's program in education specializing in measurement and evaluation and a doctor of philosophy in research and evaluation. Likewise, De La Salle University-Manila has a master of science in psychological measurement offered by the psychology department, while its college of education, a center of excellence, has a master of arts in educational measurement and evaluation and a doctor of philosophy in educational psychology major in research, measurement, and evaluation. These are the only two universities in the Philippines that offer graduate training and specialization in measurement and evaluation. Some practitioners were trained in other countries, such as the United States and countries in Europe. There is a greater call for educators and those in industry involved in assessment to be trained in order to produce more experts in the field.
Professional Organization on Educational Assessment
Aside from the government and educational institutions, the Philippine Educational Measurement and Evaluation Association (PEMEA) is a professional organization geared toward promoting a culture of assessment in the country. The organization started with the National Conference on Educational Measurement and Evaluation, headed by Dr. Rose Marie Salazar-Clemeña, who was the dean of the College of Education of De La Salle University-Manila, together with the De La Salle–College of Saint Benilde's Center for Learning and Performance Assessment. The conference was attended by participants from all around the Philippines. Its theme was "Developing a Culture of Assessment in Learning Organizations," and it aimed to provide a venue for assessment practitioners and professionals to discuss the latest trends, practices, and technologies in educational measurement and evaluation in the Philippines. In the said conference, the PEMEA was formed. The purposes of the organization are as follows:
1. To promote standards in various areas of education through appropriate and proper assessment.
2. To provide technical assistance to educational institutions in the areas of instrumentation, assessment practices, benchmarking, and the process of attaining standards.
3. To enhance and maintain the proper practice of measurement and evaluation at both the local and international levels.
4. To enrich the theory, practice, and research in evaluation and measurement in the Philippines.

The first board of directors elected for the PEMEA were Dr. Richard DLC Gonzales as president (University of Santo Tomas Graduate School), Neil O. Parinas as vice president (De La Salle–College of Saint Benilde), Dr. Lina A. Miclat as secretary (De La Salle–College of Saint Benilde), Marife M. Mamauag as treasurer (De La Salle–College of Saint Benilde), and Belen M. Chu as PRO (Philippine Academy of Sakya). The board members are Dr. Carlo Magno (De La Salle University-Manila), Dennis Alonzo (University of Southeastern Philippines, Davao City), Paz H. Diaz (Miriam College), Ma. Lourdes M. Franco (Center for Educational Measurement), Jimelo S. Tapay (De La Salle–College of Saint Benilde), and Evelyn Y. Sillorequez (Western Visayas State University).
Aside from the universities and professional organizations that provide training in measurement and evaluation, the field is growing in the Philippines because of the periodicals that specialize in it. The CEM has its "Philippine Journal of Educational Measurement." The APSA continues to publish its "APSA Journal of SBA Research." And the PEMEA will soon launch the "Educational Measurement and Evaluation Review." Aside from these journals, there are Filipino experts from different institutions who have published their work in international journals and in journals listed in the Social Science Index.

Activity
Write an essay describing the future direction of educational assessment in the Philippines.
About the Authors
Dr. Carlo Magno is presently a faculty member of the Counseling and Educational Psychology Department at De La Salle University-Manila, where he teaches courses in measurement and evaluation, educational research, psychometric theory, and statistics. He took his undergraduate degree, a Bachelor of Arts major in Psychology, at De La Salle University-Manila. He took his master's degree in Education major in Basic Education Teaching at the Ateneo de Manila University. He received his PhD in Educational Psychology major in Measurement and Evaluation at De La Salle University-Manila with high distinction. He was trained in structural equation modeling at Freie Universität in Berlin, Germany. In 2005, he was named the Most Outstanding Junior Faculty in Psychology by the Samahan ng Mag-aaral sa Sikolohiya, and in 2007 he received the Best Teacher Students' Choice Award from the College of Education of DLSU-Manila. In 2008, he received the National Academy of Science and Technology award for Most Outstanding Published Scientific Paper in the Social Sciences. The majority of his research uses quantitative techniques in the field of educational psychology. Some of his work on teacher performance, learner-centeredness, measurement and evaluation, self-regulation, metacognition, and parenting has been published in local and international refereed journals and presented at local and international conferences. He is presently a board member of the Philippine Educational Measurement and Evaluation Association.
Jerome A. Ouano is a faculty member of the Counseling and Educational Psychology Department at De La Salle University-Manila, teaching courses in cognition and learning, interpersonal behavior, educational psychology, facilitating learning, and assessment of learning. He is a trainer in the implementation of the new Pre-service Teacher Education Curriculum of the Philippines in the areas of Assessment of Student Learning, Facilitating Learning, and Field Study, and he helps empower the administrators and teachers of many teacher education institutions in Mindanao. Mr. Ouano has presented empirical papers at local and international conferences. He has written books for Field Study courses and is active in sharing his expertise on assessment with in-service teachers in basic education as well as with tertiary faculty in many schools in the country. He obtained his bachelor's degree in Psychology and Philosophy from Saint Columban College and his master's degree in Peace and Development Studies from Xavier University. He is currently finishing his PhD in Educational Psychology major in Learning and Human Development at De La Salle University-Manila.