It is a common mistake to assume the terms “validity” and “reliability” have the same meaning. While they are related, the two concepts are very different. In an effort to clear up any misunderstandings about validity and reliability, I have defined each here for you.

Reliability

Of the two terms, reliability is the simpler concept to explain and understand. If you are focusing on the reliability of a test, all you need to ask is: are the results of the test consistent? If I take the test today, a week from now, and a month from now, will my results be the same? If an assessment is reliable, your results will be very similar no matter when you take the test. If the results are inconsistent, the test is not considered reliable.

Validity

Validity is a bit more complex because it is more difficult to assess than reliability. There are various ways to demonstrate that an assessment is valid, but in simple terms, validity refers to how well a test measures what it is supposed to measure. There are several approaches to determining the validity of an assessment, including the assessment of content, criterion-related, and construct validity.
An assessment demonstrates content validity when the criteria it measures align with the content of the job. The extent to which that content is essential to job performance (versus merely useful to know) is also part of determining how well the assessment demonstrates content validity. For example, the ability to type quickly would likely be considered a large and crucial aspect of the job for an executive secretary compared to an executive. While the executive is probably required to type, that skill is not nearly as important to performing the job. Ensuring an assessment demonstrates content validity entails judging the degree to which test items and job content match each other.
An assessment demonstrates criterion-related validity if its results can be used to predict a facet of job performance. Determining whether an assessment predicts performance requires that assessment scores be statistically evaluated against a measure of employee performance. For example, an employer interested in how well an integrity test identifies individuals who are likely to engage in counterproductive work behaviors might compare applicants’ integrity test scores to how many accidents or injuries those individuals have on the job, whether they engage in on-the-job drug use, or how many times they ignore company policies. The degree to which the assessment is effective in predicting such behaviors is the extent to which it exhibits criterion-related validity.
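For readers who want to see what "statistically evaluated against a measure of employee performance" can look like in practice, here is a minimal Python sketch (my addition, not from the original article; all scores and the misconduct criterion are invented) that summarizes criterion-related validity as a correlation coefficient:

```python
# Hypothetical sketch: criterion-related validity as a Pearson correlation
# between assessment scores and a job-performance criterion. Data invented.
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))


# Invented applicant data: integrity test scores (higher = more integrity)
# and counts of policy violations later observed on the job.
integrity_scores = [82, 75, 90, 60, 70, 88, 55, 95]
policy_violations = [1, 3, 0, 6, 4, 1, 7, 0]

r = pearson_r(integrity_scores, policy_violations)
print(f"validity coefficient r = {r:.2f}")
# Strongly negative here: higher integrity scores go with fewer violations,
# i.e. the test exhibits criterion-related validity for this made-up criterion.
```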
An assessment demonstrates construct validity if it is related to other assessments measuring the same psychological construct, a construct being a concept used to explain behavior (e.g., intelligence, honesty). For example, intelligence is a construct used to explain a person’s ability to understand and solve problems. Construct validity can be evaluated by comparing intelligence scores on one test to intelligence scores on other tests (e.g., the Wonderlic Cognitive Ability Test to the Wechsler Adult Intelligence Scale).
Reliable and Valid?

The tricky part is that a test can be reliable without being valid; however, a test cannot be valid unless it is reliable. An assessment can provide you with consistent results, making it reliable, but unless it measures what it is supposed to measure, it is not valid. What are the biggest questions you have surrounding reliability and validity?
19 Responses to “A valid test is always reliable but a reliable test is not necessarily valid”
raw2392 says: December 8, 2011 at 3:26 pm
Really good blog; you can tell you have a great understanding of reliability and validity and have clearly done some good research into these areas before writing this post. Validity and reliability are both important when it comes to research and the output it produces: if a test is not valid, then the results that come from it cannot be trusted, so validity is essential for the results to be taken seriously. As you quoted, “a valid test is always reliable but a reliable test is not necessarily valid.” Reliability is also important so that a test can be repeated later by different researchers and the same results found, and so that the research can be retested before it is published. A brilliant blog, though, that really explained the two concepts well!
Mudasir says: November 20, 2017 at 12:34 pm
I agree with you, but the quote “a valid test is always reliable but a reliable test is not necessarily valid” is correct. In my view reliability carries more weight, because it has been proven over repeated administrations. Validity is judged on the spot for a given test or questionnaire; it is also important, but not as important as reliability.
psud6e says: December 8, 2011 at 5:20 pm
I think this blog was very good, with a clear understanding of reliability and validity, so much so that you were able to put it in simple terms the majority of people will be able to understand. Today, the reliability of research in psychology seems to be tested only when new researchers investigate a topic. When researchers complete an experiment and have gathered all their data, they don't run the experiment again to look for the same results; instead, they compare it to other research in the same field. If other researchers support their findings, the research can be assumed to be reliable. In the same way, the reliability of ground-breaking research, work completely new to the field, is only established when another experiment in the same field is run.

Validity, however, is a harder concept to check. To be honest, can we really be sure something is 100% valid? We often use hypothetical constructs, especially in psychology, whereby we assume one thing indicates another. For example, in personality questionnaires, we ask questions that we assume are related to the personality type we are trying to measure.

I definitely agree with you that reliability and validity are both important in testing, and a good example of this is in medicine. We want a drug trial to be reliable, because then we know the drug is safe to use when it is manufactured, the next stage after the trialling process. We also want the trial to be valid, meaning that it actually tests the effects of the drug. If it did not, we might not be able to see and measure the side effects, and a drug that is actually dangerous might be manufactured. Reliability and validity are therefore definitely important.

As mentioned in Key Concepts, reliability and validity are closely related. To better understand this relationship, let's step out of the world of testing and onto a bathroom scale. If the scale is reliable, it tells you the same weight every time you step on it, as long as your weight has not actually changed. However, if the scale is not working properly, this number may not be your actual weight. If that is the case, this is an example of a scale that is reliable, or consistent, but not valid. For the scale to be valid and reliable, not only does it need to tell you the same weight every time you step on it, but it also has to measure your actual weight. Switching back to testing, the situation is essentially the same. A test can be reliable, meaning that test-takers will get the same score no matter when or where they take it, within reason of course. But that doesn't mean it is valid, that is, measuring what it is supposed to measure. A test can be reliable without being valid. However, a test cannot be valid unless it is reliable.
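The scale analogy is easy to simulate. The following Python sketch (my addition; the bias and noise values are invented) produces readings that agree closely with one another, and are therefore reliable, while consistently missing the true value, and are therefore not valid:

```python
# Minimal simulation of the bathroom-scale analogy: a measurement can be
# reliable (consistent) without being valid (accurate). Numbers invented.
import random

random.seed(0)
TRUE_WEIGHT = 145.0  # what the scale *should* report


def biased_scale(true_weight):
    """A miscalibrated scale: very consistent, but 15 pounds low."""
    return true_weight - 15.0 + random.gauss(0, 0.1)  # tiny random noise


readings = [biased_scale(TRUE_WEIGHT) for _ in range(5)]
print([round(r, 1) for r in readings])
# Roughly [130.0, 130.0, 129.9, ...]: the readings agree with each other
# (reliable), but none of them is the true weight of 145 (not valid).
```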
The answer above gives an excellent definition of reliable and unreliable tests, what they are, and how they work. Because of this, I am going to elaborate a little more on your question: an unreliable test is simply unreliable, but how and why can a reliable test become, or be, unreliable?

First, I will talk about how a test is conducted. (Even though your question is about the social sciences, it is a little easier to understand if I talk about science; a bit further down I will carry the process over to the social sciences.) Let us begin with an elementary science experiment: I place Plant A in the sunlight and Plant B under a box. Over the course of a week, I monitor the plants in order to discover how important light is to their growth. Part of the reliability of the test is based on whether all the control factors are the same. Jimmy, who is running the test, must keep all other variables constant: both plants need to be watered on the same schedule, one that is optimal for the plant type. Jimmy must also maintain the integrity of the test; he cannot leave the box off on days when he "forgets." If he does so, the test becomes unreliable because its integrity has not been maintained: it is invalid, and invalid tests are unreliable because no conclusions can be drawn from them.

If Jimmy maintains the integrity of the test, then it is valid, and thus reliable on the surface. However, researchers know that a single test can produce bad results. (Maybe one plant is particularly hardy or has a genetic mutation.) Adding more experiments whose results all agree makes the test more reliable. If I conducted the plant experiment three times and all three results matched my first experiment, I would increase my test's reliability. If the test were run 300 times and half the time the plant under the box lived and half the time it died, the test would be deemed unreliable; researchers would recognize that something in the experiment is askew and must be resolved, and the experiment re-run, before conclusions could be drawn. The closer the experiment gets to 100%, the same results every time, the more reliable it is.

In the social sciences, tests are not as easy to control, because most of the time you are dealing with some form of human nature. Every effort must be made to keep the integrity of the test: use a control group, assign subjects randomly (with blind or double-blind protocols), and maintain ethics. If these are not followed, the test is considered invalid and thus unreliable.
Results from a test that is invalid (unreliable) cannot be made reliable; the process whereby the results were obtained is corrupted, so they cannot be used. However, if the experiment is conducted properly and on the surface is considered valid, but the results conflict, it becomes unreliable. Performing multiple tests increases the reliability of the test. As with the plant experiment, if the test is performed 300 times and 90% of the results are the same, the reliability is supported; the larger the discrepancy between experiments, the more unreliable the test becomes.

For example, let us say I observe a kindergarten class to determine whether self-control is an important skill for classroom learning in kindergarten. My research assistant notes the behavior of each child: focus, following directions, and behavior issues (temper tantrums) are observed and documented over the course of a month. A different research assistant (who knows nothing of the first research assistant's work) is then assigned to take each kindergartner into a room and put a cookie on a plate. He tells the child that he is going to leave the room and will come back in three minutes; if the cookie is still on the plate, the child will get two cookies instead of one, but if the child eats the cookie, he or she will not get another. The research assistant leaves the room and the child is observed, filmed or monitored by a third research assistant. The child's reaction to the test is noted and the end result of the experiment is documented. (Who received two cookies and who ate the one?) The results are compared to the first researcher's notes: did the students who did better in class wait for the second cookie?

As long as the protocols are maintained, this is a valid experiment. However, in order to establish the reliability of the research, the study should be conducted multiple times with exactly the same protocols. The higher the frequency of similar results, the more reliable the test and its results. If the results vary widely with each experiment, or if a separate researcher finds conflicting results through a second experiment, the test is considered unreliable. However, if a third and fourth researcher find the same results as the first researcher, the reliability is restored and the second researcher's results are called into question. (This is common in the social sciences.)
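As a rough illustration of the replication logic described above, here is a small Python sketch (my addition; the 90% agreement probability is invented and stands in for a real experiment) that counts how often repeated runs reproduce the original outcome:

```python
# Hedged sketch: re-run an experiment many times and treat the proportion
# of runs that reproduce the original result as a rough index of reliability.
import random

random.seed(1)


def run_experiment(p_expected=0.9):
    """One replication; returns True if it reproduces the original result.
    p_expected is an invented probability standing in for a real experiment."""
    return random.random() < p_expected


N = 300
matches = sum(run_experiment() for _ in range(N))
print(f"{matches}/{N} replications matched ({matches / N:.0%})")
# ~90% agreement suggests a reliable result; agreement near 50% (chance
# level for a two-outcome experiment) would suggest the result is unreliable.
```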
POHNPEI397 | CERTIFIED EDUCATOR
A test is reliable if it gets the same value over and over. A test is valid if it is truly measuring what the researcher thinks it is measuring. A test can be reliable but not valid. Let's say I wanted to measure how smart people are by measuring their heads. I'd get the same value (in inches or centimeters or whatever) every time I measured a person's head, so the test would be reliable. But I wouldn't actually be measuring intelligence, so the test wouldn't be valid.

A test cannot be valid if it is not reliable. If the test is not reliable, that means it gives different results every time I do it, and if it keeps giving different results, it cannot possibly be measuring what I think it is. Let's say I give a person multiple tests to measure intelligence and they get wildly different results on the tests. Clearly, the tests are not really measuring intelligence, because if they truly measured intelligence they would have to yield results that were nearly the same every time (because we assume a person's intelligence doesn't change from moment to moment).
C. Reliability and Validity

In order for assessments to be sound, they must be free of bias and distortion. Reliability and validity are two concepts that are important for defining and measuring bias and distortion.
Reliability refers to the extent to which assessments are consistent. Just as we enjoy having reliable cars (cars that start every time we need them), we strive to have reliable, consistent instruments to measure student achievement. Another way to think of reliability is to imagine a kitchen scale. If you weigh five pounds of potatoes in the morning, and the scale is reliable, the same scale should register five pounds for the potatoes an hour later (unless, of course, you peeled and cooked them). Likewise, instruments such as classroom tests and national standardized exams should be reliable – it should not make any difference whether a student takes the assessment in the morning or afternoon; one day or the next.
Another measure of reliability is the internal consistency of the items. For example, if you create a quiz to measure students’ ability to solve quadratic equations, you should be able to assume that if a student gets an item correct, he or she will also get other, similar items correct. The following table outlines three common reliability measures.
Type of Reliability | How to Measure
Stability or Test-Retest | Give the same assessment twice, separated by days, weeks, or months. Reliability is stated as the correlation between scores at Time 1 and Time 2.
Alternate Form | Create two forms of the same test (vary the items slightly). Reliability is stated as the correlation between scores on Test 1 and Test 2.
Internal Consistency (Alpha, α) | Compare one half of the test to the other half, or use methods such as the Kuder-Richardson Formula 20 (KR20) or Cronbach's Alpha.
The values for reliability coefficients range from 0 to 1.0. A coefficient of 0 means no reliability and 1.0 means perfect reliability. Since all tests have some error, reliability coefficients never reach 1.0. Generally, if the reliability of a standardized test is above .80, it is said to have very good reliability; if it is below .50, it would not be considered a very reliable test.
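To make the table and the coefficient scale concrete, here is a small Python sketch (my addition; all scores are invented) that computes a test-retest coefficient as a Pearson correlation and an internal-consistency coefficient as Cronbach's alpha:

```python
# Illustrative sketch of two reliability measures from the table above,
# computed by hand. All data invented.
from statistics import mean, pvariance, stdev


def pearson_r(xs, ys):
    """Pearson correlation: here, scores at Time 1 vs. scores at Time 2."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))


def cronbach_alpha(items):
    """items: one list of scores per test item, same students in each list."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each student's total
    item_var = sum(pvariance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var / pvariance(totals))


time1 = [12, 15, 9, 18, 14, 11]   # scores at Time 1 (invented)
time2 = [13, 14, 10, 17, 15, 11]  # same students at Time 2
print(f"test-retest r = {pearson_r(time1, time2):.2f}")

# Three items scored 0-5 for four students (invented):
items = [[3, 4, 2, 5], [2, 4, 1, 5], [3, 5, 2, 4]]
print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")
```

Coefficients from either method fall on the same 0 to 1.0 scale described above, so a value like .89 would count as very good reliability.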
Validity refers to the accuracy of an assessment -- whether or not it measures what it is supposed to measure. Even if a test is reliable, it may not provide a valid measure. Let's imagine a bathroom scale that consistently tells you that you weigh 130 pounds. The reliability (consistency) of this scale is very good, but it is not accurate (valid) because you actually weigh 145 pounds (perhaps you re-set the scale in a weak moment)! Since teachers, parents, and school districts make decisions about students based on assessments (such as grades, promotions, and graduation), the validity of the inferences drawn from the assessments is essential -- even more crucial than the reliability. Also, if a test is valid, it is almost always reliable.
There are three ways in which validity can be measured. In order to have confidence that a test is valid (and therefore that the inferences we make based on the test scores are valid), all three kinds of validity evidence should be considered.

Type of Validity | Definition | Example/Non-Example
Content | The extent to which the content of the test matches the instructional objectives. | A semester or quarter exam that only includes content covered during the last six weeks is not a valid measure of the course's overall objectives -- it has very low content validity.
Criterion | The extent to which scores on the test are in agreement with (concurrent validity) or predict (predictive validity) an external criterion. | If the end-of-year math tests in 4th grade correlate highly with the statewide math tests, they would have high concurrent validity.
Construct | The extent to which an assessment corresponds to other variables, as predicted by some rationale or theory. | If you can correctly hypothesize that ESOL students will perform differently on a reading test than English-speaking students (because of theory), the assessment may have construct validity.
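As an illustration of the Criterion row above, the following Python sketch (my addition; all scores are invented) estimates concurrent validity by correlating hypothetical end-of-year 4th-grade math scores with statewide math test scores:

```python
# Invented illustration of concurrent validity: agreement between a
# classroom test and an external criterion, expressed as a correlation.
from statistics import mean, stdev


def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))


classroom = [78, 85, 62, 90, 71, 88, 55, 95]  # end-of-year test (invented)
statewide = [74, 88, 65, 93, 70, 84, 58, 97]  # statewide test (invented)
print(f"concurrent validity r = {pearson_r(classroom, statewide):.2f}")
# A high positive r indicates the classroom test agrees with the external
# criterion, i.e. high concurrent validity.
```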
So, does all this talk about validity and reliability mean you need to conduct statistical analyses on your classroom quizzes? No, it doesn't. (Although you may, on occasion, want to ask one of your peers to verify the content validity of your major assessments.) However, you should be aware of the basic tenets of validity and reliability as you construct your classroom assessments, and you should be able to help parents interpret scores for the standardized exams.
Try This

Reflect on the following scenarios.

1. A parent called you to ask about the reliability coefficient on a recent standardized test. The coefficient was reported as .89, and the parent thinks that must be a very low number. How would you explain to the parent that .89 is an acceptable coefficient?

2. Your school district is looking for an assessment instrument to measure reading ability. They have narrowed the selection to two possibilities -- Test A provides data indicating that it has high validity, but there is no information about its reliability. Test B provides data indicating that it has high reliability, but there is no information about its validity. Which test would you recommend? Why?
A good classroom test is valid and reliable. Validity is the quality of a test that measures what it is supposed to measure. It is the degree to which evidence, common sense, or theory supports any interpretations or conclusions about a student based on his or her test performance. More simply, it is how one knows that a math test measures students' math ability, not their reading ability. Another aspect of test validity of particular importance for classroom teachers is content-related validity: do the items on a test fairly represent the items that could be on the test? Reasonable sources for "items that should be on the test" are class objectives, key concepts covered in lectures, main ideas, and so on. Classroom teachers who want to make sure that they have a valid test from a content standpoint often construct a table of specifications, which specifically lists what was taught and how many items on the test will cover those topics. The table can even be shared with students to guide them in studying for the test and as an outline of what was most important in a unit or topic.

Reliability is the quality of a test that produces scores that are not affected much by chance. Students sometimes randomly miss a question they really knew the answer to, or get an answer correct just by guessing; teachers can sometimes make an error or score inconsistently with subjectively scored tests. These are problems of low reliability. Classroom teachers can address low reliability in some simple ways. First, a test with many items will usually be more reliable than a shorter test, as whatever random fluctuations in performance occur over the course of a test will tend to cancel themselves out across many items (see the sketch after this passage). By the same token, a class grade will itself be more reliable if it reflects many different assignments or components. Second, the more objective a test is, the fewer random errors there will be in scoring, so teachers concerned about reliability are often drawn to objectively scored tests. Even when using a subjective format, such as supply items, teachers often use a detailed scoring rubric to make the scoring as objective, and therefore as reliable, as possible.

Classroom tests can also be categorized based on what they are intended to measure. Traditional paper-and-pencil classroom tests (e.g., multiple-choice, matching, true-false) are best used to measure knowledge. They are typically objectively scored (a computer with an answer key could score them). Performance-based tests, sometimes called authentic or alternative tests, are best used to assess student skill or ability. They are typically subjectively scored (a teacher must apply some degree of opinion in evaluating the quality of a response). Performance-based tests are discussed in a separate area on this website.

Tests designed to measure knowledge are usually made up of a set of individual questions. Questions can be of two types: a) selection items, which allow students to select a correct answer from a list of possible correct answers (e.g., multiple-choice, matching), and b) supply items, which require students to supply the correct answer (e.g., fill-in-the-blank, short answer). Scoring selection items is usually quicker and more objective; scoring supply items tends to take more time and is usually more subjective. Sometimes teachers decide to use selection items when they are interested in measuring basic, lower levels of understanding (at the knowledge or comprehension level in a Bloom's taxonomy sense; Bloom et al., 1956) and supply items when they are interested in higher levels of understanding, but a well-written selection item can still get at higher levels of understanding.

Teacher-made tests can also be distinguished by when they are given and how the results are used. Tests given at the end of a unit or semester, after learning has occurred, are called summative tests; their purpose is to assess learning and performance, and they usually affect a student's class grade. Tests given while learning is occurring are called formative tests; their purpose is to provide feedback so that students can adjust how they are learning or teachers can adjust how they are teaching. Usually these tests do not affect student grades.

Classroom assessment is an integral part of teaching (Chase, 1999; Popham, 2002; Trice, 2000; Ward & Murray-Ward, 1999) and may take more than one-third of a teacher's professional time (Stiggins, 1991). Most classroom assessment involves tests that teachers have constructed themselves. It is estimated that 54 teacher-made tests are used in a typical classroom per year (Marso & Pigge, 1988), which results in perhaps billions of unique assessments yearly worldwide (Worthen, Borg, & White, 1993). Regardless of the exact frequency, teachers regularly use tests they have constructed themselves (Boothroyd, McMorris, & Pruzek, 1992; Marso & Pigge, 1988; Williams, 1991). Further, teachers place more weight on their own tests in determining grades and student progress than they do on assessments designed by others or on other data sources (Boothroyd et al., 1992; Fennessey, 1982; Stiggins & Bridgeford, 1985; Williams, 1991).

Most teachers believe that they need strong measurement skills (Wise, Lukin & Roos, 1991). While some report that they are confident in their ability to produce valid and reliable tests (Oescher & Kirby, 1990; Wise et al., 1991), others report a level of discomfort with the quality of their own tests (Stiggins & Bridgeford, 1985) or believe that their training was inadequate (Wise et al., 1991). Indeed, most state certification systems and half of all teacher education programs have no assessment course requirement, or even an explicit requirement that teachers have received training in assessment (Boothroyd et al., 1992; Stiggins, 1991; Trice, 2000; Wise et al., 1991). In addition, teachers have historically received little or no training or support after certification (Herman & Dorr-Bremme, 1984). The formal assessment training teachers do receive often focuses on large-scale test administration and standardized test score interpretation rather than on the test construction strategies or item-writing rules that teachers need (Stiggins, 1991; Stiggins & Bridgeford, 1985).

A quality teacher-made test should follow valid item-writing rules. However, empirical studies establishing the validity of item-writing rules are in short supply and often inconclusive, and "item-writing rules are based primarily on common sense and the conventional wisdom of test experts" (Millman & Greene, 1993, p. 353).
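The earlier claim that a longer test is usually more reliable can be made precise with the Spearman-Brown prophecy formula, a classical psychometric result (not something this passage derives); the Python sketch below uses invented reliability values:

```python
# Spearman-Brown prophecy formula: predicted reliability when a test is
# lengthened with comparable items. Reliability values below are invented.


def spearman_brown(r, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`
    (e.g. 2.0 = doubled); r is the current reliability coefficient."""
    return (length_factor * r) / (1 + (length_factor - 1) * r)


r_current = 0.60  # reliability of a short quiz (invented)
print(f"doubled: {spearman_brown(r_current, 2.0):.2f}")  # -> 0.75
print(f"tripled: {spearman_brown(r_current, 3.0):.2f}")  # -> 0.82
# Adding comparable items raises reliability, which is why a grade based on
# many assignments is more reliable than one based on a single test.
```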
Even after half a century of psychometric theory and research, Cronbach (1970) bemoaned the almost complete lack of scholarly attention paid to achievement test items. Twenty years after Cronbach's warning, Haladyna and Downing (1989) reasserted this claim, stating that the body of knowledge about multiple-choice item writing, for example, was still quite limited; when revisiting the issue a decade later, they added that "item writing is still largely a creative act" (Haladyna, Downing & Rodriguez, 2002, p. 329). The current empirical research literature on item-writing rules of thumb focuses on studies that examine the relationship between a given item format and either test performance or psychometric properties of the test related to the format choice. There are some guidelines supported by experimental or quasi-experimental designs, but the foundation of best practices in this area remains, essentially, the recommendations of experts. Common sense, along with an understanding of the nature of the two characteristics of all quality tests (validity and reliability), provides the framework that teachers use to make the best choices when designing student assessments.

Developed by: Bruce B. Frey, Ph.D., University of Kansas