Testing is an inseparable part of the learning process. For the past few decades, there have been strong protests against the testing movement, but equally strong moves toward more and more testing and higher stakes associated with test results, especially in India. To make sense of this confusing state of affairs, it might help to examine very carefully what is meant by a ‘test’. Broadly speaking, a test measures the recall of knowledge from long-term memory, and better tests will measure the recall and application of this knowledge in new situations. By itself, thus defined, nothing could be less controversial. The problem, of course, lies in the purposes of testing, and these could raise the stakes so high that anxiety, nervous breakdown and even suicide can follow.

So it would be best to separate these two discussions: one on the purposes and uses of tests, and one on the nature and quality of tests.

Suppose a student learns about Mughal rule in a history class, or about the mean, median and mode in a mathematics class. This new knowledge is encoded in complex ways in his brain, in the pattern of interconnections among millions of neurons. As his teacher, I would naturally like to have a fairly accurate picture of what he has learned. What is a good way to go about this?

Unfortunately, there is currently no way to ‘scan’ my student’s brain to determine what is in there. Imagine how different things would be if, instead of tests and examinations, students had to submit to a painless brain scan at the end of every year! I need some other way of eliciting the stored understanding possessed by my student. One way is to give the student a context in which he needs to use those memories to perform a task, and this could be a worksheet, a homework assignment, or an academic test. His performance on the task should allow me to infer something about his knowledge and understanding.

However, this apparently simple process is susceptible to three problems: lack of clear definition of ‘understanding’ or even ‘knowledge’, lack of validity of the measurement process and unacceptably large measurement error.

To help clarify these three closely related issues, look at an academic test simply as a measuring instrument, like a metal ruler or a weighing scale or a blood pressure monitor. In the case of a metal ruler, the aim is to measure length. People are agreed on the definition of length (a dictionary says it is the ‘extent from end to end’), and we don’t hear anyone arguing that the length of a line should be the angle it makes with the North-South axis of the earth, for example. Given this agreement, the metal ruler does a fairly good job of measuring length. The way we define length and the way we measure it are closely connected, and this makes the measuring instrument ‘valid’. Also, repeated measures using the metal ruler will give the same value for length, within a narrow band of error. An instrument like this, low on measurement error, is called ‘reliable’. Without these properties, the instrument would be virtually useless. Thus to measure length, you would not use a compass (different definition of length), a metal ruler with some centimetres marked longer than others (low validity), or a calibrated elastic band (high measurement error)!

In the case of an academic test, the aim is to measure the student’s knowledge and understanding. Right away, we can see that we do not possess a measuring instrument as valid and reliable as a metal ruler. A person’s height is easy to measure, but her level of understanding of a concept is hidden, dynamic, and must be inferred from her behaviour. One of the aims of psychological research is to create better and better measures of all manner of psychological constructs, from love to self-esteem to intelligence. It is interesting that many people who would protest at the idea of measuring love would nevertheless cheerfully accept an IQ test as a measure of intelligence! Why do we think that intelligence or understanding are any more susceptible to accurate measurement than love?

Let us look at how we would create a test for academic understanding. There must first of all be agreement on how we define knowledge and understanding, just as there was for length. It is an interesting question to stop now and ask yourself as a teacher - how would you define knowledge and understanding in your field?

Answering the question in terms of other psychological constructs does not help; to say that a student has understood when he has comprehended just extends the problem to the new words.

Quite likely, your answer goes something like: He has understood when he can do x, y or z. Thus our answer is usually in terms of the measurement process itself; we say that understanding is the ability to answer a particular set of questions correctly. This is on the right track, because in fact, an unambiguous operational definition of a psychological construct must be stated in terms of the method of measuring it. The problem however is that, unlike in the case of length, not everybody will agree on my definition, on my choice of questions. I am sure you have seen tests that you felt were not a good measure of knowledge and understanding in a particular subject. For example, multiple-choice questions have been heavily criticised for being poor measures of understanding.

This fundamental problem with definition is easy to forget, especially in a time when performance on tests has become synonymous with intelligence and understanding. It is an extremely illuminating exercise for a teacher just to step back and ask herself, ‘Is this examination or test getting at the essence of what I want my students to learn? What aspects of a subject do I want all my students to master? What are the abilities that will distinguish those who are competent from those who are excellent?’

At this point, you may say: the tests and examinations that my students take are doing quite a good job of this; good students do well and weak students do badly; is that not what the test is supposed to accomplish? But there is a flaw in this situation. What we teach is highly constrained, almost completely determined, by the tests we are aiming for. Our impression as teachers of who are the ‘good’ and ‘weak’ students is shaped by the tasks we structure, which in turn are shaped by the tests. As a crude example, suppose we want to teach cooking and we know that the final examination (for some unknown reason!) involves mainly baking. We might decide to spend most, if not all, of our classes on baking. In the end, however, the course is still called ‘Cooking’, and the certificate still says ‘Diploma in Cooking’, and we may still think of our successful students as good cooks!

Keeping this in mind, let us look now at the closely linked issue of validity: is the test measuring what it claims to measure? One way of assessing the validity of a test is to see whether its scores correlate well with scores on other tests. The problem with this kind of validity is that if all the tests are measuring the students’ ability to, say, memorise certain facts, they may all correlate well with each other, but that still does not guarantee that any of them is really testing understanding. Another way of checking a test’s validity is to see how it relates to performance in ‘real life’. One problem with this is that real-life situations are dramatically different from end-of-year test situations! Real life involves people working together over longer periods of time on open-ended problems with information available to them, but testing is typically individual, time-bound, close-ended and closed-book. Studies have found that test scores correlate well with performance in college, but beyond college the correlations break down considerably (think of all the geniuses we have heard of who did miserably in school). Thus the question of a test’s validity, too, gives us plenty to think about.
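The first check mentioned above, correlating one test’s scores with another’s, is easy to sketch numerically. The following Python sketch uses invented scores for ten students on two hypothetical memorisation-heavy tests (all numbers are illustrative, not real data):

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two lists of scores."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented scores for ten students on two memorisation-heavy tests.
test_a = [55, 62, 70, 48, 85, 90, 66, 74, 58, 80]
test_b = [50, 65, 72, 45, 88, 85, 60, 78, 55, 83]

r = pearson(test_a, test_b)
print(f"correlation between the two tests: {r:.2f}")
```

A correlation this high only shows that the two tests agree with each other; if both merely reward recall, neither is thereby shown to measure understanding.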

Turning to reliability, what can we say about the measurement error of a test? Recall the calibrated elastic band that introduces error each time it measures. Is it possible for a test to introduce error while measuring knowledge or understanding that is present? The answer is: yes! Psychologists use a method called repeated testing to investigate this, and their research concludes that retrieval from memory is a highly variable process. On different occasions, being asked questions about what one has learned will not reliably bring out the same remembered knowledge. In between the repeated testing, of course, there is no further study of the material.
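The variability of retrieval can be illustrated with a toy simulation. The model below is an assumption made purely for illustration: the student ‘knows’ 50 facts, and on any one sitting each fact is recalled independently with probability 0.85, so identical repeated tests of constant knowledge still produce different scores.

```python
import random

random.seed(1)

KNOWN_FACTS = 50        # facts the student has actually learned (assumed)
RETRIEVAL_PROB = 0.85   # chance each fact is recalled on one sitting (assumed)

def take_test():
    """Score (per cent) on one sitting: each known fact is recalled
    independently with probability RETRIEVAL_PROB."""
    recalled = sum(random.random() < RETRIEVAL_PROB for _ in range(KNOWN_FACTS))
    return 100 * recalled / KNOWN_FACTS

# Ten sittings of the same test, with no study in between.
scores = [take_test() for _ in range(10)]
print("ten sittings of the same test:", [round(s) for s in scores])
# The knowledge is constant, yet the measured score varies from sitting to sitting.
```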

Repeated testing is the only way to estimate the measurement error of a test, and in real life we do not test repeatedly. But we can be sure that the error is at least several percentage points. For example, if a student scores 88 per cent, his true score may lie somewhere between 82 per cent and 94 per cent. It is good to keep this in mind when we use test scores to make important decisions or judgments about students. This brings me to the next section of this article.
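Classical test theory gives one standard way of putting numbers on such a band: the standard error of measurement, SEM = SD x sqrt(1 - r), where SD is the spread of scores and r is the test’s reliability. The Python sketch below uses illustrative numbers (an SD of 15 points and a reliability of 0.96) chosen purely to reproduce a band of roughly plus or minus 6 points around a score of 88:

```python
import math

def score_band(observed, sd, reliability, z=2.0):
    """Band of plausible true scores: observed score plus or minus z
    standard errors of measurement, SEM = sd * sqrt(1 - reliability)."""
    sem = sd * math.sqrt(1 - reliability)
    return observed - z * sem, observed + z * sem

# Illustrative numbers: score spread of 15 points, reliability of 0.96.
low, high = score_band(observed=88, sd=15, reliability=0.96)
print(f"observed 88 per cent; true score plausibly between {low:.0f} and {high:.0f}")
# -> observed 88 per cent; true score plausibly between 82 and 94
```

Note how even a test with reliability as high as 0.96 leaves a 12-point band, which is why cut-offs based on a fraction of a percentage point are hard to defend.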

The purposes and uses of tests

There are several possible reasons for testing students’ learning, and I have listed the ones I could think of below. For each of these, one can ask whether tests really serve a useful purpose, and whether they serve it well. Here is a brief summary.

A test can tell me what my student has learned. The first section of this article has dealt with the extent to which a test can measure a student’s learning, and the issues involved in improving its efficacy. It can tell me what my student has learned, provided it is valid and reliable. These two crucial qualities of a good test are all too rarely examined in children’s academic tests, at least in India. I sincerely believe that a great deal of energy is wasted preparing students for an examination that one does not feel comfortable with. Either we in India should make efforts to change the nature of the examinations (efforts are being made in some boards), or we should reduce the extreme emphasis on their results.

A test can tell the student what she does and does not know. True, but if the only feedback the students get from a test is a number, it cannot tell them much about what they do and do not know. Classroom tests that the teacher sets and corrects, giving the necessary feedback, are more useful in this regard than end-of-year examinations, whose results are used to decide whether the student can proceed to the next level.

At a broader level, the results of examinations are used to make decisions regarding admissions. This is called ‘high stakes testing’, and the way we make decisions about a student’s future is questionable, to say the least. This is especially true in our country, where distinctions are made on the basis of a fraction of a percentage point. Test scores are simply not reliable enough to make such fine distinctions. There are other, more broad based ways to select candidates for an institution: these naturally require a greater investment of time, energy and imagination. It is so much easier to just use the number cut-offs, and anyway it appears as though institutions have bought into the notion that the numbers reflect ‘real’ abilities and ‘real’ differences.

At an even broader level, testing makes schools and teachers accountable, and can evaluate the educational system that is in place. But as an almost inevitable corollary, we fall into the trap of ‘teaching to the test’, and tests finally begin to shape our curricula. This insidious gravitational process trickles down right to preschool entrance examinations! Now schools that produce ‘rankers’ are considered to be good schools (although most of the time these ranks are maintained by ‘weeding out’ academically weaker students right from Class 9). But a recent survey reported in India Today (27 November 2006) indicates that all is not well with our educational system, in terms of our students’ reasoning and creative problem solving abilities. This conclusion was reached after constructing new kinds of questions and administering them to students from ‘good schools’, and finding that too many children could not answer them correctly. Whether one accepts the findings of the survey or not, it is valuable to re-examine periodically the tests and examinations we have come to put so much faith in.

What about those unfortunate side-effects of testing: comparison and competition? It is true that human beings have a ‘natural’ (a word I dislike using casually, but cannot elaborate on now) tendency to want to measure themselves, compare themselves with others, and try to outdo each other. Yet nothing latches onto that tendency and makes as much of it as the whole testing movement. Of course, by amplifying our need for one-upmanship and our fear of being left behind, tests motivate us to work harder than we otherwise might! At the end of the process, it could be argued that we will accomplish more than we otherwise might have. That is, some people will argue that testing, and the whole carrot-and-stick approach it has been reduced to, has yielded higher overall achievement and helped humans reach higher levels of excellence. I have yet to be convinced (a) that this is true, and (b) that this achievement is worth the cost. Again, achievement is defined rather narrowly, in accordance with the goals of business and industry. As David Orr says so succinctly, ‘The relationship between education and decent behaviour of any sort is not exactly straightforward.’