Professor Ralph Rowbottom
Are systems which use marks out of 10 or 100 as precise or scientific as they seem? Drawing upon experience of marking student scripts, the author looks at grading systems in general, and points to the solid merits of traditional A-to-C systems modified where necessary with pluses and minuses.
Soon after taking up a research post in social science at Brunel University in West London in the late 1960s, I was asked to do some postgraduate teaching and, in consequence, was introduced to the prevailing system for marking students’ written work. Not for Brunel, one of the new wave of technological universities, the primary-school simplicities of marks-out-of-ten, or antiquated schemes using the alphabetical symbols of classical Greek. With the sort of scientific precision looked for, nothing short of marks on a scale of 0–100 would do.
I remember at some early point thereafter receiving an essay from one student which I thought so good that I gave it a mark of 85, only to be gently informed by the head of department concerned that it was not the done thing to allocate marks of much above the mid-seventies. Further enquiries revealed the existence of three practical benchmarks in general use: at least 40 for a bare pass, at least 55 for an adequate pass, and at least 70 for a distinction. As well as the range 75 – 100 being out of play, as it were, people rarely if ever seemed to use markings of much below 30 – 35. Thus in practice, over half the scale was completely redundant – a fact which struck me, though seemingly nobody else, as rather strange.
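The three benchmarks just described can be put as a small sketch. The thresholds (40, 55, 70) come from the account above; the function name, the band labels and the "fail" label for marks below 40 are assumptions made purely for illustration.

```python
# Thresholds as reported in the text; names and labels are illustrative only.
BENCHMARKS = [
    (70, "distinction"),
    (55, "adequate pass"),
    (40, "bare pass"),
]


def band_for_mark(mark: int) -> str:
    """Return the band that a mark on the 0-100 scale falls into."""
    if not 0 <= mark <= 100:
        raise ValueError("mark must lie between 0 and 100")
    for threshold, label in BENCHMARKS:
        if mark >= threshold:
            return label
    return "fail"  # assumed label for anything below the bare-pass line


# Marks of 66, 67 and 68 all land in the same band.
print({mark: band_for_mark(mark) for mark in (66, 67, 68)})
```

Seen this way, the hundred-point scale carries only four outcomes that anyone acted upon, which is one way of stating why so much of it went unused.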
I struggled for a while, sometimes on my own, sometimes with co-examiners, with fine decisions like whether a particular piece of work merited a full 68 marks, a more modest 67, or even a bare 66. Gradually I evolved for my own use a much simpler system with a mere seven or so grades employing old-fashioned terms like A-minus and B-plus, into which I found I could rapidly slot incoming pieces of work (without indeed having to read, or in many cases even decipher, every individual word – still handwritten in those days). And I then translated the result into one of seven or so marks which I chose as standard, on the obligatory 0–100 scale.
Now what lay behind the official system here was, I suggest, typical of so many ill-conceived approaches to the subject of marking or (to use a more general and appropriate term) grading. What people would like in general from any system of marking or grading is something which is very precise – hence the choice of marks out of 100 rather than just 10 – and, even more important perhaps, something which is extremely objective: something which approaches scientific measurement rather than remaining a matter of subjective opinion. Unfortunately, grading is by its very nature not measurement, but a species of appraisal or evaluation, and there is all the difference in the world between the two.
Measurement versus Appraisal
Where it is possible and appropriate, true measurement is indeed objective. Looked at from a social or historical point of view, it can be seen that the process of measuring things grew precisely out of the need to provide an objective and authoritative ruling where individual judgement was fallible, or where (as often happened) the judgement of different protagonists was at odds. Would that heavy, and distant, log actually bridge the stream if effort was spent in dragging it there? What constituted a fair portion of meat, or volume of milk, for example, in fulfilment of an agreed purchase or bargain? Did the tenant deliver the full day’s work on his landlord’s fields, and just what did a ‘full day’ actually mean? Measuring-rules, weighing-scales, and clocks were devices developed precisely to settle these and similar uncertainties and disputes.
Now, there are many areas of human activity where judgements of value are regularly made, but where measurement, in the precise sense just described, is impossible. Is this person capable enough to operate as a surgeon (say), or to install high-voltage wiring? Has that employee’s performance justified the allocation of a merit award, or not? Is that inner-city school up to standard? Are those dresses-for-sale sufficiently well finished? Which of these graduates are crème-de-la-crème, and which are simply run-of-the-mill? Answers may be definitely needed, but no measuring instruments can be devised to settle any doubts or disagreements. Appraisals or evaluations have to be made as best as possible. In any particular case the outcome is in the end a matter of opinion, though the exercise of forming that opinion can certainly be helped by the creation of appropriate guides and criteria, and its authority enhanced by using judges who are knowledgeable and experienced in the field concerned, and regarded as people of suitable integrity and balance. In any such activities there may often be a case for expressing outcomes in the form of allocated grades, but (to repeat the point) this does not change what is going on: it remains some kind of appraisal, as opposed to true measurement.
The Unsoundness of Numerical-Marking Systems
Often, where performances or products are required to be judged or graded, numerical marking systems are used – marks out of ten, for example (or even marks out of a hundred). Using hard numbers appears more precise and scientific than just using terms like ‘Grade A’ or ‘Pass with Credit’, as well as allowing the creation of totals to indicate overall performance by the process of simple addition. But the precision is illusory, and the totals far less dependable indicators than might at first seem.
For a start, it is extremely difficult, in grading any one individual unit, to develop really useful guides or criteria to differentiate the allocation of, say, three-out-of-ten from four-out-of-ten, let alone, say, 67-out-of-100 from 68-out-of-100 – as already observed. And as for adding together (or averaging) the scores on many different units or tests to create an overall mark, there is always a hidden assumption that the individual units themselves are of identical value. But how can you prove, when trying to produce some overall index of a student’s performance, that the geography exam (say) was exactly as difficult as the physics exam? And who can prove without doubt, when adding up marks in an arithmetic test, that 173-take-away-98 (say) is truly as easy as 20-take-away-15? In both instances, the apparent objectivity and precision are spurious. We are forced to conclude that marking and aggregating systems of all these kinds are fundamentally flawed.
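The hidden assumption behind averaging can be made visible with a little arithmetic. The sketch below is illustrative only: the two students, their marks and both sets of weights are invented, and the function name is hypothetical. It shows that a plain average is simply a weighted average in which every unit has silently been given equal weight, and that an equally unprovable alternative weighting can reverse the ranking.

```python
def overall(marks: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of marks; a plain average is the equal-weights case."""
    total_weight = sum(weights[subject] for subject in marks)
    weighted_sum = sum(marks[subject] * weights[subject] for subject in marks)
    return weighted_sum / total_weight


# Invented marks for two hypothetical students.
student_x = {"geography": 70, "physics": 50}
student_y = {"geography": 55, "physics": 62}

# Two assumptions about the relative value of the exams, neither provable.
equal_value = {"geography": 1.0, "physics": 1.0}
physics_counts_double = {"geography": 1.0, "physics": 2.0}

for name, weights in (("equal value", equal_value),
                      ("physics counts double", physics_counts_double)):
    x = overall(student_x, weights)
    y = overall(student_y, weights)
    leader = "X" if x > y else "Y"
    print(f"{name}: X={x:.1f}, Y={y:.1f}, ahead: {leader}")
```

Under equal weighting student X comes out ahead (60.0 against 58.5); if physics is taken to count double, student Y does (about 59.7 against 56.7). Neither weighting is measured; each is a judgement presented as arithmetic.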
Appropriate Grading Systems
So, where performances or products need to be differentially appraised, and allocated some sort of label or certificate, how should we best proceed, given that numerical marking-systems claim more than they deliver, and are therefore basically unsound? The general answer is that some sort of non-numerical grading system is called for, and the guiding principle is that it should, in each case, be no more complex than the subject under review basically warrants. As I shall now argue, the most appropriate form of labelling in all more-developed systems is one using alphabetical letters. We have already ruled out numerical-marking. Alternatives which use terms like ‘Grade One’, ‘Grade Two’, etc. fail, at any rate where more complex systems are called for, to allow for the ready use of what I describe below as ‘secondary qualifications’. Systems of stars, ‘One Star’, ‘Two Star’ and so on, suffer this same disadvantage, as well as offering, once again, the seductive possibility of aggregating the stars allocated to individual units to produce a total ‘staridge’, some apparently-reliable index of overall value.
Let us look then at how a variety of well-designed grading systems might evolve, starting from the very simplest.
Basic Two-box Grading
The most simple grading system possible – and one in extremely common use – merely divides the things under consideration into two categories or boxes: good enough, and not good enough. These apples are judged saleable: those are not. The quality of these manufactured components is acceptable; the quality of those is not. These people are judged fit to drive on the public roads; those are not, yet. Neither numberings nor letterings are really required here. Descriptions like Pass and Fail, or OK and Reject, express the outcomes directly and without need for further explanation.
Basic Three-box Grading
Although pass or fail is the absolute basis of all grading, experience sometimes reveals an obvious and desirable possibility of sub-dividing the things which pass into two, thus creating three different categories or boxes. In many cases, and particularly where level of human performance is at issue, it will seem natural to identify some of the outcomes not just as acceptable or adequate, but as specially good or meritorious. This further division will be particularly appropriate where it marks the possibility of two different paths of future action: as when for example those who do specially well might be encouraged to embark on some further course of study, whilst those who do not, though still having passed, are advised to proceed no further. Here, different labels are clearly required for the two different levels of pass. They might be described in such terms as Pass and Pass-with-Merit, or as Grade 1 and Grade 2, or Premium Grade and Standard Grade. As we shall now see, the labels Grade A and Grade B offer certain additional advantages.
Sometimes, still further divisions seem called for. Indeed, it is often assumed that the general category of ‘passes’ can be split not just into two main sub-divisions (as just described) but into three, four, or as many as you like. So for example, hotels in Britain are currently graded by stars from one to five (the lowest presumably at least adequate) and national school-examinations also have five grades from A to E (all of which seem to rate as passes).
However, I suggest that these and all other systems which offer to create more than two basic grades of pass should be subject to the most critical scrutiny. It may not be too difficult for examiners or assessors to pick out from amongst all the units which do pass muster, one further group containing all or any which exhibit some kind of special quality or merit. But where these latter have to be further divided into those of ‘very special merit’, ‘very, very special merit’ and so on, it becomes difficult or impossible to draw up effective definitions and criteria to help any one examiner to make reliable decisions, let alone to co-ordinate the judgements of several. (It again becomes a bit like deciding what constitutes the sharp difference between a mark of six and a mark of seven out of ten, or a mark of 67 and a mark of 68 out of 100.)
Nevertheless it is true that, when undertaking grading in certain situations, other and finer distinctions do naturally present themselves, and invite formal recognition. Drawing on experience, I have come to believe, however, that these are best represented not as additional boxes or categories – that is to say, not as basic ones – but as some kind of secondary grades or qualifications within each. Let me illustrate.
Let us assume that we are using a basic three-box system, where passes of special merit, ordinary passes, and failures to pass, are identified respectively by the simple letters A, B, and C. Among the things which experience will sometimes throw up are cases where the performance or result not only passes, but shows significant flashes of special quality. However, the special quality is not manifested uniformly or regularly: only at some points, or in some respects. It is not sufficient to allow straightforward allocation of special merit, that is, Grade A. Should such types of case be recognised by the creation of a further main box or category? No, it is much clearer, I think, to stick to the main A-box, but to add a qualification: and the most convenient way of recording this is to use the old practice of adding a minus sign. Thus within the main category of specially-meritorious results, some may be allocated a full and straightforward A, whilst others are given an A-minus.
Then again, experience in certain situations shows a need for not just one, but two further qualifications to passes of general B-box quality. Firstly, particular specimens are not infrequently encountered which do overall warrant a pass, but which nevertheless show at certain points or in certain respects weaknesses which, had they been more widespread, would have forced a verdict of failure. Perhaps the students or producers concerned (or whoever they are) need to be warned that they are sailing perilously close to the wind and that only a little worsening is needed for any future products to become definitely unacceptable. Here again, the addition of a minus sign to the basic grade – that is, the label B-minus in this instance – is the most obvious way of registering the judgement. It is not that there is room for a separate and independent box between ‘fail’ and ‘basic pass’: but just that some secondary qualification is called for.
Experience suggests in certain circumstances the need for yet another qualification to the basic B-grade. Sometimes work is come across which undoubtedly warrants a pass. It positively glows with effort. The subject could hardly have been tackled more conscientiously or more comprehensively. But there is nevertheless nothing about it of outstanding quality. Solid, sterling, reliable: yes. Brilliant, inspired, outstanding: in no way. The natural label here is B-plus.
As to further possible secondary qualifications, things like A-double-minus, or C-plus, I can offer no definite advice beyond saying that I have found no use for them in my own experience, and suggesting that any that are proposed should be subjected to hard and critical scrutiny. With things like double plusses, or double minuses, my feeling once again is that it becomes difficult or impossible to specify clear criteria to distinguish them from their near neighbours.
Super-excellence and Abysmal-failure
Nevertheless there are at least two more secondary qualifications which seem to communicate necessary and clear messages in many situations: one at the very top of the scale, and one at the very bottom.
Sometimes one comes across a performance or product that is of such quality that it stands out strongly even from other meritorious (Grade A) cases. A potential Picasso has appeared from amongst the painters, a Nijinsky amongst the dancers, a Best amongst the footballers. Even if it does not manifest quite so dramatically, the occurrence concerned certainly seems to justify the description ‘super-excellent’. A case can perhaps be made for reserving a special box for such possible occurrences, so creating three possible levels of pass: though this has the effect of downgrading the general run of excellent passes, whilst leaving a higher grade which may be very rarely awarded. Alternatively such events might be marked by the allocation of an A-plus: though this merely seems to suggest something a little better than a straight A, whereas what is happening is the discovery of something in quite a different class. My own preference in such cases is to use the convention of a starred-A. This declares that something extraordinary is present, without significantly detracting from the general run of specially-good performers or products.
At the other extreme are instances which do not merely fail. They are so awful as to arouse feelings, more or less strong, of shock or embarrassment. Something has gone badly awry. It is not just a case of saying ‘not good enough, you must try harder if you are going to get through’. The feeling is that the person offering the performance (or product) should never have got into this endeavour in the first place; and a gross mistake has been made in letting him or her do so. If they are to do any good at all, they need to be elsewhere, attempting something much easier. This then, is not just failure, but abysmal failure. The appropriate labelling is not, I suggest, Grade D, and certainly not Grade C-minus, but something much stronger, like Unclassifiable.
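Gathered together, the labels discussed so far form a short ordered scale, which may be sketched as follows. This is an illustration under stated assumptions rather than a definitive scheme: the class and function names are hypothetical, and placing Starred-A within the A-box and Unclassifiable within the C-box is one reading of the argument. No numerical values are attached to the labels, so they can be compared but not added or averaged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grade:
    label: str    # the label as awarded
    box: str      # basic box: "A" (merit pass), "B" (pass) or "C" (fail)
    passes: bool  # whether the label counts as a pass


# Ordered from best to worst. Only the order is meaningful; no marks attach.
SCALE = (
    Grade("Starred-A", "A", True),        # super-excellence (assumed A-box)
    Grade("A", "A", True),
    Grade("A-minus", "A", True),          # flashes of special quality only
    Grade("B-plus", "B", True),           # solid and thorough, not brilliant
    Grade("B", "B", True),
    Grade("B-minus", "B", True),          # passes, but close to the line
    Grade("C", "C", False),
    Grade("Unclassifiable", "C", False),  # abysmal failure (assumed C-box)
)


def rank(label: str) -> int:
    """Position of a label in the scale; 0 is the best."""
    for position, grade in enumerate(SCALE):
        if grade.label == label:
            return position
    raise KeyError(label)


def is_pass(label: str) -> bool:
    """True if the label counts as a pass."""
    return SCALE[rank(label)].passes


def better_than(first: str, second: str) -> bool:
    """True if the first label ranks above the second."""
    return rank(first) < rank(second)
```

For example, `better_than("A-minus", "B-plus")` is true, and `is_pass("B-minus")` is true while `is_pass("C")` is not. Three basic boxes carry eight labels in all, and nothing in the sketch invites a total or an average.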
So where have we got to? The first thing which I have stressed is that grading is not measurement, but a process of appraisal or evaluation. Allocating grades is intrinsically subjective: in the end, a matter of opinion. Nevertheless, grading can be made more reliable if (a) the systems used are no more complex than the situation demands, and (b) the criteria for allocation are clear and effective. All numerical marking systems are basically suspect. Using them, it is all too easy to assume that aggregate or average marks give a reliable assessment of overall performance (which they do not); and, more insidiously, that the process is as objective as counting or measurement (which it is not).
The use of alphabetical labels avoids both these traps. Moreover it readily allows the awarding, through additional plusses and minuses, of finer or secondary qualifications. (And it is not too difficult for boards of examiners to decide overall grades where such are needed, by simply scanning all the individual grades which have been allocated, A, B plus, A minus, etc., whilst simultaneously taking into account any other factors which they may judge relevant – not too difficult, and a good deal preferable to a mechanical system of totalling or averaging.) However, even within this general approach, there seems to be a strict limit to the number of possible secondary qualifications, and more importantly, to the number of basic grades (boxes) that can be validly employed. With more than a very few secondary qualifications, and perhaps three basic grades at the most, it appears that the criteria for deciding between neighbouring grades become too vague to be of practical use and reliability.
What are the deeper implications of this? As I have said, grading is essentially about assessing the quality of something, not the quantity of something. And when it comes to activity of the first kind, it may possibly be that the human mind can in fact only register with any certainty one basic distinction: those things in relation to any given standard that are up to it (acceptable, good, right, etc.), and those that are not (sub-standard, bad, unsuitable, wrong, etc.). Seen thus, those things within the family of the acceptable that are perceived as in some way specially meritorious (Grade A, outstanding, excellent, etc.) might be viewed as possible contenders in another and higher league, with a different and significantly higher standard of pass or fail. ‘Super-excellent’ things (Starred-A) might be those judged to be in some even higher league; and abysmally-bad things, in one quite inferior.
In other words, it seems possible that our judgements of quality never relate just to one scale, with one central standard, but to a whole hierarchy of potential scales, each based on its own separate central standard, whether or not we are consciously aware of this. And this in turn might mark, or indeed, stem from, some natural hierarchy of levels which exists in various kinds of human ability. It might reflect the fact that human abilities, and thus in turn qualities of performance, do not vary along one long continuous scale, but form into discrete bands; bands with variation within each, but with significant jumps from one to another. (See the writings of Elliott Jaques on this score.) Thus, to take an example from one of my own particular fields of interest, it seems to be generally agreed that whilst Haydn is not quite up to Mozart in compositional ability, both are in a different, and higher class, than say, Schumann or Chopin, great though both these two latter composers undoubtedly are.
Of course there is an even bigger issue here, which is why it is necessary to grade things in the first place. There is an obvious answer where practical consequences follow: which hotel to book at, which nuclear engineer to employ, and so on. But why do we seem so anxious to grade all kinds of things when there are no discernible practical consequences? Why, for example, are we so intent on grading our children in every aspect of their education, and from the earliest point, in such things as their knowledge of geography, their athletic performance, or their position in class: when every award to a child which is less than the highest subtly downgrades the recipient? All this, however, is meat for quite another discussion.
All readers are welcome to use this material for whatever purposes they may have. When doing so, please attribute authorship to Professor Ralph Rowbottom.