I was intrigued to learn earlier this year that Oxford University Press had launched a new online test of English language proficiency, called the Oxford Test of English (OTE). At the conference where I first heard about it, I was struck by the fact that the presentation of the OUP-sponsored plenary speaker was entitled ‘The Power of Assessment’ and dealt with formative assessment / assessment for learning. Oxford clearly want to position themselves as serious competitors to Pearson and Cambridge English in the testing business.

The brochure for the exam kicks off with a gem of a marketing slogan, ‘Smart. Smarter. SmarTest’ (geddit?), and the next few pages give us all the key information.

Faster and more flexible

‘Traditional language proficiency tests’ is presumably intended to refer to the main competition (Pearson and Cambridge English). Cambridge First takes, in total, 3½ hours; the Pearson Test of English Academic takes 3 hours. The OTE takes, in total, 2 hours and 5 minutes. It can be taken, in theory, on any day of the year, although this depends on the individual Approved Test Centres, and, again, in theory, it can be booked as little as 14 days in advance. Results should take only two weeks to arrive. Further flexibility is offered in the way that candidates can pick ’n’ choose which of the four skills they want to have tested, just one or all four, although, as an incentive to go the whole hog, they will only get a ‘Certificate of Proficiency’ if they do all four.

A further incentive to do all four skills at the same time can be found in the price structure. One centre in Spain is currently offering the test for one single skill at €41.50, but do the whole lot, and it will only set you back €89. For a high-stakes test, this is cheap. In the UK right now, both Cambridge First and Pearson Academic cost in the region of £150, and IELTS a bit more than that. So, faster, more flexible and cheaper … Oxford means business.

Individual experience

The ‘individual experience’ on the next page of the brochure is pure marketing guff. This is, after all, a high-stakes, standardised test. It may be true that ‘the Speaking and Writing modules provide randomly generated tasks, making the overall test different each time’, but there can only be a certain number of permutations. What’s more, in ‘traditional tests’, like Cambridge First, where there is a live examiner or two, an individualised experience is unavoidable.

More interesting to me is the reference to adaptive technology. According to the brochure, ‘The Listening and Reading modules are adaptive, which means the test difficulty adjusts in response to your answers, quickly finding the right level for each test taker. This means that the questions are at just the right level of challenge, making the test shorter and less stressful than traditional proficiency tests’.

My curiosity piqued, I decided to look more closely at the Reading module. I found one practice test online which is the same as the demo that is available at the OTE website. Unfortunately, this example is not adaptive: it is at B1 level. The actual test records scores between 51 and 140, corresponding to levels A2, B1 and B2.

Test scores

The tasks in the Reading module are familiar from coursebooks and other exams: multiple choice, multiple matching and gapped texts.

Reading tasks

According to the exam specifications, these tasks are designed to measure the following skills:

  • Reading to identify main message, purpose, detail
  • Expeditious reading to identify specific information, opinion and attitude
  • Reading to identify text structure, organizational features of a text
  • Reading to identify attitude / opinion, purpose, reference, the meanings of words in context, global meaning

The ability to perform these skills depends, ultimately, on the candidate’s knowledge of vocabulary and grammar, as can be seen in the examples below.

Task 1

Task 2

How exactly, I wonder, does the test difficulty adjust in response to the candidate’s answers? The algorithm that is used depends on measures of the difficulty of the test items. If these items are to be made harder or easier, the only significant way that I can see of doing this is by making the key vocabulary lower- or higher-frequency. This, in turn, is only possible if vocabulary and grammar have been tagged as being at a particular level. The most well-known tools for doing this have been developed by Pearson (with the GSE Teacher Toolkit) and Cambridge (with English Profile). To the best of my knowledge, Oxford does not yet have a tool of this kind (at least, none that is publicly available). However, the data that OUP will accumulate from OTE scripts and recordings will be invaluable in building a database which their lexicographers can use in developing such a tool.
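To make the question concrete, here is a minimal sketch of one way an adaptive test could adjust difficulty. It is not a description of how the OTE works, since the actual algorithm is not public. It assumes a one-parameter (Rasch) model, an invented item bank with difficulties on a logit scale, and a deliberately crude update rule; every name and number in it is hypothetical.

```python
import math

# Hypothetical item bank: item id -> difficulty on a logit scale (invented values).
ITEM_BANK = {
    "item_a": -1.5,
    "item_b": -0.5,
    "item_c": 0.0,
    "item_d": 0.7,
    "item_e": 1.6,
}


def p_correct(ability, difficulty):
    """Rasch model: probability that a candidate answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def next_item(ability, used):
    """Pick the unused item whose difficulty is closest to the ability estimate."""
    candidates = {k: d for k, d in ITEM_BANK.items() if k not in used}
    return min(candidates, key=lambda k: abs(candidates[k] - ability))


def update_ability(ability, difficulty, correct, step=0.5):
    """Crude update (an assumption, not a standard estimator): nudge the
    estimate up after a correct answer and down after a wrong one, in
    proportion to how unexpected the response was under the model."""
    observed = 1.0 if correct else 0.0
    return ability + step * (observed - p_correct(ability, difficulty))


# Simulated session: start at 0.0 and feed in three made-up responses.
ability, used = 0.0, set()
for answer in [True, True, False]:
    item = next_item(ability, used)
    used.add(item)
    ability = update_ability(ability, ITEM_BANK[item], answer)
    print(item, round(ability, 2))
```

Everything in a scheme like this hinges on the difficulty values attached to the items, which is exactly where level-tagged vocabulary and grammar would come in.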

Even when a data-driven (and numerically precise) tool is available for modifying the difficulty of test items, I still find it hard to understand how the adaptivity will impact on the length or the stress of the reading test. The Reading module is only 35 minutes long and contains only 22 items. Anything that is significantly shorter must surely impact on the reliability of the test.
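A rough way to put numbers on this worry is to look at how much statistical information a set of items carries about a candidate. The sketch below assumes a logistic item response model with an optional guessing parameter (one in three for a three-option multiple-choice item) and the standard Fisher information formula for that model. The abilities and difficulties are invented, the scale is logits rather than OTE scores, and nothing here is taken from the OTE specifications.

```python
import math


def p_correct(ability, difficulty, guess=0.0):
    """Probability of a correct answer; `guess` is the chance of a lucky hit."""
    logistic = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return guess + (1.0 - guess) * logistic


def item_information(ability, difficulty, guess=0.0):
    """Fisher information of one item (discrimination fixed at 1).
    With guess=0 this reduces to p * (1 - p), the Rasch case."""
    p = p_correct(ability, difficulty, guess)
    q = 1.0 - p
    return (q / p) * ((p - guess) / (1.0 - guess)) ** 2


def standard_error(ability, difficulties, guess=0.0):
    """Approximate standard error of the ability estimate, in logits."""
    total = sum(item_information(ability, d, guess) for d in difficulties)
    return 1.0 / math.sqrt(total)


# 22 items, candidate ability fixed at 0.0 (all values are illustrative).
on_target = [0.0] * 22    # every item pitched exactly at the candidate
off_target = [2.0] * 22   # every item two logits too hard

print(standard_error(0.0, on_target))             # well-targeted, no guessing
print(standard_error(0.0, off_target))            # badly targeted, no guessing
print(standard_error(0.0, on_target, guess=1/3))  # well-targeted, with guessing
```

Under these assumptions, 22 perfectly targeted items give a standard error of roughly 0.43 logits, and allowing for a one-in-three chance of guessing pushes that to roughly 0.60. Targeting helps, in other words, but with so few items the margin of error stays far from negligible.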

My conclusion from this is that the adaptive element of the Reading and Listening modules in the OTE is less important to the test itself than it is to building a sophisticated database (not dissimilar to the GSE Teacher Toolkit or Cambridge English Profile). The value of this will be found, in due course, in calibrating all OUP materials. The OTE has already been aligned to the Oxford Online Placement Test (OOPT) and, presumably, coursebooks will soon follow. This, in turn, will facilitate a vertically integrated business model, like Pearson and CUP, where everything from placement test, to coursework, to formative assessment, to final proficiency testing can be on offer.

  1. lexicojules says:

    Just a note on vocab – I don’t know whether it’s used for this test, but the new version of the Oxford 3000 word list is graded by CEFR level – publicly available here: https://www.oxfordlearnersdictionaries.com/wordlists/

    • philipjkerr says:

      Thanks, Julie. For many purposes, I find the Oxford 2000, 3000 and 5000 very useful, but their limitations for adaptive purposes are that (1) the entries are words, rather than word senses, and (2) the CEFR bands are fairly broad … not that I think that precise numerical values (for level) are particularly meaningful, but personalization programs would need to differentiate at least between, say, B1 and B1+.

  2. Ed Hackett says:

    Hi Philip,

    We read your post with great interest. Thank you very much for your review of the test. We noted some questions that you raised and hope this reply offers some more information and insight into the test development process. If you would like to explore anything further we would be very happy to talk in more detail.

    You can find detailed information about the development and validation of the test in our Test Specifications document: http://fdslive.oup.com/www.oup.com/elt/general_content/global/ote/ote_global_oxford_test_of_english_test_specifications_2019.pdf?cc=gb&selLanguage=en&mode=hub

    You mention the randomly generated tasks, and you query the number of permutations of these tasks. All test modules draw upon a large item bank, allowing the test engine to generate multiple permutations of the test. The adaptive modules, Listening and Reading, also employ an adaptive algorithm as well as randomization. We provide more detailed information about this in the ‘Test Delivery’ section (page 14) of the above document. Whilst there is a mathematical limit to the number of permutations the test can deliver, the chance of two test takers in the same test session receiving exactly the same sequence of tasks is extremely small.

    Regarding how the adaptive nature of the test impacts on the length or stress of the test, the Oxford Test of English aims to target the majority of test items at the estimated ability of the test taker, providing just the right amount of challenge. In doing this, it avoids presenting test takers with tasks that are significantly above their ability, avoiding the additional stress that this may cause. As regards test length, adaptive tests can be shorter than traditional linear tests as the majority of tasks delivered are close to the estimated ability of the test taker. The reliability of a test is not just based on the length of the test, but the closeness of the difficulty of the tasks to the estimated ability of the test taker. Test items that are too easy or too difficult for a test taker provide limited information on the actual ability of the test taker and therefore do not contribute as much to the overall reliability of the test. Adaptive tests can therefore be shorter than equivalent linear tests, whilst delivering the same reliability.

    I hope this provides some clarification and please don’t hesitate to get in touch if you would like to discuss further.

    All the best,

    Ed Hackett, Head of Assessment Research, Oxford University Press

    • philipjkerr says:

      Many thanks for taking the trouble to comment, Ed.
      I read the test specifications, and quoted them in the post, but didn’t provide a link – thanks for doing that. The document doesn’t really shed much light on how the algorithms work: it just says that algorithms adjust for level of difficulty without explaining how. I’ll try to explain what it is that I want to understand better, although I appreciate that you may not wish to answer.
      In the first item (illustrated in the post), the text itself contains high frequency vocabulary with just one (causative) structure that may cause problems. However, even if you don’t know / understand ‘heating’ and ‘get it fixed’, the basic message is still comprehensible. The challenge comes in the lexical choices of the verbs in the three possible answers. The difficulty, therefore, is essentially lexical. But if a candidate gets the answer right, it doesn’t necessarily tell us much: it might have been a good guess, it might be that in their L1 ‘encourage’ is a cognate (even if they didn’t know the word in English), it might be that their level is C2+ and the whole thing is very easy. Likewise, if the candidate gets the answer wrong, we can’t know why they got the wrong answer.
      Moving on to the second item, which is more challenging (lower frequency vocabulary, three different modal verbs, plus the fact that the stuff about ‘facilities’ is a major distraction), a correct or incorrect answer doesn’t necessarily tell us very much. When I first did this item, I got it wrong.
      In both cases, what is being evaluated does not seem to be ‘reading to identify main message, purpose, detail’. The focus is on lexical and grammatical knowledge. This may be a proxy for the reading skills which are the intended focus, but it remains a proxy and a fairly unreliable one at that. The margin of error must be high.
      This brings me on to the question of length of the test. I know that reliability is not just determined by length, but the more items that are set / answered, the more closely one can match ‘the difficulty of the tasks to the estimated ability of the test taker’. With a grand total of only 22 items, a proportion of the answers to which may not tell us very much (especially as there is a one in three chance of getting things right by chance), the algorithms don’t have a lot of data to work from.
      ‘Just the right amount of challenge’ is, I would have thought, pretty much impossible to achieve, because it presupposes precise calculation of the value of a large number of unmeasurable variables (and some of these variables may themselves vary over the short time span of a test!). Nobody likes to talk about the margins of error in adaptive, personalized approaches (preferring instead to suggest that adaptivity leads to greater reliability), but margins of error will always be there. The real question is how wide these margins are.
      Thanks again, Philip
