
There’s an aspect of language learning which everyone agrees is terribly important, but no one can quite agree on what to call it. I’m talking about combinations of words, including fixed expressions, collocations, phrasal verbs and idioms. These combinations are relatively fixed and cannot always be predicted from their elements or generated by grammar rules (Laufer, 2022). They are sometimes referred to as formulaic sequences, formulaic expressions, lexical bundles or lexical chunks, among other terms for multiword items. They matter to English language learners because a large part of English consists of such combinations. Hill (2001) suggests this may be up to 70%. More conservative estimates report 58.6% of writing and 52.3% of speech (Erman & Warren, 2000). Some of these combinations (e.g. ‘of course’, ‘at least’) are so common that they fall into lists of the 1000 most frequent lexical items in the language.

By virtue of their ubiquity and frequency, they are important both for comprehension of reading and listening texts and for the speed at which texts can be processed. This is because knowledge of these combinations ‘makes discourse relatively predictable’ (Boers, 2020). Similarly, such knowledge can significantly contribute to spoken fluency because combinations ‘can be retrieved from memory as prefabricated units rather than being assembled at the time of speaking’ (Boers, 2020).

So far, so good, but from here on, the waters get a little muddier. Given their importance, what is the best way for a learner to acquire a decent stock of them? Are they best acquired through incidental learning (through meaning-focused reading and listening) or deliberate learning (e.g. with focused exercises or flashcards)? If the former, how on earth can we help learners to make sure that they get exposure to enough combinations enough times? If the latter, what kind of practice works best and, most importantly, which combinations should be selected? With, at the very least, many tens of thousands of such combinations, life is too short to learn them all in a deliberate fashion. Some sort of triage is necessary, but how should we go about this? Frequency of occurrence would be one obvious criterion, but this merely raises the question of what kind of database should be used to calculate frequency – the spoken discourse of children will reveal very different patterns from the written discourse of, say, applied linguists. On top of that, we cannot avoid consideration of the learners’ reasons for learning the language. If, as is statistically most probable, they are learning English to use as a lingua franca, how important or relevant is it to learn combinations that are frequent, idiomatic and comprehensible in native-speaker cultures, but may be rare and opaque in many English as a Lingua Franca contexts?

There are few, if any, answers to these big questions. Research (e.g. Pellicer-Sánchez, 2020) can give us pointers, but the bottom line is that we are left with a series of semi-informed options (see O’Keeffe et al., 2007: 58–99). So, when an approach comes along that claims to use software to facilitate the learning of English formulaic expressions (Lin, 2022), I am intrigued, to say the least.

The program is, slightly misleadingly, called IdiomsTube. A more appropriate title would have been IdiomaticityTube, as it focuses on ‘speech formulae, proverbs, sayings, similes, binomials, collocations, and so on’, but I guess ‘idioms’ is a more idiomatic word than ‘idiomaticity’. IdiomsTube allows learners to choose any English-captioned video from YouTube, which is then automatically analysed to identify from two to six formulaic expressions that are presented to the learner as learning objects. Learners are shown these items; the items are hyperlinked to (good) dictionary entries; learners watch the video and are then presented with a small variety of practice tasks. The system recommends particular videos, based on an automated analysis of their difficulty (speech rate and a frequency count of the lexical items they include) and on recommendations from previous users. The system is gamified and, for class use, teachers can track learner progress.

When an article by the program’s developer, Phoebe Lin, (in my view, more of an advertising piece than an academic one) came out in the ReCALL journal, she tweeted that she’d love feedback. I reached out but didn’t hear back. My response here is partly an evaluation of Dr Lin’s program, partly a reflection on how far technology can go in solving some of the knotty problems of language learning.

Incidental and deliberate learning

Researchers have long been interested in looking for ways of making incidental learning of lexical items more likely to happen (Boers, 2021: 39 ff.), of making it more likely that learners will notice lexical items while focusing on the content of a text. Most obviously, texts can be selected, written or modified so they contain multiple instances of a particular item (‘input flooding’). Alternatively, texts can be typographically enhanced so that particular items are highlighted in some way. But these approaches are not possible when learners are given the freedom to select any video from YouTube and when the written presentations are in the form of YouTube captions. Instead, IdiomsTube presents the items before the learner watches the video. They are, in effect, told to watch out for these items in advance. They are also given practice tasks after viewing.

The distinction between incidental and deliberate vocabulary learning is not always crystal-clear. In this case, it seems fairly clear that the approach is more slanted to deliberate learning, even though the selection of video by the learner is determined by a focus on content. Whether this works or not will depend on (1) the level-appropriacy of the videos that the learner watches, (2) the effectiveness of the program in recommending / identifying appropriate videos, (3) the ability of the program to identify appropriate formulaic expressions as learning targets in each video, and (4) the ability of the program to generate appropriate practice of these items.

Evaluating the level of YouTube videos

What makes a video easy or hard to understand? IdiomsTube attempts this analytical task by calculating (1) the speed of the speech and (2) the difficulty of the lexis as determined by the corpus frequency of these items. This gives a score out of five for each category (speed and difficulty). I looked at fifteen videos, all of which were recommended by the program. Most of the ones I looked at were scored at Speed #3 and Difficulty #1. One that I looked at, ‘Bruno Mars Carpool Karaoke’, had a speed of #2 and a difficulty of #1 (i.e. one of the easiest). The video is 15 minutes long. Here’s an extract from the first 90 seconds:

Let’s set this party off right, put yo’ pinky rings up to the moon, twenty four karat magic in the air, head to toe soul player, second verse for the hustlas, gangstas, bad bitches and ya ugly ass friends, I gotta show how a pimp get it in, and they waking up the rocket why you mad

Whoa! Without going into details, it’s clear that something has gone seriously wrong. Evaluating the difficulty of language, especially spoken language, is extremely complex (not least because there’s no objective measure of such a thing). It’s not completely dissimilar to the challenge of evaluating the accuracy, appropriacy and level of sophistication of a learner’s spoken language, and we’re a long way from being able to do that with any acceptable level of reliability. At least, we’re a long, long way from being able to do it well when there are no constraints on the kind of text (which is the case when taking the whole of YouTube as a potential source). If we significantly restrict topic and text type, we can train software to do a much better job, but this will require human input: it cannot be fully automated.
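
For illustration, here is a toy version of the kind of two-factor scoring described above (speech rate plus lexical frequency). Everything here, the function, the band boundaries, the top-2,000 cut-off, is my own invention, but it shows how out-of-list tokens (slang, names, song lyrics) can slip through and make a text look deceptively easy:

```python
# Hypothetical sketch of a two-factor difficulty estimate: words-per-minute
# for speed, corpus-frequency rank for lexical difficulty. Band boundaries
# are arbitrary, purely for illustration.

def estimate_difficulty(transcript, duration_minutes, freq_rank):
    """Return (speed_band, lexis_band), each 1 (easy) to 5 (hard)."""
    words = transcript.lower().split()
    wpm = len(words) / duration_minutes
    # Illustrative bands for speech rate: one band per 40 wpm.
    speed = min(5, max(1, int(wpm // 40)))
    # Share of tokens outside the 2,000 most frequent words.
    # Tokens missing from the frequency list (slang, proper names) are
    # silently skipped -- exactly the blind spot that can let rap lyrics
    # score as 'easy'.
    ranked = [w for w in words if w in freq_rank]
    if not ranked:
        return speed, 1
    hard_share = sum(1 for w in ranked if freq_rank[w] > 2000) / len(ranked)
    lexis = min(5, 1 + int(hard_share * 10))
    return speed, lexis
```

Note that the slang-heavy words in the extract above would simply be invisible to a scorer like this, which may be part of the explanation for the Bruno Mars rating.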

The length of these 15 videos ranged from 3:02 to 29:27, with a mean length of about 10 minutes and a median of 8:32. Too damn long.

Selecting appropriate learning items

The automatic identification of formulaic language in a text presents many challenges: it is, as O’Keeffe et al. (2007: 82) note, only partially possible. A starting point is usually a list, and IdiomsTube begins with a list of 53,635 items compiled by the developer (Lin, 2022) over a number of years. The software has to match word combinations in the text to items in the list, and has to recognise variant forms. Formulaic language cannot always be identified just by matching to lists of forms: a piece of cake may just be a piece of cake, and therefore not a piece of cake to analyse. 53,635 items may sound like a lot, but a common estimate of the number of idioms in English is 25,000, and the number of multiword units is much, much higher. 53,635 items is not going to be enough for any reliable capture.
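
To see why this is only partially possible, here is a minimal sketch of list-based matching, with a toy lemmatizer and a two-item list (both invented). It catches inflected variants, but it has no way of telling an idiomatic piece of cake from a literal one, which needs disambiguation well beyond form-matching:

```python
# Minimal list-based multiword matching: scan the text's n-grams and match
# them, in lemmatized form, against a tiny invented item list. Real systems
# need much more: POS tagging, discontinuous variants ('get right in the
# way'), and sense disambiguation.

LEMMAS = {"pieces": "piece", "got": "get", "gets": "get",
          "getting": "get", "ways": "way"}

ITEM_LIST = {("piece", "of", "cake"), ("get", "in", "the", "way")}

def lemma(word):
    return LEMMAS.get(word, word)

def find_items(text, max_len=4):
    """Return the set of list items whose (lemmatized) forms occur in text."""
    tokens = [lemma(w) for w in text.lower().split()]
    found = set()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram in ITEM_LIST:
                found.add(" ".join(gram))
    return found
```

A matcher like this would flag ‘a piece of cake’ in a recipe video just as happily as in an idiomatic context.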

Since any given text is likely to contain a lot of formulaic language, the next task is to decide how to select for presentation (i.e. as learning objects) from those identified. The challenge is, as Lin (2022) remarks, both technical and theoretical: how can frequency and learnability be measured? There are no easy answers, and the approach of IdiomsTube is, by its own admission, crude. The algorithm prioritises longer items that contain lower frequency single items, and which have a low frequency of occurrence in a corpus of 40,000 randomly-sampled YouTube videos. The aim is to focus on formulaic language that is ‘more challenging in terms of composition (i.e. longer and made up of more difficult words) and, therefore, may be easier to miss due to their infrequent appearance on YouTube’. My immediate reaction is to question whether this approach prioritises precisely those items that are not worth the bother of deliberate learning in the first place.
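
The selection heuristic, as I read Lin’s description, can be reduced to a toy scoring function. The weights and the multiplicative shape below are my own guesses, not the actual algorithm, but they capture the stated priorities: longer items, made of rarer words, with fewer corpus occurrences, score higher:

```python
# Toy prioritisation heuristic (invented weights, not Lin's actual code):
# longer items containing rarer words, and items rare in the corpus,
# rank higher and get picked first.

def priority(item, word_freq, item_freq):
    """Score one candidate multiword item; higher = more 'challenging'."""
    words = item.split()
    # Mean inverse frequency of the component words (rarer words -> higher).
    rarity = sum(1.0 / word_freq.get(w, 1) for w in words) / len(words)
    # Inverse corpus frequency of the whole item.
    infrequency = 1.0 / (1 + item_freq.get(item, 0))
    return len(words) * rarity * infrequency

def select_targets(candidates, word_freq, item_freq, k=6):
    """Pick the k highest-priority items (IdiomsTube presents 2 to 6)."""
    return sorted(candidates,
                  key=lambda it: priority(it, word_freq, item_freq),
                  reverse=True)[:k]
```

A scorer like this will reliably push the frequent, transparent, high-utility items (‘by the way’) to the bottom of the list, which is exactly my worry.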

The proof is in the proverbial pudding, so I looked at the learning items that were offered by my sample of 15 recommended videos. Sadly, IdiomsTube does not even begin to cut the mustard. The rest of this section details why the selection was so unsatisfactory: you may want to skip this and rejoin me at the start of the next section.

  • In total, 85 target items were suggested. Of these, 39 (just under half) were not fixed expressions: they were single items. Some of these single items (e.g. ‘blog’ and ‘password’) would be extremely easy for most learners. Of the others, 5 were opaque idioms (the most prototypical kind of idiom); the rest were collocations and fixed (but transparent) phrases and frames.
  • Some items (e.g. ‘I rest my case’) are limited in terms of the contexts in which they can be appropriately used.
  • Some items did not appear to be idiomatic in any way. ‘We need to talk’ and ‘able to do it’, for example, are strange selections, compared to others in their respective lists. They are also very ‘easy’: if you don’t readily understand items like these, you wouldn’t have a hope in hell of understanding the video.
  • There were a number of errors in the recommended target items. Errors included duplication of items within one set (‘get in the way’ + ‘get in the way of something’), misreading of an item (‘the shortest’ misread as ‘the shorts’), mislabelling of an item (‘vend’ instead of ‘vending machine’), linking to the wrong dictionary entry (e.g. ‘mini’ links to ‘miniskirt’, although in the video ‘mini’ = ‘small’, or, in another video, ‘stoke’ links to ‘stoked’, which is rather different!).
  • The selection of fixed expressions is sometimes very odd. In one video, the following items have been selected: get into an argument, vend, from the ground up, shovel, we need to talk, prefecture. The video contains others which would seem to be better candidates, including ‘You can’t tell’ (which appears twice), ‘in charge of’, ‘way too’ (which also appears twice), and ‘by the way’. It would seem, therefore, that some inappropriate items are selected, whilst other more appropriate ones are omitted.
  • There is a wide variation in the kind of target item. One set, for example, included: in order to do, friction, upcoming, run out of steam, able to do it, notification. Cross-checking with Pearson’s Global Scale of English, we have items ranging from A2 to C2+.

The challenges of automation

IdiomsTube comes unstuck on many levels. It fails to recommend appropriate videos to watch. It fails to suggest appropriate language to learn. It fails to provide appropriate practice. You wouldn’t know this from reading the article by Phoebe Lin in the ReCALL journal, which does, however, suggest that ‘further improvements in the design and functions of IdiomsTube are needed’. Necessary they certainly are, but the interesting question is how possible they are.

My interest in IdiomsTube comes from my own experience in an app project which attempted to do something not completely dissimilar. We wanted to be able to evaluate the idiomaticity of learner-generated language, and this entailed identifying formulaic patterns in a large corpus. We wanted to develop a recommendation engine for learning objects (i.e. the lexical items) by combining measures of frequency and learnability. We wanted to generate tasks to practise collocational patterns, by trawling the corpus for contexts that lent themselves to gapfills. With some of these challenges, we failed. With others, we found a stopgap solution in human curation, writing and editing.

IdiomsTube is interesting, not because of what it tells us about how technology can facilitate language learning. It’s interesting because it tells us about the limits of technological applications to learning, and about the importance of sorting out theoretical challenges before the technical ones. It’s interesting as a case study in how not to go about developing an app: its ‘special enhancement features such as gamification, idiom-of-the-day posts, the IdiomsTube Teacher’s interface and IdiomsTube Facebook and Instagram pages’ are pointless distractions when the key questions have not been resolved. It’s interesting as a case study of something that should not have been published in an academic journal. It’s interesting as a case study of how techno-enthusiasm can blind you to the possibility that some learning challenges do not have solutions that can be automated.


Boers, F. (2020) Factors affecting the learning of multiword items. In Webb, S. (Ed.) The Routledge Handbook of Vocabulary Studies. Abingdon: Routledge. pp. 143–157

Boers, F. (2021) Evaluating Second Language Vocabulary and Grammar Instruction. Abingdon: Routledge

Erman, B. & Warren, B. (2000) The idiom principle and the open choice principle. Text, 20 (1): pp. 29–62

Hill, J. (2001) Revising priorities: from grammatical failure to collocational success. In Lewis, M. (Ed.) Teaching Collocation: further development in the Lexical Approach. Hove: LTP. pp. 47–69

Laufer, B. (2022) Formulaic sequences and second language learning. In Szudarski, P. & Barclay, S. (Eds.) Vocabulary Theory, Patterning and Teaching. Bristol: Multilingual Matters. pp. 89–98

Lin, P. (2022) Developing an intelligent tool for computer-assisted formulaic language learning from YouTube videos. ReCALL 34 (2): pp. 185–200

O’Keeffe, A., McCarthy, M. & Carter, R. (2007) From Corpus to Classroom. Cambridge: Cambridge University Press

Pellicer-Sánchez, A. (2020) Learning single words vs. multiword items. In Webb, S. (Ed.) The Routledge Handbook of Vocabulary Studies. Abingdon: Routledge. pp. 158–173

I was intrigued to learn earlier this year that Oxford University Press had launched a new online test of English language proficiency, called the Oxford Test of English (OTE). At the conference where I first heard about it, I was struck by the fact that the presentation of the OUP sponsored plenary speaker was entitled ‘The Power of Assessment’ and dealt with formative assessment / assessment for learning. Oxford clearly want to position themselves as serious competitors to Pearson and Cambridge English in the testing business.

The brochure for the exam kicks off with a gem of a marketing slogan, ‘Smart. Smarter. SmarTest’ (geddit?), and the next few pages give us all the key information.

Faster and more flexible

‘Traditional language proficiency tests’ is presumably intended to refer to the main competition (Pearson and Cambridge English). Cambridge First takes, in total, 3½ hours; the Pearson Test of English Academic takes 3 hours. The OTE takes, in total, 2 hours and 5 minutes. It can be taken, in theory, on any day of the year, although this depends on the individual Approved Test Centres, and, again in theory, it can be booked as little as 14 days in advance. Results should take only two weeks to arrive. Further flexibility is offered in the way that candidates can pick ’n’ choose which of the four skills they want to be tested in, just one or all four, although, as an incentive to go the whole hog, they will only get a ‘Certificate of Proficiency’ if they do all four.

A further incentive to do all four skills at the same time can be found in the price structure. One centre in Spain is currently offering the test for one single skill at €41.50, but do the whole lot, and it will only set you back €89. For a high-stakes test, this is cheap. In the UK right now, both Cambridge First and Pearson Academic cost in the region of £150, and IELTS a bit more than that. So, faster, more flexible and cheaper … Oxford means business.

Individual experience

The ‘individual experience’ on the next page of the brochure is pure marketing guff. This is, after all, a high-stakes, standardised test. It may be true that ‘the Speaking and Writing modules provide randomly generated tasks, making the overall test different each time’, but there can only be a certain number of permutations. What’s more, in ‘traditional tests’, like Cambridge First, where there is a live examiner or two, an individualised experience is unavoidable.

More interesting to me is the reference to adaptive technology. According to the brochure, ‘The Listening and Reading modules are adaptive, which means the test difficulty adjusts in response to your answers, quickly finding the right level for each test taker. This means that the questions are at just the right level of challenge, making the test shorter and less stressful than traditional proficiency tests’.

My curiosity piqued, I decided to look more closely at the Reading module. I found one practice test online which is the same as the demo that is available at the OTE website. Unfortunately, this example is not adaptive: it is at B1 level. The actual test records scores between 51 and 140, corresponding to levels A2, B1 and B2.

Test scores

The tasks in the Reading module are familiar from coursebooks and other exams: multiple choice, multiple matching and gapped texts.

Reading tasks

According to the exam specifications, these tasks are designed to measure the following skills:

  • Reading to identify main message, purpose, detail
  • Expeditious reading to identify specific information, opinion and attitude
  • Reading to identify text structure, organizational features of a text
  • Reading to identify attitude / opinion, purpose, reference, the meanings of words in context, global meaning

The ability to perform these skills depends, ultimately, on the candidate’s knowledge of vocabulary and grammar, as can be seen in the examples below.

[Screenshots: Task 1 and Task 2]

How exactly, I wonder, does the test difficulty adjust in response to the candidate’s answers? The algorithm that is used depends on measures of the difficulty of the test items. If these items are to be made harder or easier, the only significant way that I can see of doing this is by making the key vocabulary lower- or higher-frequency. This, in turn, is only possible if vocabulary and grammar have been tagged as being at a particular level. The best-known tools for doing this have been developed by Pearson (with the GSE Teacher Toolkit) and Cambridge English Profile. To the best of my knowledge, Oxford does not yet have a tool of this kind (at least, none that is publicly available). However, the data that OUP will accumulate from OTE scripts and recordings will be invaluable in building a database which their lexicographers can use in developing such a tool.
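
To make the discussion concrete, here is the simplest mechanism the brochure’s description would be consistent with: a staircase that nudges an ability estimate up after a correct answer and down after a wrong one, then serves the item whose pre-calibrated difficulty is closest. This is my own hypothetical sketch, not OUP’s algorithm, and note that it presupposes exactly the calibrated difficulty values whose source I am questioning:

```python
# Hypothetical staircase adaptivity. The hard part -- assigning each item
# a difficulty on the same scale as ability -- is simply assumed here.

def next_item(items, ability):
    """items: dict of item_id -> calibrated difficulty. Serve the closest."""
    return min(items, key=lambda i: abs(items[i] - ability))

def update_ability(ability, correct, step=10):
    """Nudge the running ability estimate after each answer."""
    return ability + step if correct else ability - step
```

Even this crude version makes the dependency obvious: without a level-tagged item bank (a GSE- or English-Profile-style resource), there is nothing for the staircase to climb.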

Even when a data-driven (and numerically precise) tool is available for modifying the difficulty of test items, I still find it hard to understand how the adaptivity will impact on the length or the stress of the reading test. The Reading module is only 35 minutes long and contains only 22 items. Anything that is significantly shorter must surely impact on the reliability of the test.

My conclusion from this is that the adaptive element of the Reading and Listening modules in the OTE is less important to the test itself than it is to building a sophisticated database (not dissimilar to the GSE Teacher Toolkit or Cambridge English Profile). The value of this will be found, in due course, in calibrating all OUP materials. The OTE has already been aligned to the Oxford Online Placement Test (OOPT) and, presumably, coursebooks will soon follow. This, in turn, will facilitate a vertically integrated business model, like Pearson and CUP, where everything from placement test, to coursework, to formative assessment, to final proficiency testing can be on offer.

Having spent a lot of time recently looking at vocabulary apps, I decided to put together a Christmas wish list of the features of my ideal vocabulary app. The list is not exhaustive and I’ve given more attention to some features than others. What (apart from testing) have I missed out?

1             Spaced repetition

Since the point of a vocabulary app is to help learners memorise vocabulary items, it is hard to imagine a decent system that does not incorporate spaced repetition. Spaced repetition algorithms offer one well-researched way of working against the brain’s ‘forgetting curve’. These algorithms come in different shapes and sizes, and I am not technically competent to judge which is the most efficient. However, as Peter Ellis Jones, the developer of a flashcard system called CardFlash, points out, efficiency is only one half of the rote memorisation problem. If you are not motivated to learn, the cleverness of the algorithm is moot. Fundamentally, learning software needs to be fun, rewarding, and give a solid sense of progression.
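
For the curious, here is a compact sketch of the best-known of these algorithms, SM-2, the scheme behind Anki and many other flashcard systems. This is my simplified rendering, not any particular app’s code; `quality` is the learner’s 0–5 self-rating of how well they recalled the item:

```python
# Simplified SM-2 spaced repetition. Each review returns the next interval
# in days, the updated repetition count, and the updated 'ease' factor.

def sm2(quality, reps, interval, ease=2.5):
    """One review of one card: quality 0-5, reps so far, current interval."""
    if quality < 3:                      # failed: relearn from scratch
        return 1, 0, ease
    # Good answers raise ease slightly; hesitant ones lower it (floor 1.3).
    ease = max(1.3, ease + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    if reps == 0:
        interval = 1                     # first success: see it tomorrow
    elif reps == 1:
        interval = 6                     # second success: in about a week
    else:
        interval = round(interval * ease)  # then grow multiplicatively
    return interval, reps + 1, ease
```

The intervals grow roughly geometrically for well-known items, which is what lets a deck of thousands of cards stay manageable.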

2             Quantity, balance and timing of new and ‘old’ items

A spaced repetition algorithm determines the optimum interval between repetitions, but further algorithms will be needed to determine when and with what frequency new items will be added to the deck. Once a system knows how many items a learner needs to learn and the time in which they have to do it, it is possible to determine the timing and frequency of the presentation of new items. But the system cannot know in advance how well an individual learner will learn the items (for any individual, some items will be more readily learnable than others) nor the extent to which learners will live up to their own positive expectations of time spent on-app. As most users of flashcard systems know, it is easy to fall behind, feel swamped and, ultimately, give up. An intelligent system needs to be able to respond to individual variables in order to ensure that the learning load is realistic.
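
As a sketch of what ‘responding to individual variables’ might mean in practice, a scheduler could throttle the introduction of new items whenever the review backlog grows, so the learner never faces the swamped-and-give-up scenario. The function and its numbers are purely illustrative:

```python
# Illustrative load-aware scheduler: new cards are only introduced with
# whatever capacity today's due reviews leave free.

def new_items_today(due_reviews, daily_review_capacity=50, max_new=10):
    """How many new items to introduce today, given the due-review backlog."""
    spare = daily_review_capacity - due_reviews
    # Use only half the spare capacity: each new item generates its own
    # future reviews, so keep headroom for tomorrow.
    return max(0, min(max_new, spare // 2))
```

A learner who falls behind automatically gets a new-item freeze until the backlog clears, rather than an ever-growing mountain.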

3             Task variety

A standard flashcard system which simply asks learners to indicate whether they ‘know’ a target item before they flip over the card rapidly becomes extremely boring. A system which tests this knowledge soon becomes equally dull. There needs to be a variety of ways in which learners interact with an app, both for reasons of motivation and learning efficiency. It may be the case that, for an individual user, certain task types lead to more rapid gains in learning. An intelligent, adaptive system should be able to capture this information and modify the selection of task types.
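
One hypothetical way to ‘capture this information’ is a simple epsilon-greedy bandit over task types: mostly serve the type with the best success rate so far for this learner, occasionally explore the others. No app mentioned in this post is known to work this way; this is just a sketch of the idea:

```python
# Epsilon-greedy selection over task types. stats maps each task type to
# (successes, attempts) for one learner.

import random

def pick_task(stats, epsilon=0.1, rng=random):
    """Exploit the best-performing task type, exploring with prob epsilon."""
    if rng.random() < epsilon or not any(a for _, a in stats.values()):
        return rng.choice(list(stats))          # explore (or cold start)
    return max(stats, key=lambda t: stats[t][0] / max(1, stats[t][1]))

def record(stats, task, success):
    """Update the running success/attempt counts after one interaction."""
    s, a = stats[task]
    stats[task] = (s + (1 if success else 0), a + 1)
```

Whether success at a task type actually indicates better *learning* (rather than just an easier task) is, of course, a theoretical question the bandit cannot answer by itself.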

Most younger learners and some adult learners will respond well to the inclusion of games within the range of task types. Examples of such games include the puzzles developed by Oliver Rose in his Phrase Maze app to accompany Quizlet practice.

[Screenshots: Phrase Maze 1 and Phrase Maze 2]

4             Generative use

Memory researchers have long known about the ‘Generation Effect’ (see for example this piece of research from the Journal of Verbal Learning and Verbal Behavior, 1978). Items are better learnt when the learner has to generate, in some (even small) way, the target item, rather than simply reading it. In vocabulary learning, this could be, for example, typing in the target word or, more simply, inserting some missing letters. Systems which incorporate task types that require generative use are likely to result in greater learning gains than simple, static flashcards with target items on one side and definitions or translations on the other.
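
A generative prompt of the missing-letters kind is trivial to construct. A minimal sketch (the function names are mine):

```python
# Build a missing-letters prompt so the learner must reconstruct the
# target item rather than just read it.

def mask_word(word, keep=2):
    """Keep the first `keep` letters, replace the rest with underscores."""
    return word[:keep] + "_" * (len(word) - keep)

def gap_sentence(sentence, target, keep=2):
    """Blank out the target item inside a carrier sentence."""
    return sentence.replace(target, mask_word(target, keep))
```

Even this small amount of productive effort, the research suggests, beats passively flipping a card.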

5             Receptive and productive practice

The most basic digital flashcard systems require learners to understand a target item, or to generate it from a definition or translation prompt. Valuable as this may be, it won’t help learners much to use these items productively, since these systems focus exclusively on meaning. In order to do this, information must be provided about collocation, colligation, register, etc and these aspects of word knowledge will need to be focused on within the range of task types. At the same time, most vocabulary apps that I have seen focus primarily on the written word. Although any good system will offer an audio recording of the target item, and many will offer the learner the option of recording themselves, learners are invariably asked to type in their answers, rather than say them. For the latter, speech recognition technology will be needed. Ideally, too, an intelligent system will compare learner recordings with the audio models and provide feedback in such a way that the learner is guided towards a closer reproduction of the model.

6             Scaffolding and feedback

Most flashcard systems are basically low-stakes, practice self-testing. Research (see, for example, Dunlosky et al.’s metastudy ‘Improving Students’ Learning With Effective Learning Techniques: Promising Directions From Cognitive and Educational Psychology’) suggests that, as a learning strategy, practice testing has high utility – indeed, of higher utility than other strategies like keyword mnemonics or highlighting. However, an element of tutoring is likely to enhance practice testing, and, for this, scaffolding and feedback will be needed. If, for example, a learner is unable to produce a correct answer, they will probably benefit from being guided towards it through hints, in the same way as a teacher would elicit in a classroom. Likewise, feedback on why an answer is wrong (as opposed to simply being told that you are wrong), followed by encouragement to try again, is likely to enhance learning. Such feedback might, for example, point out that there is perhaps a spelling problem in the learner’s attempted answer, that the attempted answer is in the wrong part of speech, or that it is semantically close to the correct answer but does not collocate with other words in the text. The incorporation of intelligent feedback of this kind will require a number of NLP tools, since it will never be possible for a human item-writer to anticipate all the possible incorrect answers. A current example of intelligent feedback of this kind can be found in the Oxford English Vocabulary Trainer app.
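
The first layer of such feedback, distinguishing a probable spelling slip from a plain wrong answer, can be approximated with nothing more than string similarity. The threshold and the messages below are invented; real systems would add part-of-speech and collocation checks on top, which is where the proper NLP tools come in:

```python
# Classify a wrong answer as a likely spelling slip when it is very
# similar to the key, and only then fall back to generic feedback.

from difflib import SequenceMatcher

def feedback(attempt, key, threshold=0.8):
    """Return a feedback message for one attempted answer."""
    if attempt == key:
        return "correct"
    similarity = SequenceMatcher(None, attempt.lower(), key.lower()).ratio()
    if similarity >= threshold:
        return "close -- check your spelling"
    return "not quite -- try again"
```

This is the cheap end of the spectrum; recognising that an answer is a well-formed word in the wrong part of speech, or a non-collocating near-synonym, is a much harder problem.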

7             Content

At the very least, a decent vocabulary app will need good definitions and translations (how many different languages?), and these will need to be tagged to the senses of the target items. These will need to be supplemented with all the other information that you find in a good learner’s dictionary: syntactic patterns, collocations, cognates, an indication of frequency, etc. The only way of getting this kind of high-quality content is by paying to license it from a company with expertise in lexicography. It doesn’t come cheap.

There will also need to be example sentences, both to illustrate meaning / use and for deployment in tasks. Dictionary databases can provide some of these, but they cannot be relied on as a source. This is because the example sentences in dictionaries have been selected and edited to accompany the other information provided in the dictionary, and not as items in practice exercises, which have rather different requirements. Once more, the solution doesn’t come cheap: experienced item writers will be needed.

Dictionaries describe and illustrate how words are typically used. But examples of typical usage tend to be as dull as they are forgettable. Learning is likely to be enhanced if examples are cognitively salient: weird examples with odd collocations, for example. Another thing for the item writers to think about.

A further challenge for an app which is not level-specific is that both the definitions and example sentences need to be level-specific. An A1 / A2 learner will need the kind of content that is found in, say, the Oxford Essential dictionary; B2 learners and above will need content from, say, the OALD.

8             Artwork and design

It’s easy enough to find artwork or photos of concrete nouns, but try to find or commission a pair of pictures that differentiate, for example, the adjectives ‘wild’ and ‘dangerous’ … What kind of pictures might illustrate simple verbs like ‘learn’ or ‘remember’? Will such illustrations be clear enough when squeezed into a part of a phone screen? Animations or very short video clips might provide a solution in some cases, but these are more expensive to produce and video files are much heavier.

With a few notable exceptions, such as the British Council’s MyWordBook 2, design in vocabulary apps has been largely forgotten.

9             Importable and personalisable lists

Many learners will want to use a vocabulary app in association with other course material (e.g. coursebooks). Teachers, however, will inevitably want to edit these lists, deleting some items, adding others. Learners will want to do the same. This is a huge headache for app designers. If new items are going to be added to word lists, how will the definitions, example sentences and illustrations be generated? Will the database contain audio recordings of these words? How will these items be added to the practice tasks (if these include task types that go beyond simple double-sided flashcards)? NLP tools are not yet good enough to trawl a large corpus in order to select (and possibly edit) sentences that illustrate the right meaning and which are appropriate for interactive practice exercises. We can personalise the speed of learning and even the types of learning tasks, so long as the target language is predetermined. But as soon as we allow for personalisation of content, we run into difficulties.
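
To give a flavour of the sentence-selection problem, here is a crude, GDEX-style filter for picking an example sentence from a corpus. The heuristics, the tiny ‘known words’ list and the length bounds are all my own illustrative guesses, and they fall far short of what real example selection needs:

```python
# Crude GDEX-style example selection: prefer corpus sentences that contain
# the target, are a comfortable length, and avoid rare words. The COMMON
# set stands in for a real frequency list.

COMMON = {"the", "a", "i", "we", "was", "is", "to", "of", "and", "in",
          "it", "every", "morning", "drink", "he", "she"}

def sentence_score(sentence, target, min_len=5, max_len=15):
    """Score a candidate example sentence for the given target word."""
    words = sentence.lower().rstrip(".!?").split()
    if target not in words:
        return 0.0
    if not (min_len <= len(words) <= max_len):
        return 0.0
    known = sum(1 for w in words if w in COMMON or w == target)
    return known / len(words)          # share of easy / on-target words

def best_example(corpus, target):
    """Return the highest-scoring sentence in the corpus."""
    return max(corpus, key=lambda s: sentence_score(s, target))
```

Notice what this cannot do: it cannot tell which *sense* of the target the sentence illustrates, which is precisely why automatic selection from a large corpus keeps failing and human editors keep being needed.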

10          Gamification

Maintaining motivation to use a vocabulary app is not easy. Gamification may help. Measuring progress against objectives will be a start. Stars and badges and leaderboards may help some users. Rewards may help others. But gamification features need to be built into the heart of the system, into the design and selection of tasks, rather than simply tacked on as an afterthought. They need to be trialled and tweaked, so analytics will be needed.

11          Teacher support

Although the use of vocabulary flashcards is beginning to catch on with English language teachers, teachers need help with ways to incorporate them in the work they do with their students. What can teachers do in class to encourage use of the app? In what ways does app use require teachers to change their approach to vocabulary work in the classroom? Reporting functions can help teachers know about the progress their students are making and provide very detailed information about words that are causing problems. But, as anyone involved in platform-based course materials knows, teachers need a lot of help.

12          And, of course, …

Apps need to be usable with different operating systems. Ideally, they should be (partially) usable offline. Loading times need to be short. They need to be easy and intuitive to use.

It’s unlikely that I’ll be seeing a vocabulary app with all of these features any time soon. Or, possibly, ever. The cost of developing something that could do all this would be extremely high, and there is no indication that there is a market that would be ready to pay the sort of prices that would be needed to cover the costs of development and turn a profit. We need to bear in mind, too, the fact that vocabulary apps can only ever assist in the initial acquisition of vocabulary: apps alone can’t solve the vocabulary learning problem (despite the silly claims of some app developers). The need for meaningful communicative use, extensive reading and listening, will not go away because a learner has been using an app. So, how far can we go in developing better and better vocabulary apps before users decide that a cheap / free app, with all its shortcomings, is actually good enough?

I posted a follow up to this post in October 2016.

There are a number of reasons why we sometimes need to describe a person’s language competence using a single number. Most of these are connected to the need for a shorthand to differentiate people, in summative testing or in job selection, for example. Numerical (or grade) allocation of this kind is so common (and especially in times when accountability is greatly valued) that it is easy to believe that this number is an objective description of a concrete entity, rather than a shorthand description of an abstract concept. In the process, the abstract concept (language competence) becomes reified and there is a tendency to stop thinking about what it actually is.

Language is messy. It’s a complex, adaptive system of communication which has a fundamentally social function. As Diane Larsen-Freeman and others have argued, ‘patterns of use strongly affect how language is acquired, is used, and changes. These processes are not independent of one another but are facets of the same complex adaptive system. […] The system consists of multiple agents (the speakers in the speech community) interacting with one another [and] the structures of language emerge from interrelated patterns of experience, social interaction, and cognitive mechanisms.’

As such, competence in language use is difficult to measure. There are ways of capturing some of it. Think of the pages and pages of competency statements in the Common European Framework, but there has always been something deeply unsatisfactory about documents of this kind. How, for example, are we supposed to differentiate, exactly and objectively, between, say, ‘can participate fully in an interview’ (C1) and ‘can carry out an effective, fluent interview’ (B2)? The short answer is that we can’t. There are too many of these descriptors anyway and, even if we did attempt to use such a detailed tool to describe language competence, we would still be left with a very incomplete picture. There is at least one whole book devoted to attempts to test the untestable in language education (edited by Amos Paran and Lies Sercu, Multilingual Matters, 2010).

So, here is another reason why we are tempted to use shorthand numerical descriptors (such as A1, A2, B1, etc.) to describe something which is very complex and abstract (‘overall language competence’) and to reify this abstraction in the process. From there, it is a very short step to making things even more numerical, more scientific-sounding. Number-creep in recent years has brought us the Pearson Global Scale of English which can place you at a precise point on a scale from 10 to 90. Not to be outdone, Cambridge English Language Assessment now has a scale that runs from 80 points to 230, although Cambridge does, at least, allocate individual scores for four language skills.

As the title of this post suggests (in its reference to Stephen Jay Gould’s The Mismeasure of Man), I am suggesting that there are parallels between attempts to measure language competence and the sad history of attempts to measure ‘general intelligence’. Both are guilty of the twin fallacies of reification and ranking – the ordering of complex information as a gradual ascending scale. These conceptual fallacies then lead us, through the way that they push us to think about language, into making further conceptual errors about language learning. We start to confuse language testing with the ways that language learning can be structured.

We begin to granularise language. We move inexorably away from difficult-to-measure hazy notions of language skills towards what, on the surface at least, seem more readily measurable entities: words and structures. We allocate to them numerical values on our testing scales, so that an individual word can be deemed to be higher or lower on the scale than another word. And then we have a syllabus, a synthetic syllabus, that lends itself to digital delivery and adaptive manipulation. We find ourselves in a situation where materials writers for Pearson, writing for a particular ‘level’, are only allowed to use vocabulary items and grammatical structures that correspond to that ‘level’. We find ourselves, in short, in a situation where the acquisition of a complex and messy system is described as a linear, additive process. Here’s an example from the Pearson website: ‘If you score 29 on the scale, you should be able to identify and order common food and drink from a menu; at 62, you should be able to write a structured review of a film, book or play. And because the GSE is so granular in nature, you can conquer smaller steps more often; and you are more likely to stay motivated as you work towards your goal.’ It’s a nonsense, a nonsense that is dictated by the needs of testing and adaptive software, but the sciency-sounding numbers help to hide the conceptual fallacies that lie beneath.

Perhaps, though, this doesn’t matter too much for most language learners. In the early stages of language learning (where most language learners are to be found), there are countless millions of people who don’t seem to mind the granularised programmes of Duolingo or Rosetta Stone, or the Grammar McNuggets of coursebooks. In these early stages, anything seems to be better than nothing, and the testing is relatively low-stakes. But as a learner’s interlanguage becomes more complex, and as the language she needs to acquire becomes more complex, attempts to granularise it and to present it in a linearly additive way become more problematic. It is for this reason, I suspect, that the appeal of granularised syllabuses declines so rapidly the more progress a learner makes. It comes as no surprise that, the further up the scale you get, the more that both teachers and learners want to get away from pre-determined syllabuses in coursebooks and software.

Adaptive language learning software is continuing to gain traction in the early stages of learning, in the initial acquisition of basic vocabulary and structures and in coming to grips with a new phonological system. It will almost certainly gain even more. But the challenge for the developers and publishers will be to find ways of making adaptive learning work for more advanced learners. Can it be done? Or will the mismeasure of language make it impossible?

FluentU, busuu, Bliu Bliu … what is it with all the ‘u’s? Hong-Kong based FluentU used to be called FluentFlix, but they changed their name a while back. The service for English learners is relatively new. Before that, they focused on Chinese, where the competition is much less fierce.

At the core of FluentU is a collection of short YouTube videos, which are sorted into 6 levels and grouped into 7 topic categories. The videos are accompanied by transcriptions. As learners watch a video, they can click on any word in the transcript. This will temporarily freeze the video and show a pop-up which offers a definition of the word, information about part of speech, a couple of examples of this word in other sentences, and more example sentences of the word from other videos that are linked on FluentU. These can, in turn, be clicked on to bring up a video collage of these sentences. Learners can click on an ‘Add to Vocab’ button, which will add the word to personalised vocabulary lists. These are later studied through spaced repetition.
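The spaced repetition element, at least, is well understood. A minimal Leitner-style scheduler might look like the following sketch (the intervals are illustrative, and FluentU has not, as far as I know, published the details of its own algorithm):

```python
from datetime import date, timedelta

# Review intervals (in days) for each Leitner box; illustrative values only.
INTERVALS = [1, 2, 4, 8, 16]

class Card:
    def __init__(self, word):
        self.word = word
        self.box = 0                      # every card starts in the first box
        self.due = date.today()

    def review(self, correct, today=None):
        """Move the card up a box on success, back to box 0 on failure,
        then reschedule it according to the new box's interval."""
        today = today or date.today()
        self.box = min(self.box + 1, len(INTERVALS) - 1) if correct else 0
        self.due = today + timedelta(days=INTERVALS[self.box])

def due_cards(cards, today=None):
    """All cards whose review date has arrived."""
    today = today or date.today()
    return [c for c in cards if c.due <= today]
```

The scheduling itself is trivial; the value of any such system stands or falls on what counts as a 'correct' review, which is where the FluentU task types discussed below come in.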

FluentU describes its approach in the following terms: ‘FluentU selects the best authentic video content from the web, and provides the scaffolding and support necessary to bring that authentic content within reach for your students.’ It seems appropriate, therefore, to look first at the nature of that content. At the moment, there appear to be just under 1,000 clips which are allocated to levels as follows:

Newbie: 123
Elementary: 138
Intermediate: 294
Upper Intermediate: 274
Advanced: 111
Native: 40

It has to be assumed that the amount of content will continue to grow, but, for the time being, it’s not unreasonable to say that there isn’t a lot there. I looked at the Upper Intermediate level where the shortest was 32 seconds long, the longest 4 minutes 34 seconds, but most were between 1 and 2 minutes. That means that there is the equivalent of about 400 minutes (say, 7 hours) for this level.

The actual amount that anyone would want to watch / study can be seen to be significantly less when the topics are considered. These break down as follows:

Arts & entertainment: 105
Business: 34
Culture: 29
Everyday life: 60
Health & lifestyle: 28
Politics & society: 6
Science & tech: 17

The screenshots below give an idea of the videos on offer:


I may be a little difficult, but there wasn’t much here that appealed. Forget the movie trailers for crap movies, for a start. Forget the low level business stuff, too. ‘The History of New Year’s Resolutions’ looked promising, but turned out to be a Wikipedia style piece. FluentU certainly doesn’t have the eye for interesting, original video content of someone like Jamie Keddie or Kieran Donaghy.

But, perhaps, the underwhelming content is of less importance than what you do with it. After all, if you’re really interested in content, you can just go to YouTube and struggle through the transcriptions on your own. The transcripts can be downloaded as pdfs, which, strangely, are marked with a FluentU copyright notice. FluentU doesn’t need to own the copyright of the videos, because they just provide links, but claiming copyright for someone else’s script seemed questionable to me. Anyway, the only real reason to be on this site is to learn some vocabulary. How well does it perform?


Level is self-selected. It wasn’t entirely clear how videos had been allocated to level, but I didn’t find any major discrepancies between FluentU’s allocation and my own, intuitive grading of the content. Clicking on words in the transcript, the look-up / dictionary function wasn’t too bad, compared to some competing products I have looked at. The system could deal with some chunks and phrases (e.g. at your service, figure out) and the definitions were appropriate to the way these had been used in context. The accuracy was far from consistent, though. Some definitions were harder than the word they were explaining (e.g. telephone = an instrument used to call someone) and some were plain silly (e.g. the definition of I is me).

Some chunks were not recognised, so definitions were amusingly wonky. Come out, get through and have been were all wrong. For the phrase talk her into it, the program didn’t recognise the phrasal verb, and offered me communicate using speech for talk, and to the condition, state or form of for into.

For many words, there are pictures to help you with the meaning, but you wonder about some of them, e.g. the picture of someone clutching a suitcase to illustrate the meaning of of, or a woman holding up a finger and thumb to illustrate the meaning of what (as a pronoun).

The example sentences don’t seem to be graded in any way and are not always useful. The example sentences for of, for example, are The pages of the book are ripped, the lemurs of Madagascar and what time of day are you free. Since the definition is given as belonging to, there seems to be a problem with, at least, the last of these examples!

With the example sentence that link you to other video examples of this word being used, I found that it took a long time to load … and it really wasn’t worth waiting for.

After a catalogue of problems like this, you might wonder how I can say that this function wasn’t too bad, but I’ve seen a lot worse. It was, at least, mostly accurate.

Moving away from the ‘Watch’ options, I explored the ‘Learn’ section. Bearing in mind that I had described myself as ‘Upper Intermediate’, I was surprised to be offered the following words for study: Good morning, may, help, think, so. This then took me to the following screen:

I was getting increasingly confused. After watching another video, I could practise some of the words I had highlighted, but, again, I wasn’t sure quite what was going on. There was a task that asked me to ‘pick the correct translation’, but this was, in fact, a multiple-choice dictation task.

Next, I was asked to study the meaning of the word in, followed by an unhelpful gap-fill task:

Confused? I was. I decided to look for something a little more straightforward, and clicked on a menu of vocabulary flash cards that I could import. These included sets based on copyright material from both CUP and OUP, and I wondered what these publishers might think of their property being used in this way.

FluentU claims that it is based on the following principles:

  1. Individualized scaffolding: FluentU makes language learning easy by teaching new words with vocabulary students already know.
  2. Mastery Learning: FluentU sets students up for success by making sure they master the basics before moving on to more advanced topics.
  3. Gamification: FluentU incorporates the latest game design mechanics to make learning fun and engaging.
  4. Personalization: Each student’s FluentU experience is unlike anyone else’s. Video clips, examples, and quizzes are picked to match their vocabulary and interests.

The ‘individualized scaffolding’ is no more than common sense, dressed up in sciency-sounding language. The reference to ‘Mastery Learning’ is opaque, to say the least, with some confusion between language features and topic. The gamification is rudimentary, and the personalization is pretty limited. It doesn’t come cheap, either.


In the words of its founder and CEO, self-declared ‘visionary’ Claudio Santori, Bliu Bliu is ‘the only company in the world that teaches languages we don’t even know’. This claim, which was made during a pitch for funding in October 2014, tells us a lot about the Bliu Bliu approach. It assumes that there exists a system by which all languages can be learnt / taught, and the particular features of any given language are not of any great importance. It’s questionable, to say the least, and Santori fails to inspire confidence when he says, in the same pitch, ‘you join Bliu Bliu, you use it, we make something magical, and after a few weeks you can understand the language’.

The basic idea behind Bliu Bliu is that a language is learnt by using it (e.g. by reading or listening to texts), but that the texts need to be selected so that you know the great majority of words within them. The technological challenge, therefore, is to find (online) texts that contain the vocabulary that is appropriate for you. After that, Santori explains, ‘you progress, you input more words and you will get more text that you can understand. Hours and hours of conversations you can fully understand and listen. Not just stupid exercise from stupid grammar book. Real conversation. And in all of them you know 100% of the words. […] So basically you will have the same opportunity that a kid has when learning his native language. Listen hours and hours of native language being naturally spoken at you…at a level he/she can understand plus some challenge, everyday some more challenge, until he can pick up words very very fast’ (sic).
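Stripped of the magic, the mechanism Santori describes — only serve texts whose running words the learner (claims to) know — is trivial to sketch. In this toy version (mine, not Bliu Bliu's actual code; the 95% threshold comes from the research literature on comprehensible reading, not from anything the company has published), the hard problems are all hidden inside the `known_words` set and in the assumption that word-level coverage equals comprehension:

```python
def coverage(text, known_words):
    """Proportion of the running words in `text` that the learner knows."""
    tokens = [w.strip('.,!?;:"\'').lower() for w in text.split()]
    tokens = [t for t in tokens if t]
    return sum(t in known_words for t in tokens) / len(tokens) if tokens else 0.0

def select_texts(texts, known_words, threshold=0.95):
    """Return only the texts whose known-word coverage meets the threshold."""
    return [t for t in texts if coverage(t, known_words) >= threshold]
```

Note that the Alexander Pope line quoted below scores 100% coverage for anyone who knows a few hundred words of English, which is precisely the problem: knowing every word of a text is not the same as understanding it.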


On entering the site, you are invited to take a test. In this, you are shown a series of words and asked to say if you find them ‘easy’ or ‘difficult’. There were 12 words in total, and each time I clicked ‘easy’. The system then tells you how many words it thinks you know, and offers you one or more words to click on. Here are the words I was presented with and, to the right, the number of words that Bliu Bliu thinks I know, after clicking ‘easy’ on the preceding word.

hello 4145
teenager 5960
soap, grape 7863
receipt, washing, skateboard 9638
motorway, tram, luggage, footballer, weekday 11061


Finally, I was asked about my knowledge of other languages. I said that my French was advanced and that my Spanish and German were intermediate. On the basis of this answer, I was now told that Bliu Bliu thinks that I know 11,073 words.

Eight of the words in the test are starred in the Macmillan dictionaries, meaning they are within the most frequent 7,500 words in English. Of the other four, skateboard, footballer and tram are very international words. The last, weekday, is a readily understandable compound made up of two extremely high frequency words. How could Bliu Bliu know, with such uncanny precision, that I know 11,073 words from a test like this? I decided to try the test for French. Again, I clicked ‘easy’ for each of the twelve words that was offered. This time, I was offered a very different set of words, with low frequency items like polynôme, toponymie, diaspora, vectoriel (all of which are cognate with English words), along with the rather surprising vichy (which should have had a capital letter, as it is a proper noun). Despite finding all these words easy, I was mortified to be told that I only knew 6,546 words in French.
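For comparison, serious vocabulary-size tests work by sampling several words from each frequency band of the language and extrapolating the proportion known to the whole band. Even a toy version of that standard approach (the bands and samples below are invented for illustration) makes it obvious that twelve mostly high-frequency words cannot support a figure as precise as 11,073:

```python
def estimate_vocab_size(bands, knows):
    """Estimate vocabulary size from frequency-band sampling.

    `bands` is a sequence of (band_size, sample_words) pairs, where each
    sample is drawn from a band representing `band_size` words of the
    language. Each band contributes its size scaled by the proportion of
    its sample the learner knew -- so the precision of the estimate
    depends entirely on how many words per band are actually sampled.
    """
    total = 0.0
    for band_size, sample in bands:
        if sample:
            total += band_size * sum(map(knows, sample)) / len(sample)
    return round(total)
```

With only two or three sampled words per thousand-word band, the estimate can only move in jumps of several hundred words at a time, which makes Bliu Bliu's five-digit precision look even stranger.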

I needn’t have bothered with the test, anyway. Irrespective of level, you are offered vocabulary sets of high frequency words. Examples of sets I was offered included [the, be, of, and, to], [way, state, say, world, two], [may, man, hear, said, call] and [life, down, any, show, t]. Bliu Bliu then gives you a series of short texts that include the target words. You can click on any word you don’t know and you are given either a definition or a translation (I opted for French translations). There is no task beyond simply reading these texts. Putting aside for the moment the question of why I was being offered these particular words when my level is advanced, how does the software perform?

The vast majority of the texts are short quotes, and here is the first problem. Quotes tend to be pithy and often play with words: their comprehensibility is not always a function of the frequency of the words they contain. For the word ‘say’, for example, the texts included the Shakespearean quote It will have blood, they say; blood will have blood. For the word ‘world’, I was offered this line from Alexander Pope: The world forgetting, by the world forgot. Not, perhaps, the best way of learning a couple of very simple, high-frequency words. But this was the least of the problems.

The system operates on a word level. It doesn’t recognise phrases or chunks, or even phrasal verbs. So, a word like ‘down’ (in one of the lists above) is presented without consideration of its multiple senses. The first set of sentences I was asked to read for ‘down’ included: I never regretted what I turned down, You get old, you slow down, I’m Creole, and I’m down to earth, I never fall down. I always fight, I like seeing girls throw down and I don’t take criticism lying down. Not exactly the best way of getting to grips with the word ‘down’ if you don’t know it!

You may have noticed the inclusion of the word ‘t’ in one of the lists above. Here are the example sentences for practising this word: (1) Knock the ‘t’ off the ‘can’t’, (2) Sometimes reality T.V. can be stressful, (3) Argentina Debt Swap Won’t Avoid Default, (4) OK, I just don’t understand Nethanyahu, (5) Venezuela: Hell on Earth by Walter T Molano and (6) Work will win when wishy washy wishing won t. I paid €7.99 for one month of this!

The translation function is equally awful. With high frequency words with multiple meanings, you get a long list of possible translations, but no indication of which one is appropriate for the context you are looking at. With other words, it is sometimes, simply, wrong. For example, in the sentence, Heaven lent you a soul, Earth will lend a grave, the translation for ‘grave’ was only for the homonymous adjective. In the sentence There’s a bright spot in every dark cloud, the translation for ‘spot’ was only for verbs. And the translation for ‘but’ in We love but once, for once only are we perfectly equipped for loving was ‘mais’ (not at all what it means here!). The translation tool couldn’t handle the first ‘for’ in this sentence, either.

Bliu Bliu’s claim that Bliu Bliu ‘knows you very well, every single word you know or don’t know’ is manifest nonsense and reveals a serious lack of understanding about what it means to know a word. However, as you spend more time on the system, a picture of your vocabulary knowledge is certainly built up, and the texts that are offered begin to move away from the one-liners. As reading (or listening to recorded texts) is the only learning task that is offered, the intrinsic interest of the texts is crucial. Here, again, I was disappointed. Texts that I was offered were sourced from IEEE Spectrum (The World’s Largest Professional Association for the Advancement of Technology), a site billing itself as the home of the #1 Internet News Show in the World, Latin America News and Analysis, the Google official blog (Meet 15 Finalists and Science in Action Winner for the 2013 Google Science Fair), MLB Trade Rumors (a clearinghouse for relevant, legitimate baseball rumors), and a long text entitled Robert Waldmann: Policy-Relevant Macro Is All in Samuelson and Solow (1960) from a blog called Brad DeLong’s Grasping Reality … with the Neural Network of a Moderately-Intelligent Cephalopod.

There is more curated content (selected from a menu which includes sections entitled ‘18+’ and ‘Controversial Jokes’). In these texts, words that the system thinks you won’t know (most of the proper nouns, for example) are highlighted. And there is a small library of novels where, again, predicted unknown words are highlighted in pink. These include Dostoyevsky, Kafka, Oscar Wilde, Gogol, Conan Doyle, Joseph Conrad, Goncharov (his Oblomov), H.P. Lovecraft, Joyce and Poe. You can also upload your own texts if you wish.

But, by this stage, I’d had enough and I clicked on the button to cancel my subscription. I shouldn’t have been surprised when the system crashed and a message popped up saying the system had encountered an error.

Like so many ‘language learning’ start-ups, Bliu Bliu seems to know a little, but not a lot about language learning. The Bliu Bliu blog has a video of Stephen Krashen talking about comprehensible input (it is misleadingly captioned ‘Stephen Krashen on Bliu Bliu’) in which he says that we all learn languages the same way, and that is when we get comprehensible input in a low anxiety environment. Influential though it has been, Krashen’s hypothesis remains a hypothesis, and it is generally accepted now that comprehensible input may be necessary, but it is not sufficient for language learning to take place.

The hypothesis hinges, anyway, on a definition of what is meant by ‘comprehensible’ and no one has come close to defining what precisely this means. Bliu Bliu has falsely assumed that comprehensibility can be determined by self-reporting of word knowledge, and this assumption is made even more problematic by the confusion of words (as sequences of letters) with lexical items. Bliu Bliu takes no account of lexical grammar or collocation (fundamental to any real word knowledge).

The name ‘Bliu Bliu’ was inspired by an episode from ‘Friends’ where Joey tries and fails to speak French. In the episode, according to the ‘Friends’ wiki, ‘Phoebe helps Joey prepare for an audition by teaching him how to speak French. Joey does not progress well and just speaks gibberish, thinking he’s doing a great job. Phoebe explains to the director in French that Joey is her mentally disabled younger brother so he’ll take pity on Joey.’ Bliu Bliu was an unfortunately apt choice of name.