
The paragraph above was written by an AI-powered text generator called neuroflash https://app.neuro-flash.com/home, which I told to produce a text on the topic ‘AI and education’. As texts on this topic go, it is both remarkable (in that it was not written by a human) and entirely unremarkable (in that it is practically indistinguishable from hundreds of human-written texts on the same subject). Neuroflash is built on GPT-3, a neural network technology known as a ‘large language model’ and described as ‘one of the most interesting and important AI systems ever produced’ (Chalmers, 2020). Basically, it generates text by predicting likely sequences of words, based on the patterns in the enormous quantities of text it was trained on. The nature of the paragraph above tells you all you need to know about the kinds of content that are usually found in texts about AI and education.
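To make the ‘predicting sequences of words’ point concrete, here is a minimal sketch of how this kind of text generation can be called from code. It uses the openly available GPT-2 model via the Hugging Face transformers library purely as a stand-in: I have no idea what neuroflash actually runs under the hood, beyond its claim to use GPT-3.

```python
# Minimal sketch of next-word prediction with a large language model.
# GPT-2 (freely downloadable) stands in for GPT-3, which is API-only;
# this illustrates the technique, not neuroflash itself.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "AI and education"
# The model repeatedly predicts a plausible next token, given the prompt
# and everything generated so far, until the length limit is reached.
output = generator(prompt, max_length=60, num_return_sequences=1)
print(output[0]["generated_text"])
```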

Not dissimilar from the neuroflash paragraph, educational commentary on uses of AI is characterised by (1) descriptions of AI tools already in use (e.g. speech recognition and machine translation) and (2) vague predictions which invariably refer to ‘the promise of personalised learning, adjusting what we give learners according to what they need to learn and keeping them motivated by giving them content that is of interest to them’ (Hughes, 2022). The question of what precisely will be personalised goes unanswered: providing learners with optimal sets of resources (but which ones?), or providing counselling services, recommendations or feedback for learners and teachers (but of what kind?) (Luckin, 2022). Nearly four years ago, I wrote https://adaptivelearninginelt.wordpress.com/2018/08/13/ai-and-language-teaching/ about the reasons why these questions remain unanswered. The short answer is that AI in language learning requires a ‘domain knowledge model’: a specification of what is to be learnt, along with an analysis of the steps that must be taken to reach that learning goal. Such a model is lacking in SLA, or, at least, there is no general agreement on what it should look like. Worse, the models that are most commonly adopted in AI-driven programs (e.g. the deliberate learning of discrete items of grammar and vocabulary) are not supported by either current theory or research (see, for example, VanPatten & Smith, 2022).

In 2021, the IATEFL Learning Technologies SIG organised an event dedicated to AI in education. Unsurprisingly, there was a fair amount of input on AI in assessment, but my interest is in how AI might revolutionize how we learn and teach, not how we assess. What concrete examples did speakers provide?

Rose Luckin, the best-known British expert on AI in education, kicked things off by mentioning three tools. One of these, Carnegie Learning, is a digital language course that looks very much like any of the ELT courses on offer from the big publishers – a fully blendable, multimedia (e.g. flashcards and videos) synthetic syllabus. This ‘blended learning solution’ is personalizable, since ‘no two students learn alike’, and, it claims, will develop a ‘lifelong love of language’. It appears to be premised on the ideas that language learning is a matter of optimizing the delivery of ‘content’, that this content consists primarily of discrete items, and that input can be equated with uptake. Been there, done that.

A second was Alelo Enskill https://www.alelo.com/about-us/ , a chatbot / avatar roleplay program, first developed by the US military to teach Iraqi Arabic and aspects of Iraqi culture to Marines. I looked at the limitations of chatbot technology for language learning here https://adaptivelearninginelt.wordpress.com/2016/12/01/chatbots/ . The third tool mentioned by Luckin was Duolingo. Enough said.

Another speaker at this event was the founder and CEO of Edugo.AI https://www.edugo.ai/ , an AI-powered LMS which uses GPT-3. It allows schools to ‘create and upload on the platform any kind of language material (audio, video, text…). Our AI algorithms process and convert it in gamified exercises, which engage different parts of the brain, and gets students eager to practice’. I wonder whether this speaker knows anything about gamification (for a quick read, I’d recommend Paul Driver, 2012) or neuroscience. What, for that matter, does he know about language learning? Apparently, ‘language is not just about words, language is about sentences’ (Tomasello, 2022). Hmm, this doesn’t inspire confidence.

When you look at current uses of AI in language learning, there is very little (outside of testing, translation and speech ↔ text applications) that could justify enthusiastic claims that AI has any great educational potential. Skepticism seems to me a more reasonable and scientific response: de omnibus dubitandum.

Education is not the only field where AI has been talked up. When Covid hit us, AI was seen as the game-changing technology. It ‘could be deployed to make predictions, enhance efficiencies, and free up staff through automation; it could help rapidly process vast amounts of information and make lifesaving decisions’ (Chakravorti, 2022). The contribution of AI to the development of vaccines has been huge, but its role in diagnosing and triaging patients has been another matter altogether. Hundreds of predictive tools were developed: ‘none of them made a real difference, and some were potentially harmful’ (Heaven, 2021). Expectations were unrealistic and led to the deployment of tools before they were properly trialled. Thirty months down the line, a much more sober understanding of the potential of AI has emerged. Here, then, are the main lessons that have been learnt (I draw particularly on Engler, 2020, and Chakravorti, 2022), lessons that are also relevant to education and language learning.

  • Anticipate what could go wrong before anticipating what might go right. Engler (2020) writes that ‘a poorly kept secret of AI practitioners is that 96% accuracy is suspiciously high for any machine learning problem’. In language learning, it is highly unlikely that personalized recommendations will ever reach anything even approaching this level of reliability. What are the implications for individual learners whose learning is inappropriately personalised?
  • We also know that a significant problem with AI systems is bias (O’Neil, 2016). There is a well-documented history of discriminatory outcomes because of people’s race, gender, social class or disability profile. Bias needs to be addressed proactively, not reactively.
  • Acknowledge from the outset that, for AI to work, huge amounts of data related to prior outcomes will be needed. In the cases of both Covid and language learning, much of this data will be personal. This raises immediate questions of privacy and consent, especially for learners who are children. Don’t minimise this issue.
  • Common by-products of AI use include insecure data storage, digital surveillance and consolidation of market power (Engler, 2020). Since such issues are so pervasive, assume that they will be a problem and that they need to be considered in evaluating the potential of any piece of AI edtech.
  • Find out what the subject matter experts have to say: ‘AI is only helpful when applied judiciously by subject-matter experts—people with long-standing experience with the problem that they are trying to solve […] Effectively predicting a badly defined problem is worse than doing nothing at all’ (Engler, 2020). It may be that the problem the AI is trying to help solve is the wrong problem and not worth solving at all (e.g. more efficient ‘delivery’ of explicit grammatical knowledge, gamifying gapfills).

To bring this post to an end, I returned to the AI / GPT-3 driven neuroflash program. This time, I asked it to write a longer piece entitled ‘The dangers of artificial intelligence in education: what could go wrong?’. Here’s what it came up with:

There are a few potential dangers of artificial intelligence in education. One is that AI could be used to replace human teachers. This could lead to a decline in the quality of education, as machines are not yet capable of replicating the full range of human emotions and social interactions that are necessary for teaching. Additionally, AI could be used to create biased or “racist” educational materials, as has already happened with some commercial software products. Finally, AI could be used to track and monitor students in ways that invade their privacy, or that could be used to manipulate their behavior.

All of these dangers are real and should be taken seriously. However, it’s important to remember that AI is still a new technology, and its impact on education is not yet fully understood. As such, we should tread cautiously when implementing AI into our educational systems.

When it comes to education, artificial intelligence could be a double-edged sword. While AI has the potential to revolutionize the field of education and help students learn in more efficient ways, there are also dangers that need to be considered.

One of the dangers of AI in education is that it could lead to a more standardized and homogenized form of learning. If all students are relying on AI-based learning tools, they may all end up learning the same things in the same way. This could lead to a loss of creativity and individuality in education.

Another danger is that AI-based learning tools could become too reliant on data and algorithms. This could lead to a situation where students are simply being taught what the machines think they should know, rather than being allowed to explore and discover new things for themselves.

Finally, there is the danger that AI could eventually replace human teachers altogether. While this may seem like a far-fetched scenario, it is not outside the realm of possibility. If AI-based learning tools become sophisticated enough, they may one day be able to do the job of teaching just as well as humans can.

These are just some of the dangers that need to be considered when it comes to artificial intelligence in education. While AI has the potential to revolutionize education, we need to be careful about how we implement it so that we don’t end up doing more harm than good.

References

Chakravorti, B. (2022) Why AI Failed to Live Up to Its Potential During the Pandemic. Harvard Business Review, March 17, 2022. https://hbr.org/2022/03/why-ai-failed-to-live-up-to-its-potential-during-the-pandemic

Chalmers, D. (2020) GPT-3 and General Intelligence. In Weinberg, J. (Ed.) Philosophers On GPT-3 (updated with replies by GPT-3). Daily Nous, July 30, 2020. https://dailynous.com/2020/07/30/philosophers-gpt-3/#chalmers

Driver, P. (2012) The Irony of Gamification. In English Digital Magazine 3, British Council Portugal, pp. 21 – 24 http://digitaldebris.info/digital-debris/2011/12/31/the-irony-of-gamification-written-for-ied-magazine.html

Engler, A. (2020) A guide to healthy skepticism of artificial intelligence and coronavirus. Washington D.C.: Brookings Institution https://www.brookings.edu/research/a-guide-to-healthy-skepticism-of-artificial-intelligence-and-coronavirus/

Heaven, W. D. (2021) Hundreds of AI tools have been built to catch covid. None of them helped. MIT Technology Review, July 30, 2021. https://www.technologyreview.com/2021/07/30/1030329/machine-learning-ai-failed-covid-hospital-diagnosis-pandemic/

Hughes, G. (2022) What lies at the end of the AI rainbow? IATEFL LTSIG Newsletter Issue April 2022

Luckin, R. (2022) The implications of AI for language learning and teaching. IATEFL LTSIG Newsletter Issue April 2022

O’Neil, C. (2016) Weapons of Math Destruction. London: Allen Lane

Tomasello, G. (2022) Next Generation of AI-Language Education Software: NLP & Language Modules (GPT3). IATEFL LTSIG Newsletter Issue April 2022

VanPatten, B. & Smith, M. (2022) Explicit and Implicit Learning in Second Language Acquisition. Cambridge: Cambridge University Press

There’s an aspect of language learning which everyone agrees is terribly important, but no one can quite agree on what to call it. I’m talking about combinations of words, including fixed expressions, collocations, phrasal verbs and idioms. These combinations are relatively fixed, and cannot always be predicted from their elements or generated by grammar rules (Laufer, 2022). They are referred to variously as formulaic sequences, formulaic expressions, lexical bundles, lexical chunks or multiword items, among other terms. They matter to English language learners because a large part of English consists of such combinations. Hill (2001) suggests this may be up to 70%. More conservative estimates report 58.6% of writing and 52.3% of speech (Erman & Warren, 2000). Some of these combinations (e.g. ‘of course’, ‘at least’) are so common that they fall into lists of the 1000 most frequent lexical items in the language.

By virtue of their ubiquity and frequency, they are important both for comprehension of reading and listening texts and for the speed at which texts can be processed. This is because knowledge of these combinations ‘makes discourse relatively predictable’ (Boers, 2020). Similarly, such knowledge can significantly contribute to spoken fluency because combinations ‘can be retrieved from memory as prefabricated units rather than being assembled at the time of speaking’ (Boers, 2020).

So far, so good, but from here on, the waters get a little muddier. Given their importance, what is the best way for a learner to acquire a decent stock of them? Are they best acquired through incidental learning (through meaning-focused reading and listening) or deliberate learning (e.g. with focused exercises or flashcards)? If the former, how on earth can we help learners to make sure that they get exposure to enough combinations enough times? If the latter, what kind of practice works best and, most importantly, which combinations should be selected? With, at the very least, many tens of thousands of such combinations, life is too short to learn them all in a deliberate fashion. Some sort of triage is necessary, but how should we go about this? Frequency of occurrence would be one obvious criterion, but this merely raises the question of what kind of database should be used to calculate frequency – the spoken discourse of children will reveal very different patterns from the written discourse of, say, applied linguists. On top of that, we cannot avoid consideration of the learners’ reasons for learning the language. If, as is statistically most probable, they are learning English to use as a lingua franca, how important or relevant is it to learn combinations that are frequent, idiomatic and comprehensible in native-speaker cultures, but may be rare and opaque in many English as a Lingua Franca contexts?

There are few, if any, answers to these big questions. Research (e.g. Pellicer-Sánchez, 2020) can give us pointers, but the bottom line is that we are left with a series of semi-informed options (see O’Keeffe et al., 2007: 58 – 99). So, when an approach comes along that claims to use software to facilitate the learning of English formulaic expressions (Lin, 2022), I am intrigued, to say the least.

The program is, slightly misleadingly, called IdiomsTube (https://www.idiomstube.com). A more appropriate title would have been IdiomaticityTube (as it focuses on ‘speech formulae, proverbs, sayings, similes, binomials, collocations, and so on’), but I guess ‘idioms’ is a more idiomatic word than ‘idiomaticity’. IdiomsTube allows learners to choose any English-captioned video from YouTube, which is then automatically analysed to identify from two to six formulaic expressions that are presented to the learner as learning objects. Learners are shown these items; the items are hyperlinked to (good) dictionary entries; learners watch the video and are then presented with a small variety of practice tasks. The system recommends particular videos, based on an automated analysis of their difficulty (speech rate and a frequency count of the lexical items they include) and on recommendations from previous users. The system is gamified and, for class use, teachers can track learner progress.

When an article by the program’s developer, Phoebe Lin (in my view, more of an advertising piece than an academic one), came out in the ReCALL journal, she tweeted that she’d love feedback. I reached out but didn’t hear back. My response here is partly an evaluation of Dr Lin’s program, partly a reflection on how far technology can go in solving some of the knotty problems of language learning.

Incidental and deliberate learning

Researchers have long been interested in looking for ways of making incidental learning of lexical items more likely to happen (Boers, 2021: 39 ff.) – that is, ways of making it more likely that learners will notice lexical items while focusing on the content of a text. Most obviously, texts can be selected, written or modified so they contain multiple instances of a particular item (‘input flooding’). Alternatively, texts can be typographically enhanced so that particular items are highlighted in some way. But these approaches are not possible when learners are given the freedom to select any video from YouTube and when the written presentations are in the form of YouTube captions. Instead, IdiomsTube presents the items before the learner watches the video. Learners are, in effect, told to watch out for these items in advance. They are also given practice tasks after viewing.

The distinction between incidental and deliberate vocabulary learning is not always crystal-clear. In this case, it seems fairly clear that the approach is more slanted to deliberate learning, even though the selection of video by the learner is determined by a focus on content. Whether this works or not will depend on (1) the level-appropriacy of the videos that the learner watches, (2) the effectiveness of the program in recommending / identifying appropriate videos, (3) the ability of the program to identify appropriate formulaic expressions as learning targets in each video, and (4) the ability of the program to generate appropriate practice of these items.

Evaluating the level of YouTube videos

What makes a video easy or hard to understand? IdiomsTube attempts this analytical task by calculating (1) the speed of the speech and (2) the difficulty of the lexis as determined by the corpus frequency of these items. This gives a score out of five for each category (speed and difficulty). I looked at fifteen videos, all of which were recommended by the program. Most were scored at Speed #3 and Difficulty #1. One of them, ‘Bruno Mars Carpool Karaoke’, had a speed of #2 and a difficulty of #1 (i.e. one of the easiest). The video is 15 minutes long. Here’s an extract from the first 90 seconds:

Let’s set this party off right, put yo’ pinky rings up to the moon, twenty four karat magic in the air, head to toe soul player, second verse for the hustlas, gangstas, bad bitches and ya ugly ass friends, I gotta show how a pimp get it in, and they waking up the rocket why you mad

Whoa! Without going into details, it’s clear that something has gone seriously wrong. Evaluating the difficulty of language, especially spoken language, is extremely complex (not least because there’s no objective measure of such a thing). It’s not completely dissimilar to the challenge of evaluating the accuracy, appropriacy and level of sophistication of a learner’s spoken language, and we’re a long way from being able to do that with any acceptable level of reliability. At the very least, we’re a long, long way from being able to do it well when there are no constraints on the kind of text (which is the case when taking the whole of YouTube as a potential source). If we significantly restrict topic and text type, we can train software to do a much better job, but this requires human input: it cannot be fully automated.
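For what it’s worth, a two-factor score of the kind described above is easy to sketch, and the sketch makes its blind spots obvious: it sees only words-per-minute and word frequency, and is blind to idiomaticity, register and coherence. This is my own toy version, not IdiomsTube’s code, and the thresholds and word list are invented placeholders:

```python
# Toy two-factor difficulty score for a captioned video: (1) speech rate in
# words per minute, (2) share of words outside a 'common words' list.
# Thresholds and the common-word list are invented placeholders.

def band(value, thresholds):
    """Map a raw value onto a 1-5 scale using ascending thresholds."""
    return 1 + sum(value > t for t in thresholds)

def score_video(caption_words, duration_minutes, common_words):
    speed_wpm = len(caption_words) / duration_minutes
    rare_share = sum(w not in common_words for w in caption_words) / len(caption_words)
    speed = band(speed_wpm, [100, 130, 160, 190])          # 1 (slow) to 5 (fast)
    difficulty = band(rare_share, [0.05, 0.1, 0.2, 0.3])   # 1 (easy) to 5 (hard)
    return speed, difficulty

# Toy data: 14 caption words spoken in six seconds (0.1 minutes).
common = {"set", "this", "party", "off", "right", "put", "up", "to", "the", "moon"}
words = "let's set this party off right put yo pinky rings up to the moon".split()
print(score_video(words, duration_minutes=0.1, common_words=common))  # (3, 4)
```

Nothing in a score like this registers that the lyric is elliptical, culturally loaded and, for a learner, largely incomprehensible.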

The length of these 15 videos ranged from 3.02 to 29.27 minutes, with the mean length being about 10 minutes, and the median 8.32 minutes. Too damn long.

Selecting appropriate learning items

The automatic identification of formulaic language in a text presents many challenges: it is, as O’Keeffe et al. (2007: 82) note, only partially possible. A starting point is usually a list, and IdiomsTube begins with a list of 53,635 items compiled by the developer (Lin, 2022) over a number of years. The software has to match word combinations in the text to items in the list, and has to recognise variant forms. Formulaic language cannot always be identified just by matching to lists of forms: a piece of cake may just be a piece of cake, and therefore not a piece of cake to analyse. 53,635 items may sound like a lot, but a common estimate of the number of idioms in English is 25,000. The number of multiword units is much, much higher. 53,635 items is not going to be enough for any reliable capture.
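To see why list-matching only gets you part of the way, consider a naive matcher of the kind any developer might start with (this is my own toy illustration, not Lin’s implementation): it slides a window over the caption text and looks the window up in a phrase list. It finds exact surface matches, but it misses inflected or interrupted variants and has no way of telling a literal piece of cake from an idiomatic one.

```python
# Naive list-based matching of formulaic expressions: check every n-word window
# against a phrase list. A toy illustration only - it misses variant forms
# ("ran out of steam") and cannot separate literal from idiomatic uses.
import re

PHRASES = {
    ("a", "piece", "of", "cake"),
    ("by", "the", "way"),
    ("run", "out", "of", "steam"),
}

def find_formulaic(text, phrases=PHRASES, max_len=6):
    tokens = re.findall(r"[a-z']+", text.lower())
    found = []
    for i in range(len(tokens)):
        for n in range(2, max_len + 1):
            window = tuple(tokens[i:i + n])
            if window in phrases:
                found.append(" ".join(window))
    return found

print(find_formulaic("The exam was a piece of cake, by the way."))
# ['a piece of cake', 'by the way'] - but "ran out of steam" would be missed
```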

Since any given text is likely to contain a lot of formulaic language, the next task is to decide how to select for presentation (i.e. as learning objects) from those identified. The challenge is, as Lin (2022) remarks, both technical and theoretical: how can frequency and learnability be measured? There are no easy answers, and the approach of IdiomsTube is, by its own admission, crude. The algorithm prioritises longer items that contain lower frequency single items, and which have a low frequency of occurrence in a corpus of 40,000 randomly-sampled YouTube videos. The aim is to focus on formulaic language that is ‘more challenging in terms of composition (i.e. longer and made up of more difficult words) and, therefore, may be easier to miss due to their infrequent appearance on YouTube’. My immediate reaction is to wonder whether this approach will simply prioritise items that are not worth the bother of deliberate learning in the first place.
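A toy version of this kind of ranking (the weights and frequencies below are invented, and this is my reading of the paper rather than Lin’s actual algorithm) makes the worry concrete: long items made of rarer words, which are themselves rare in the corpus, float straight to the top.

```python
# Toy priority score in the spirit of the description above: favour longer
# expressions, built from lower-frequency words, that are themselves rare in
# a caption corpus. All numbers are invented for illustration.

def priority(expression, word_freq, phrase_freq):
    words = expression.split()
    length = len(words)
    word_rarity = sum(1.0 / word_freq.get(w, 1) for w in words) / length
    corpus_rarity = 1.0 / phrase_freq.get(expression, 1)
    return length * word_rarity * corpus_rarity

word_freq = {"by": 90000, "the": 250000, "way": 40000,
             "run": 30000, "out": 60000, "of": 200000, "steam": 800}
phrase_freq = {"by the way": 5000, "run out of steam": 40}

for expr in ["by the way", "run out of steam"]:
    print(f"{expr}: {priority(expr, word_freq, phrase_freq):.2e}")
# 'run out of steam' outranks 'by the way' by several orders of magnitude
```

Whether ‘run out of steam’ deserves more of a learner’s deliberate attention than ‘by the way’ is, of course, exactly the theoretical question that the scoring cannot answer.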

The proof is in the proverbial pudding, so I looked at the learning items that were offered by my sample of 15 recommended videos. Sadly, IdiomsTube does not even begin to cut the mustard. The rest of this section details why the selection was so unsatisfactory: you may want to skip this and rejoin me at the start of the next section.

  • In total, 85 target items were suggested. Of these, 39 (just under half) were not fixed expressions: they were single items. Some of these single items (e.g. ‘blog’ and ‘password’) would be extremely easy for most learners. Of the others, 5 were opaque idioms (the most prototypical kind of idiom); the rest were collocations and fixed (but transparent) phrases and frames.
  • Some items (e.g. ‘I rest my case’) are limited in terms of the contexts in which they can be appropriately used.
  • Some items did not appear to be idiomatic in any way. ‘We need to talk’ and ‘able to do it’, for example, are strange selections, compared to others in their respective lists. They are also very ‘easy’: if you don’t readily understand items like these, you wouldn’t have a hope in hell of understanding the video.
  • There were a number of errors in the recommended target items. Errors included duplication of items within one set (‘get in the way’ + ‘get in the way of something’), misreading of an item (‘the shortest’ misread as ‘the shorts’), mislabelling of an item (‘vend’ instead of ‘vending machine’), linking to the wrong dictionary entry (e.g. ‘mini’ links to ‘miniskirt’, although in the video ‘mini’ = ‘small’, or, in another video, ‘stoke’ links to ‘stoked’, which is rather different!).
  • The selection of fixed expressions is sometimes very odd. In one video, the following items have been selected: get into an argument, vend, from the ground up, shovel, we need to talk, prefecture. The video contains others which would seem to be better candidates, including ‘You can’t tell’ (which appears twice), ‘in charge of’, ‘way too’ (which also appears twice), and ‘by the way’. It would seem, therefore, that some inappropriate items are selected, whilst other more appropriate ones are omitted.
  • There is a wide variation in the kind of target item. One set, for example, included: in order to do, friction, upcoming, run out of steam, able to do it, notification. Cross-checking with Pearson’s Global Scale of English, we have items ranging from A2 to C2+.

The challenges of automation

IdiomsTube comes unstuck on many levels. It fails to recommend appropriate videos to watch. It fails to suggest appropriate language to learn. It fails to provide appropriate practice. You wouldn’t know this from reading the article by Phoebe Lin in the ReCALL journal, which does, however, suggest that ‘further improvements in the design and functions of IdiomsTube are needed’. Necessary they certainly are, but the interesting question is how possible they are.

My interest in IdiomsTube comes from my own experience in an app project which attempted to do something not completely dissimilar. We wanted to be able to evaluate the idiomaticity of learner-generated language, and this entailed identifying formulaic patterns in a large corpus. We wanted to develop a recommendation engine for learning objects (i.e. the lexical items) by combining measures of frequency and learnability. We wanted to generate tasks to practise collocational patterns, by trawling the corpus for contexts that lent themselves to gapfills. With some of these challenges, we failed. With others, we found a stopgap solution in human curation, writing and editing.

IdiomsTube is interesting, not because of what it tells us about how technology can facilitate language learning. It’s interesting because it tells us about the limits of technological applications to learning, and about the importance of sorting out theoretical challenges before the technical ones. It’s interesting as a case study in how not to go about developing an app: its ‘special enhancement features such as gamification, idiom-of-the-day posts, the IdiomsTube Teacher’s interface and IdiomsTube Facebook and Instagram pages’ are pointless distractions when the key questions have not been resolved. It’s interesting as a case study of something that should not have been published in an academic journal. It’s interesting as a case study of how techno-enthusiasm can blind you to the possibility that some learning challenges do not have solutions that can be automated.

References

Boers, F. (2020) Factors affecting the learning of multiword items. In Webb, S. (Ed.) The Routledge Handbook of Vocabulary Studies. Abingdon: Routledge. pp. 143 – 157

Boers, F. (2021) Evaluating Second Language Vocabulary and Grammar Instruction. Abingdon: Routledge

Erman, B. & Warren, B. (2000) The idiom principle and the open choice principle. Text, 20 (1): pp. 29 – 62

Hill, J. (2001) Revising priorities: from grammatical failure to collocational success. In Lewis, M. (Ed.) Teaching Collocation: further development in the Lexical Approach. Hove: LTP. pp. 47 – 69

Laufer, B. (2022) Formulaic sequences and second language learning. In Szudarski, P. & Barclay, S. (Eds.) Vocabulary Theory, Patterning and Teaching. Bristol: Multilingual Matters. pp. 89 – 98

Lin, P. (2022) Developing an intelligent tool for computer-assisted formulaic language learning from YouTube videos. ReCALL 34 (2): pp. 185 – 200

O’Keeffe, A., McCarthy, M. & Carter, R. (2007) From Corpus to Classroom. Cambridge: Cambridge University Press

Pellicer-Sánchez, A. (2020) Learning single words vs. multiword items. In Webb, S. (Ed.) The Routledge Handbook of Vocabulary Studies. Abingdon: Routledge. pp. 158 – 173

NB This is an edited version of the original review.

Words & Monsters is a new vocabulary app that has caught my attention. There are three reasons for this. Firstly, because it’s free. Secondly, because I was led to believe (falsely, as it turns out) that two of the people behind it are Charles Browne and Brent Culligan, eminently respectable linguists, who were also behind the development of the New General Service List (NGSL), based on data from the Cambridge English Corpus. And thirdly, because a lot of thought, effort and investment have clearly gone into the gamification of Words & Monsters (WAM). It’s to the last of these that I’ll turn my attention first.

WAM teaches vocabulary in the context of a battle between a player’s avatar and a variety of monsters. If users can correctly match a set of target items to definitions or translations in the available time, they ‘defeat’ the monster and accumulate points. The more points you have, the higher you advance through a series of levels and ranks. There are bonuses for meeting daily and weekly goals, there are leaderboards, and trophies and medals can be won. In addition to points, players also win ‘crystals’ after successful battles, and these crystals can be used to buy accessories which change the appearance of the avatar and give the player added ‘powers’. I was never able to fully understand precisely how these ‘powers’ affected the number of points I could win in battle. It remained as baffling to me as the system of values attached to Pokémon cards, which is presumably a large part of the inspiration here. Perhaps others, more used to games like Pokémon, would find it all much more transparent.

The system of rewards is all rather complicated, but perhaps this doesn’t matter too much. In fact, it might be the case that working out how reward systems work is part of what motivates people to play games. But there is another aspect to this: the app’s developers refer in their bumf to research by Howard-Jones and Jay (2016), which suggests that when rewards are uncertain, more dopamine is released in the mid-brain and this may lead to reinforcement of learning, and, possibly, enhancement of declarative memory function. Possibly … but Howard-Jones and Jay point out that ‘the science required to inform the manipulation of reward schedules for educational benefit is very incomplete.’ So, WAM’s developers may be jumping the gun a little and overstating the applicability of the neuroscientific research, but they’re not alone in that!

If you don’t understand a reward system, it’s certain that the rewards are uncertain. But WAM takes this further in at least two ways. Firstly, when you win a ‘battle’, you have to click on a plain treasure bag to collect your crystals, and you don’t know whether you’ll get one, two, three or zero crystals. You are given a semblance of agency, but, essentially, the whole thing is random. Secondly, when you want to convert your crystals into accessories for your avatar, random selection determines which accessory you receive, even though, again, there is a semblance of agency. Different accessories have different power values. This extended use of what the developers call ‘the thrill of uncertain rewards’ is certainly interesting, but how effective it is is another matter. My own reaction to getting no crystals, or an avatar accessory that I didn’t want, after quite some time spent ‘studying’, was primarily frustration rather than motivation to carry on. I have no idea how typical my reaction (more ‘treadmill’ than ‘thrill’) might be.
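In game-design terms, this is a variable-ratio reward schedule, and it takes only a few lines to mock one up (the probabilities below are my own guesses, not WAM’s):

```python
# Toy variable-ratio ('uncertain') reward schedule of the kind described above.
# The weights are invented; WAM's actual drop rates are not published.
import random

def open_treasure_bag():
    # The player clicks the bag, but the payout is random: 0-3 crystals.
    return random.choices([0, 1, 2, 3], weights=[25, 40, 25, 10])[0]

def redeem_crystals(wardrobe):
    # The 'choice' of accessory is only a semblance of agency: it is random too.
    return random.choice(wardrobe)

print(open_treasure_bag())
print(redeem_crystals(["wizard hat", "dragon shield", "sparkly boots"]))
```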

Unsurprisingly, for an app that has so obviously thought carefully about gamification, players are encouraged to interact with each other. As part of the early promotion, WAM is running, from 15 November to 19 December, a free ‘team challenge tournament’, allowing teams of up to 8 players to compete against each other. Ingeniously, it would appear to allow teams and players of varying levels of English to play together, with the app’s algorithms determining each individual’s level of lexical knowledge and therefore the items that will be presented / tested. Social interaction is known to be an important component of successful games (Dehghanzadeh et al., 2019), but for vocabulary apps there’s a huge challenge. In order to learn vocabulary from an app, learners need to put in time – on a regular basis. Team challenge tournaments may help with initial on-boarding of players, but, in the end, learning from a vocabulary app is inevitably and largely a solitary pursuit. Over time, social interaction is unlikely to be maintained, and it is, in any case, of a very limited nature. The other features of successful games – playful freedom and intrinsically motivating tasks (Driver, 2012) – are also absent from vocabulary apps. Playful freedom is mostly incompatible with points, badges and leaderboards. And flashcard tasks, however intrinsically motivating they may be at the outset, will always become repetitive after a while. In the end, what’s left, for those users who hang around long enough, is the reward system.

It’s also worth noting that this free challenge is of limited duration: it is a marketing device attempting to push you towards the non-free use of the app, once the initial promotion is over.

Gamified motivation tools are only of value, of course, if they motivate learners to spend their time doing things that are of clear learning value. To evaluate the learning potential of WAM, then, we need to look at the content (the ‘learning objects’) and the learning tasks that supposedly lead to acquisition of these items.

When you first use WAM, you need to play for about 20 minutes, at which point algorithms determine ‘how many words [you] know and [you can] see scores for English tests such as; TOEFL, TOEIC, IELTS, EIKEN, Kyotsu Shiken, CEFR, SAT and GRE’. The developers claim that these scores correlate pretty highly with actual test scores: ‘they are about as accurate as the tests themselves’, they say. If Browne and Culligan had been behind the app, I would have been tempted to accept the claim – with reservations: after all, it still allows for one item out of 5 to be wrongly identified. But, what is this CEFR test score that is referred to? There is no CEFR test, although many tests are correlated with CEFR. The two tools that I am most familiar with which allocate CEFR levels to individual words – Cambridge’s English Vocabulary Profile and Pearson’s Global Scale of English – often conflict in their results. I suspect that ‘CEFR’ was just thrown into the list of tests as an attempt to broaden the app’s appeal.
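As for the score estimates themselves, the write-up gives no detail about how they are produced, but placement of this kind is typically done by sampling items from successive frequency bands and extrapolating from the proportion the learner gets right in each band. The sketch below shows that general technique only; it is my assumption, not a description of WAM’s algorithm:

```python
# Generic vocabulary-size estimate by frequency-band sampling: test a handful
# of items per 1000-word band and extrapolate from the proportion known.
# This illustrates the general technique, not WAM's actual algorithm.

def estimate_vocab_size(results_by_band, band_size=1000):
    """results_by_band maps band number -> (items_correct, items_tested)."""
    total = 0.0
    for correct, tested in results_by_band.values():
        total += (correct / tested) * band_size
    return round(total)

# Twenty items per band; accuracy falls off as the words get less frequent.
results = {1: (20, 20), 2: (19, 20), 3: (16, 20), 4: (11, 20), 5: (5, 20)}
print(estimate_vocab_size(results))  # 3550
```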

English target words are presented and practised with their translation ‘equivalents’ in Japanese. For the moment, Japanese is the only language available, which means the app is of little use to learners who don’t know any Japanese. It’s now well-known that bilingual pairings are more effective in deliberate language learning than using definitions in the same language as the target items. This becomes immediately apparent when, for example, a word like ‘something’ is defined (by WAM) as ‘a thing not known or specified’ and ‘anything’ as ‘a thing of whatever kind’. But although I’m in no position to judge the Japanese translations, there are reasons why I would want to check the spreadsheet before recommending the app. ‘Lady’ is defined as ‘polite word for a woman’; ‘missus’ is defined as ‘wife’; and ‘aye’ is defined as ‘yes’. All of these definitions are, at best, problematic; at worst, they are misleading. Are the Japanese translations more helpful? I wonder … Perhaps these are simply words that do not lend themselves to flashcard treatment?

Because I tested into the app at C1 level, I was not able to evaluate the selection of words at lower levels. A pity. Instead, I was presented with words like ‘ablution’, ‘abrade’, ‘anode’, and ‘auspice’. The app claims to be suitable ‘for both second-language learners and native speakers’. For lower levels of the former, this may be true (but without looking at the lexical spreadsheets, I can’t tell). But for higher levels, however much fun this may be for some people, it seems unlikely that you’ll learn very much of any value. Outside of words in, say, the top 8000 frequency band, it is practically impossible to differentiate the ‘surrender value’ of words in any meaningful way. Deliberate learning of vocabulary only makes sense with high frequency words that you have a chance of encountering elsewhere. You’d be better off reading, extensively, rather than learning random words from an app. Words which (for reasons I’ll come on to) you probably won’t actually learn anyway.

With very few exceptions, the learning objects in WAM are single words, rather than phrases, even when the item is of little or no value outside its use in a phrase. ‘Betide’ is defined as ‘to happen to; befall’ but this doesn’t tell a learner much that is useful. It’s practically only ever used following ‘woe’ (but what does ‘woe’ mean?!). Learning items can be checked in the ‘study guide’, which will show that ‘betide’ typically follows ‘woe’, but unless you choose to refer to the study guide (and there’s no reason, in a case like this, that you would know that you need to check things out more fully), you’ll be none the wiser. In other words, checking the study guide is unlikely to betide you. ‘Wee’, as another example, is treated as two items: (1) meaning ‘very small’ as in ‘wee baby’, and (2) meaning ‘very early in the morning’ as in ‘in the wee hours’. For the latter, ‘wee’ can only collocate with ‘in the’ and ‘hours’, so it makes little sense to present it as a single word. This is also an example of how, in some cases, different meanings of particular words are treated as separate learning objects, even when the two meanings are very close and, in my view, are hardly worth learning separately. Examples include ‘czar’ and ‘assonance’. Sometimes, cognates are treated as separate learning objects (e.g. ‘adulterate’ and ‘adulteration’ or ‘dolor’ and ‘dolorous’); with other words (e.g. ‘effulgence’), only one grammatical form appears to be given. I could not begin to figure out any rationale behind any of this.

All in all, then, there are reasons to be a little skeptical about some of the content. Up to level B2 – which, in my view, is the highest level at which it makes sense to use vocabulary flashcards – it may be of value, so long as your first language is Japanese. But given the claim that it can help you prepare for the ‘CEFR test’, I have to wonder …

The learning tasks require players to match target items to translations / definitions (in both directions), with the target item sometimes in written form, sometimes spoken. Users do not, as far as I can tell, ever have to produce the target item: they only have to select. The learning relies on spaced repetition, but there is no generative effect (known to enhance memorisation). When I was experimenting, there were a few words that I did not know, but I was usually able to get the correct answer by eliminating the distractors (a choice of one from three gives players a reasonable chance of guessing correctly). WAM does not teach users how to produce words; its focus is on receptive knowledge (of a limited kind). I learn, for example, what a word like ‘aye’ or ‘missus’ kind of means, but I learn nothing about how to use it appropriately. Contrary to the claims in WAM’s bumf (that ‘all senses and dimensions of each word are fully acquired’), reading and listening comprehension speeds may be improved, but appropriate and accurate use of these words in speaking and writing is much less likely to follow. Does WAM really ‘strengthen and expand the foundation levels of cognition that support all higher level thinking’, as is claimed?
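The spacing itself is standard stuff. A minimal Leitner-style scheduler, of the kind most flashcard systems are built on, looks something like the sketch below; this is a generic illustration of the technique, not WAM’s actual scheduler, and note that it records only whether the learner recognised the item, never whether they can produce it:

```python
# Generic Leitner-style spaced repetition: a correct answer moves the card up
# a box (reviewed at a longer interval); a miss sends it back to box 1.
# An illustration of the general technique, not WAM's scheduler.
from datetime import date, timedelta

INTERVALS = {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}  # days until next review, per box

def review(card, answered_correctly, today=None):
    today = today or date.today()
    card["box"] = min(card["box"] + 1, 5) if answered_correctly else 1
    card["due"] = today + timedelta(days=INTERVALS[card["box"]])
    return card

card = {"item": "aye", "box": 1, "due": date.today()}
card = review(card, answered_correctly=True)   # recognised: move to box 2
print(card["box"], card["due"])
```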

Perhaps it’s unfair to mention some of the more dubious claims of WAM’s promotional material, but here is a small selection, anyway: ‘WAM unleashes the full potential of natural motivation’. ‘WAM promotes Flow by carefully managing the ratio of unknown words. Your mind moves freely in the channel below frustration and above boredom’.

WAM is certainly an interesting project, but, like all the vocabulary apps I have ever looked at, there have to be trade-offs between optimal task design and what will fit on a mobile screen, between freedoms and flexibility for the user and the requirements of gamified points systems, between the amount of linguistic information that is desirable and the amount that spaced repetition can deal with, between attempting to make the app suitable for the greatest number of potential users and making it especially appropriate for particular kinds of users. Design considerations are always a mix of the pedagogical and the practical / commercial. And, of course, the financial. And, like most edtech products, the claims for its efficacy need to be treated with a bucket of salt.

References

Dehghanzadeh, H., Fardanesh, H., Hatami, J., Talaee, E. & Noroozi, O. (2019) Using gamification to support learning English as a second language: a systematic review, Computer Assisted Language Learning, DOI: 10.1080/09588221.2019.1648298

Driver, P. (2012) The Irony of Gamification. In English Digital Magazine 3, British Council Portugal, pp. 21 – 24 http://digitaldebris.info/digital-debris/2011/12/31/the-irony-of-gamification-written-for-ied-magazine.html

Howard-Jones, P. & Jay, T. (2016) Reward, learning and games. Current Opinion in Behavioral Sciences, 10: 65 – 72

In my last post, I looked at shortcomings in edtech research, mostly from outside the world of ELT. I made a series of recommendations of ways in which such research could become more useful. In this post, I look at two very recent collections of ELT edtech research. The first of these is Digital Innovations and Research in Language Learning, edited by Mavridi and Saumell, and published this February by the Learning Technologies SIG of IATEFL. I’ll refer to it here as DIRLL. It’s available free to IATEFL LT SIG members, and can be bought for $10.97 as an ebook on Amazon (US). The second is the most recent edition (February 2020) of the Language Learning & Technology journal, which is open access and available here. I’ll refer to it here as LLTJ.

In both of these collections, the focus is not on ‘technology per se, but rather issues related to language learning and language teaching, and how they are affected or enhanced by the use of digital technologies’. However, they are very different kinds of publication. Nobody involved in the production of DIRLL got paid in any way (to the best of my knowledge) and, in keeping with its provenance from a teachers’ association, the collection has ‘a focus on the practitioner as teacher-researcher’. Almost all of the contributing authors are university-based, but they are typically involved more in language teaching than in research. With one exception (a grant from the EU), their work was unfunded.

The triannual LLTJ is funded by two American universities and published by the University of Hawaii Press. The editors and associate editors are well-known scholars in their fields. The journal’s impact factor is high, close to the impact factor of the paywalled ReCALL (published by Cambridge University Press), which is the highest-ranking journal in the field of CALL. The contributing authors are all university-based, many with a string of published articles (in prestigious journals), chapters or books behind them. At least six of the studies were funded by national grant-awarding bodies.

I should begin by making clear that there was much in both collections that I found interesting. However, it was not usually the research itself that I found informative, but the literature review that preceded it. Two of the chapters in DIRLL were not really research, anyway. One was the development of a template for evaluating ICT-mediated tasks in CLIL, another was an advocacy of comics as a resource for language teaching. Both of these were new, useful and interesting to me. LLTJ included a valuable literature review of research into VR in FL learning (but no actual new research). With some exceptions in both collections, though, I felt that I would have been better off curtailing my reading after the reviews. Admittedly, there wouldn’t be much in the way of literature reviews if there were no previous research to report …

It was no surprise to see that the learners who were the subjects of this research were overwhelmingly university students. In fact, only one article (about a high-school project in Israel, reported in DIRLL) was not about university students. The research areas reflected this bias towards tertiary contexts: online academic reading skills, academic writing, online reflective practices in teacher training programmes, etc.

In a couple of cases, the selection of experimental subjects seemed plain bizarre. Why, if you want to find out about the extent to which Moodle use can help EAP students become better academic readers (in DIRLL), would you investigate this with a small volunteer cohort of postgraduate students of linguistics, with previous experience of using Moodle and experience of teaching? Is a less representative sample imaginable? Why, if you want to investigate the learning potential of the English File Pronunciation app (reported in LLTJ), which is clearly most appropriate for A1 – B1 levels, would you do this with a group of C1-level undergraduates following a course in phonetics as part of an English Studies programme?

More problematic, in my view, was the small sample size in many of the research projects. The Israeli virtual high school project (DIRLL), previously referred to, started out with only 11 students, but 7 dropped out, primarily, it seems, because of institutional incompetence: ‘the project was probably doomed […] to failure from the start’, according to the author. Interesting as this was as an account of how not to set up a project of this kind, it is simply impossible to draw any conclusions from 4 students about the potential of a VLE for ‘interaction, focus and self-paced learning’. The questionnaire investigating experience of and attitudes towards VR (in DIRLL) was completed by only 7 (out of 36 possible) students and 7 (out of 70+ possible) teachers. As the author acknowledges, ‘no great claims can be made’, but then goes on to note the generally ‘positive attitudes to VR’. Perhaps those who did not volunteer had different attitudes? We will never know. The study of motivational videos in tertiary education (DIRLL) started off with 15 subjects, but 5 did not complete the necessary tasks. The research into L1 use in videoconferencing (LLTJ) started off with 10 experimental subjects, all with the same L1 and similar cultural backgrounds, but there was no data available from 4 of them (because they never switched into L1). The author claims that the paper demonstrates ‘how L1 is used by language learners in videoconferencing as a social semiotic resource to support social presence’ – something which, after reading the literature review, we already knew. But the paper also demonstrates quite clearly how L1 is not used by language learners in videoconferencing as a social semiotic resource to support social presence. In all these cases, it is the participants who did not complete or the potential participants who did not want to take part that have the greatest interest for me.

Unsurprisingly, the LLTJ articles had larger sample sizes than those in DIRLL, but in both collections the length of the research was limited. The production of one motivational video (DIRLL) does not really allow us to draw any conclusions about the development of students’ critical thinking skills. Two four-week interventions do not really seem long enough to me to discover anything about learner autonomy and Moodle (DIRLL). An experiment looking at different feedback modes needs more than two written assignments to reach any conclusions about student preferences (LLTJ).

More research might well be needed to compensate for the short-term projects with small sample sizes, but I’m not convinced that this is always the case. Lacking sufficient information about the content of the technologically-mediated tools being used, I was often unable to reach any conclusions. A gamified Twitter environment was developed in one project (DIRLL), using principles derived from contemporary literature on gamification. The authors concluded that the game design ‘failed to generate interaction among students’, but without knowing a lot more about the specific details of the activity, it is impossible to say whether the problem was the principles or the particular instantiation of those principles. Another project, looking at the development of pronunciation materials for online learning (LLTJ), came to the conclusion that online pronunciation training was helpful – better than none at all. Claims are then made about the value of the method used (called ‘innovative Cued Pronunciation Readings’), but this is not compared to any other method / materials, and only a very small selection of these materials is illustrated. Basically, the reader of this research has no choice but to take things on trust. The study looking at the use of Alexa to help listening comprehension and speaking fluency (LLTJ) cannot really tell us anything about intelligent personal assistants (IPAs) unless we know more about the particular way that Alexa is being used. Here, it seems that the students were using Alexa in an interactive storytelling exercise, but so little information is given about the exercise itself that I didn’t actually learn anything at all. The author’s own conclusion is that the results, such as they are, need to be treated with caution. Nevertheless, he adds ‘the current study illustrates that IPAs may have some value to foreign language learners’.

This brings me onto my final gripe. To be told that IPAs like Alexa may have some value to foreign language learners is to be told something that I already know. This wasn’t the only time this happened during my reading of these collections. I appreciate that research cannot always tell us something new and interesting, but a little more often would be nice. I ‘learnt’ that goal-setting plays an important role in motivation and that gamification can boost short-term motivation. I ‘learnt’ that reflective journals can take a long time for teachers to look at, and that reflective video journals are also very time-consuming. I ‘learnt’ that peer feedback can be very useful. I ‘learnt’ from two papers that intercultural difficulties may be exacerbated by online communication. I ‘learnt’ that text-to-speech software is pretty good these days. I ‘learnt’ that multimodal literacy can, most frequently, be divided up into visual and auditory forms.

With the exception of a piece about online safety issues (DIRLL), I did not once encounter anything which hinted that there may be problems in using technology. No mention of the use to which student data might be put. No mention of the costs involved (except for the observation that many students would not be happy to spend money on the English File Pronunciation app) or the cost-effectiveness of digital ‘solutions’. No consideration of the institutional (or other) pressures (or the reasons behind them) that may be applied to encourage teachers to ‘leverage’ edtech. No suggestion that a zero-tech option might actually be preferable. In both collections, the language used is invariably positive, or, at least, technology is associated with positive things: uncovering the possibilities, promoting autonomy, etc. Even if the focus of these publications is not on technology per se (although I think this claim doesn’t really stand up to close examination), it’s a little disingenuous to claim (as LLTJ does) that the interest is in how language learning and language teaching are ‘affected or enhanced by the use of digital technologies’. The reality is that the overwhelming interest is in potential enhancements, not potential negative effects.

I have deliberately not mentioned any names in referring to the articles I have discussed. I would, though, like to take my hat off to the editors of DIRLL, Sophia Mavridi and Vicky Saumell, for attempting to do something a little different. I think that Alicia Artusi and Graham Stanley’s article (DIRLL) about CPD for ‘remote’ teachers was very good and should interest the huge number of teachers working online. Chryssa Themelis and Julie-Ann Sime have kindled my interest in the potential of comics as a learning resource (DIRLL). Yu-Ju Lan’s article about VR (LLTJ) is surely the most up-to-date, go-to article on this topic. There were other pieces, or parts of pieces, that I liked, too. But, to me, it’s clear that what is needed is not so much ‘more research’ as (1) better and more critical research, and (2) more digestible summaries of research.

About two and a half years ago, when I started writing this blog, there was a lot of hype around adaptive learning and the big data which might drive it. Two and a half years is a long time in technology. A look at Google Trends suggests that interest in adaptive learning has been pretty static for the last couple of years. It’s interesting to note that 3 of the 7 lettered points on this graph are Knewton-related media events (including the most recent, A, which is Knewton’s latest deal with Hachette) and 2 of them concern McGraw-Hill. It would be interesting to know whether these companies follow both parts of Simon Cowell’s dictum of ‘Create the hype, but don’t ever believe it’.

[Google Trends graph: interest in ‘adaptive learning’ over time]

A look at the Hype Cycle of the IT research and advisory firm Gartner (see here for Wikipedia’s entry on the topic, and for criticism of the hype of Hype Cycles) indicates that both big data and adaptive learning have now slid into the ‘trough of disillusionment’, which means that the market has started to mature, becoming more realistic about how useful the technologies can be for organizations.

A few years ago, the Gates Foundation, one of the leading cheerleaders and financial promoters of adaptive learning, launched its Adaptive Learning Market Acceleration Program (ALMAP) to ‘advance evidence-based understanding of how adaptive learning technologies could improve opportunities for low-income adults to learn and to complete postsecondary credentials’. It’s striking that the program’s aims referred to how such technologies could lead to learning gains, not whether they would. Now, though, with the publication of a report commissioned by the Gates Foundation to analyze the data coming out of the ALMAP Program, things are looking less rosy. The report is inconclusive. There is no firm evidence that adaptive learning systems are leading to better course grades or course completion. ‘The ultimate goal – better student outcomes at lower cost – remains elusive’, the report concludes. Rahim Rajan, a senior program officer for Gates, is clear: ‘There is no magical silver bullet here.’

The same conclusion is being reached elsewhere. A report for the National Education Policy Center (in Boulder, Colorado) concludes: ‘Personalized Instruction, in all its many forms, does not seem to be the transformational technology that is needed, however. After more than 30 years, Personalized Instruction is still producing incremental change. The outcomes of large-scale studies and meta-analyses, to the extent they tell us anything useful at all, show mixed results ranging from modest impacts to no impact. Additionally, one must remember that the modest impacts we see in these meta-analyses are coming from blended instruction, which raises the cost of education rather than reducing it’ (Enyedy, 2014: 15 – see reference at the foot of this post). In the same vein, a recent academic study by Meg Coffin Murray and Jorge Pérez (2015, ‘Informing and Performing: A Study Comparing Adaptive Learning to Traditional Learning’) found that ‘adaptive learning systems have negligible impact on learning outcomes’.

In the latest educational technology plan from the U.S. Department of Education (‘Future Ready Learning: Reimagining the Role of Technology in Education’, 2016) the only mentions of the word ‘adaptive’ are in the context of testing. And the latest OECD report on ‘Students, Computers and Learning: Making the Connection’ (2015), finds, more generally, that information and communication technologies, when they are used in the classroom, have, at best, a mixed impact on student performance.

There is, however, too much money at stake for the earlier hype to disappear completely. Sponsored cheerleading for adaptive systems continues to find its way into blogs and national magazines and newspapers. EdSurge, for example, recently published a report called ‘Decoding Adaptive’ (2016), sponsored by Pearson, that continues to wave the flag. Enthusiastic anecdotes take the place of evidence, but, for all that, it’s a useful read.

In the world of ELT, there are plenty of sales people who want new products which they can call ‘adaptive’ (and gamified, too, please). But it’s striking that three years after I started following the hype, such products are rather thin on the ground. Pearson was the first of the big names in ELT to do a deal with Knewton, and invested heavily in the company. Their relationship remains close. But, to the best of my knowledge, the only truly adaptive ELT product that Pearson offers is the PTE test.

Macmillan signed a contract with Knewton in May 2013 ‘to provide personalized grammar and vocabulary lessons, exam reviews, and supplementary materials for each student’. In December of that year, they talked up their new ‘big tree online learning platform’: ‘Look out for the Big Tree logo over the coming year for more information as to how we are using our partnership with Knewton to move forward in the Language Learning division and create content that is tailored to students’ needs and reactive to their progress.’ I’ve been looking out, but it’s all gone rather quiet on the adaptive / platform front.

In September 2013, it was the turn of Cambridge to sign a deal with Knewton ‘to create personalized learning experiences in its industry-leading ELT digital products for students worldwide’. This year saw the launch of a major new CUP series, ‘Empower’. It has an online workbook with personalized extra practice, but there’s nothing (yet) that anyone would call adaptive. More recently, Cambridge has launched the online version of the 2nd edition of Touchstone. Nothing adaptive there, either.

Earlier this year, Cambridge published The Cambridge Guide to Blended Learning for Language Teaching, edited by Mike McCarthy. It contains a chapter by M.O.Z. San Pedro and R. Baker on ‘Adaptive Learning’. It’s an enthusiastic account of the potential of adaptive learning, but it doesn’t contain a single reference to language learning or ELT!

So, what’s going on? Skepticism is becoming the order of the day. The early hype of people like Knewton’s Jose Ferreira is now understood for what it was. Companies like Macmillan got their fingers badly burnt when they barked up the wrong tree with their ‘Big Tree’ platform.

Noel Enyedy captures a more contemporary understanding when he writes: ‘Personalized Instruction is based on the metaphor of personal desktop computers – the technology of the 80s and 90s. Today’s technology is not just personal but mobile, social, and networked. The flexibility and social nature of how technology infuses other aspects of our lives is not captured by the model of Personalized Instruction, which focuses on the isolated individual’s personal path to a fixed end-point. To truly harness the power of modern technology, we need a new vision for educational technology’ (Enyedy, 2014: 16).

Adaptive solutions aren’t going away, but there is now a much better understanding of what sorts of problems might have adaptive solutions. Testing is certainly one. As the educational technology plan from the U.S. Department of Education (‘Future Ready Learning: Re-imagining the Role of Technology in Education’, 2016) puts it: ‘Computer adaptive testing, which uses algorithms to adjust the difficulty of questions throughout an assessment on the basis of a student’s responses, has facilitated the ability of assessments to estimate accurately what students know and can do across the curriculum in a shorter testing session than would otherwise be necessary.’ In ELT, Pearson and EF have adaptive tests that have been well researched and designed.
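
What does ‘adjusting the difficulty of questions on the basis of a student’s responses’ look like in practice? Here is a toy sketch in Python of the simplest possible approach, an up/down staircase. Real engines (including, presumably, those behind the Pearson and EF tests) use item response theory and large calibrated item banks; the item_bank and answer_fn below are just stand-ins for illustration.

```python
import random

def run_adaptive_test(item_bank, answer_fn, num_items=10):
    """Toy staircase CAT: move up a difficulty level after a correct answer,
    down after a mistake. item_bank maps difficulty level -> list of items;
    answer_fn(item) returns True/False for the test-taker's response."""
    levels = sorted(item_bank)
    level = levels[len(levels) // 2]          # start in the middle of the range
    administered = []
    for _ in range(num_items):
        item = random.choice(item_bank[level])
        correct = answer_fn(item)
        administered.append((item, level, correct))
        idx = levels.index(level)
        level = levels[min(idx + 1, len(levels) - 1)] if correct else levels[max(idx - 1, 0)]
    # crude ability estimate: mean difficulty of the items answered correctly
    passed = [lvl for _, lvl, ok in administered if ok]
    return sum(passed) / len(passed) if passed else levels[0]
```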

Vocabulary apps which deploy adaptive technology continue to become more sophisticated, although empirical research is lacking. Automated writing tutors with adaptive corrective feedback are also developing fast, and I’ll be writing a post about these soon. Similarly, as speech recognition software improves, we can expect to see better and better automated adaptive pronunciation tutors. But going beyond such applications, there are bigger questions to ask, and answers to these will impact on whatever direction adaptive technologies take. Large platforms (LMSs), with or without adaptive software, are already beginning to look rather dated. Will they be replaced by integrated apps, or are apps themselves going to be replaced by bots (currently riding high in the Hype Cycle)? In language learning and teaching, the future of bots is likely to be shaped by developments in natural language processing (another topic about which I’ll be blogging soon). Nobody really has a clue where the next two and a half years will take us (if anywhere), but it’s becoming increasingly likely that adaptive learning will be only one very small part of it.

 

Enyedy, N. 2014. Personalized Instruction: New Interest, Old Rhetoric, Limited Results, and the Need for a New Direction for Computer-Mediated Learning. Boulder, CO: National Education Policy Center. Retrieved 17.07.16 from http://nepc.colorado.edu/publication/personalized-instruction

I have been putting in a lot of time studying German vocabulary with Memrise lately, but this is not a review of the Memrise app. For that, I recommend you read Marek Kiczkowiak’s second post on this app. Like me, he’s largely positive, although I am less enthusiastic about Memrise’s USP, the use of mnemonics. It’s not that mnemonics don’t work – there’s a lot of evidence that they do: it’s just that there is little or no evidence that they’re worth the investment of time.

Time … as I say, I have been putting in the hours: every day, for over a month, averaging a couple of hours a day. It’s enough to get me very near the top of the leader board (which I keep a very close eye on), and it means that I am doing more work than 99% of other users. And, yes, my German is improving.

Putting in the time is the sine qua non of any language learning and a well-designed app must motivate users to do this. Relevant content will be crucial, as will satisfactory design, both visual and interactive. But here I’d like to focus on the two other key elements: task design / variety and gamification.

Memrise offers a limited range of task types: presentation cards (a word, phrase or sentence, with translation and audio recording), multiple choice (target item with four choices), unscrambling letters or words, and dictation (see below).


As Marek writes, it does get a bit repetitive after a while (although less so than thumbing through a pack of cardboard flashcards). The real problem, though, is that there are only so many things an app designer can do with standard flashcards, if they are to contribute to learning. True, there could be a few more game-like tasks (as with Quizlet), races against the clock as you pop word balloons or something of the sort, but, while these might, just might, help with motivation, these games rarely, if ever, contribute much to learning.

What’s more, you’ll get fed up with the games sooner or later if you’re putting in serious study hours. Even if Memrise were to double the number of activity types, I’d have got bored with them by now, in the same way I got bored with the Quizlet games. Bear in mind, too, that I’ve only done a month: I have at least another two months to go before I finish the level I’m working on. There’s another issue with ‘fun’ activities / games which I’ll come on to later.

The options for task variety in vocabulary / memory apps are therefore limited. Let’s look at gamification. Memrise has leader boards (weekly, monthly, ‘all time’), streak badges, daily goals, email reminders and (in the laptop and premium versions) a variety of graphs that allow you to analyse your study patterns. Your degree of mastery of each learning item is represented by a plant that grows leaves, flowers and eventually withers. None of this is especially original or different from similar apps.

The trouble with all of this is that it can only work for a certain time and, for some people, never. There’s always going to be someone like me who can put in a couple of hours a day more than you can. Or someone, in my case, like ‘Nguyenduyha’, who must be doing about four hours a day, and who, I know, is out of my league. I can’t compete and the realisation slowly dawns that my life would be immeasurably sadder if I tried to.

Having said that, I have tried to compete and the way to do so is by putting in the time on the ‘speed review’. This is the closest that Memrise comes to a game. One hundred items are flashed up, each with four multiple-choice options, against the clock. The quicker you are, the more points you get, and if you’re too slow, or you make a mistake, you lose a life. That’s how you gain lots of points with Memrise. The problem is that, at best, this task only promotes receptive knowledge of the items, which is not what I need by this stage. At worst, it serves no useful learning function at all because I have learnt ways of doing this well which do not really involve me processing meaning at all. As Marek says in his post (in reference to Quizlet), ‘I had the feeling that sometimes I was paying more attention to ‘winning’ the game and scoring points, rather than to the words on the screen.’ In my case, it is not just a feeling: it’s an absolute certainty.


Sadly, the gamification is working against me. The more time I spend on the U-Bahn doing Memrise, the less time I spend reading the free German-language newspapers, and the less time I spend eavesdropping on conversations. Two hours a day is all I have for my German study, and Memrise is eating it all up. I know that there are other, and better, ways of learning. In order to do what I know I should be doing, I need to ignore the gamification. For those more reasonable students who can regularly do their fifteen minutes a day, day in, day out, the points and leader boards serve no real function at all.

Cheating at gamification, or gaming the system, is common in app-land. A few years ago, Memrise had to take down their leader board when they realised that cheating was taking place. There’s an inexorable logic to this: gamification is an attempt to motivate by rewarding through points, rather than the reward coming from the learning experience. The logic of the game overtakes itself. Is ‘Nguyenduyha’ cheating, or do they simply have nothing else to do all day? Am I cheating by finding time to do pointless ‘speed reviews’ that earn me lots of points?

For users like myself, then, gamification design needs to be a delicate balancing act. For others, it may be largely an irrelevance. I’ve been working recently on a general model of vocabulary app design that looks at two very different kinds of user. On the one hand, there are the self-motivated learners like myself or the millions of others who have chosen to use self-study apps. On the other, there are the millions of students in schools and colleges, studying English among other subjects, some of whom are now being told to use the vocabulary apps that are beginning to appear packaged with their coursebooks (or other learning material). We’ve never found entirely satisfactory ways of making these students do their homework, and the fact that this homework is now digital will change nothing (except, perhaps, in the very, very short term). The incorporation of games and gamification is unlikely to change much either: there will always be something more interesting and motivating (and unconnected with language learning) elsewhere.

Teachers and college principals may like the idea of gamification (without having really experienced it themselves) for their students. But more important for most of them is likely to be the teacher dashboard: the means by which they can check that their students are putting the time in. Likewise, they will see the utility of automated email reminders that a student is not working hard enough to meet their learning objectives, more and more regular tests that contribute to overall course evaluation, comparisons with college, regional or national benchmarks. Technology won’t solve the motivation issue, but it does offer efficient means of control.

I call Lern Deutsch a vocabulary app, although it’s more of a game than anything else. Developed by the Goethe Institute, the free app was probably designed primarily as a marketing tool rather than a serious attempt to develop an educational language app. It’s available for speakers of Arabic, English, Spanish, Italian, French, Portuguese and Russian. It’s aimed at A1 learners.

Users of the app create an avatar and roam around a virtual city, learning new vocabulary and practising situational language. They can interact in language challenges with other players. As they explore, they earn Goethe coins, collect accessories for their avatars and progress up a leader board.

As they explore the virtual city, populated by other avatars, they find objects that can be clicked on to add to their vocabulary list. They hear a recording of an example sentence containing the target word, with the word gapped and three multiple choice possibilities. They are then required to type the missing word (see the image below). After collecting a certain number of words, they complete exercises which include the following task types:

  • Jumbled sentences
  • Audio recording of individual words and multiple choice selection
  • Gapped sentences with multiple choice answers
  • Dictation
  • Example sentences containing target item and multiple choice pictures
  • Typing sentences which are buried in a string of random letters

The developers have focused their attention on providing variety: engagement and ‘fun’ override other considerations. But how does the app stand up as a language learning tool? Surprisingly, for something developed by the Goethe Institute, it’s less than impressive.

The words that you collect as you navigate the virtual city are all nouns (Hotel, Auto, Mann, Banane, etc.), but some (e.g. Sehenswürdigkeit) seem out of level. Any app that uses illustrations as the basic means of conveying meaning runs into problems when it moves away from concrete nouns, but a diet of nouns only (as here) is necessarily of limited value. Other parts of speech are introduced via the example sentences, but no help with meaning is provided, so when you come across the word for ‘egg’, for example, your example sentence is ‘Ich möchte das Frühstück mit Ei.’ It’s all very well embedding the target vocabulary in example sentences that have a functional value, but example sentences are only of value if they are understandable: the app badly needs a look-up function for the surrounding language.

The practice exercises are varied, too, but they also vary in their level of difficulty. It makes sense to do receptive / recognition tasks before productive ones, but there is no evidence that I could see of pedagogical considerations of this kind. Neither does there seem to be any spaced repetition at work: the app is driven by the needs of the game design rather than any learning principles.

It’s unclear to me who the app is for. The functional language that is presented is adult: the situations are adult situations (buying a bed, booking a hotel room, ordering a beer). However, the graphic design and the gamification features are juvenile (adding a pirate patch to your avatar, for example).

The lack of attention to the business of learning is especially striking in the English of the English-language version that I used. The examples of dodgy English that I came across do not inspire confidence.

  • Quite alright! You win your first Goethe coin.
  • What sightseeings do you spot in the city center and the train station?
  • Have a picknick in the park. You now have a picnic in the park with the musician.
  • You still search for your teacher. Whom do you meet in the park? What do they work?

 

All in all, it’s an interesting example of a gamified approach to language, and other app developers may find ideas here that they could do something with. It’s of less interest, though, to anyone who wants to learn a bit of German.

Having spent a lot of time recently looking at vocabulary apps, I decided to put together a Christmas wish list of the features of my ideal vocabulary app. The list is not exhaustive and I’ve given more attention to some features than others. What (apart from testing) have I missed out?

1             Spaced repetition

Since the point of a vocabulary app is to help learners memorise vocabulary items, it is hard to imagine a decent system that does not incorporate spaced repetition. Spaced repetition algorithms offer one well-researched way of counteracting the brain’s ‘forgetting curve’. These algorithms come in different shapes and sizes, and I am not technically competent to judge which is the most efficient. However, as Peter Ellis Jones, the developer of a flashcard system called CardFlash, points out, efficiency is only one half of the rote memorisation problem. If you are not motivated to learn, the cleverness of the algorithm is moot. Fundamentally, learning software needs to be fun, rewarding, and give a solid sense of progression.
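
To make ‘spaced repetition algorithm’ a little more concrete, here is a minimal Python sketch in the style of the well-documented SM-2 family (the approach Anki is built on). It is a simplification for illustration, not the code of any particular app: the learner’s self-rated recall quality nudges an ‘ease factor’ up or down, which in turn stretches or resets the interval before the next review.

```python
def next_interval(quality, interval, ease=2.5):
    """Simplified SM-2-style update. quality: self-rated recall from 0 (blackout)
    to 5 (perfect). interval: current interval in days (0 for a brand-new item).
    Returns (new_interval_in_days, new_ease_factor)."""
    if quality < 3:
        return 1, ease                      # failed recall: see the item again tomorrow
    # adjust the ease factor up or down, never below 1.3
    ease = max(1.3, ease + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    if interval == 0:
        return 1, ease                      # first successful review
    if interval == 1:
        return 6, ease                      # second successful review
    return round(interval * ease), ease     # thereafter, stretch the interval

# e.g. an item reviewed after 6 days and rated 4 comes back in about 15 days
print(next_interval(4, 6))                  # (15, 2.5)
```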

2             Quantity, balance and timing of new and ‘old’ items

A spaced repetition algorithm determines the optimum interval between repetitions, but further algorithms will be needed to determine when and with what frequency new items will be added to the deck. Once a system knows how many items a learner needs to learn and the time in which they have to do it, it is possible to determine the timing and frequency of the presentation of new items. But the system cannot know in advance how well an individual learner will learn the items (for any individual, some items will be more readily learnable than others) nor the extent to which learners will live up to their own positive expectations of time spent on-app. As most users of flashcard systems know, it is easy to fall behind, feel swamped and, ultimately, give up. An intelligent system needs to be able to respond to individual variables in order to ensure that the learning load is realistic.
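
One crude way of responding to those variables is to throttle the introduction of new items according to how many repetitions are already due, so that a learner who has fallen behind consolidates before taking on more. Below is a minimal sketch, with invented figures for daily_budget and max_new; a real system would estimate these from the learner’s actual behaviour.

```python
def new_items_today(due_reviews, daily_budget=50, max_new=20):
    """Introduce new items only when today's review load leaves spare capacity.
    due_reviews: repetitions already scheduled for today.
    daily_budget: how many card presentations this learner realistically does per day.
    max_new: hard cap so that no single day is flooded with new material."""
    spare_capacity = max(0, daily_budget - due_reviews)
    return min(max_new, spare_capacity)

print(new_items_today(due_reviews=45))   # 5 new items on a heavy review day
print(new_items_today(due_reviews=10))   # 20 new items on a light one
```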

3             Task variety

A standard flashcard system which simply asks learners to indicate whether they ‘know’ a target item before they flip over the card rapidly becomes extremely boring. A system which tests this knowledge soon becomes equally dull. There needs to be a variety of ways in which learners interact with an app, both for reasons of motivation and learning efficiency. It may be the case that, for an individual user, certain task types lead to more rapid gains in learning. An intelligent, adaptive system should be able to capture this information and modify the selection of task types.

Most younger learners and some adult learners will respond well to the inclusion of games within the range of task types. Examples of such games include the puzzles developed by Oliver Rose in his Phrase Maze app to accompany Quizlet practice.

4             Generative use

Memory researchers have long known about the ‘Generation Effect’ (see for example this piece of research from the Journal of Verbal Learning and Verbal Behavior, 1978). Items are better learnt when the learner has to generate, in some (even small) way, the target item, rather than simply reading it. In vocabulary learning, this could be, for example, typing in the target word or, more simply, inserting some missing letters. Systems which incorporate task types that require generative use are likely to result in greater learning gains than simple, static flashcards with target items on one side and definitions or translations on the other.
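
Prompts of this kind are trivial to generate automatically. The sketch below blanks out roughly half the letters of a target word, always keeping the first letter; the proportion is an arbitrary choice for illustration, not a research-based one.

```python
import random

def gap_prompt(word, proportion=0.5):
    """Blank out some letters of a target word so the learner has to generate
    the missing ones, e.g. 'remember' -> 'r_m_mb_r'. The first letter is kept."""
    letters = list(word)
    candidates = list(range(1, len(letters)))
    k = min(len(candidates), max(1, int(len(candidates) * proportion)))
    for i in random.sample(candidates, k):
        letters[i] = '_'
    return ''.join(letters)

print(gap_prompt('remember'))   # e.g. 'r_m_mbe_'
```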

5             Receptive and productive practice

The most basic digital flashcard systems require learners to understand a target item, or to generate it from a definition or translation prompt. Valuable as this may be, it won’t help learners much to use these items productively, since these systems focus exclusively on meaning. In order to do this, information must be provided about collocation, colligation, register, etc and these aspects of word knowledge will need to be focused on within the range of task types. At the same time, most vocabulary apps that I have seen focus primarily on the written word. Although any good system will offer an audio recording of the target item, and many will offer the learner the option of recording themselves, learners are invariably asked to type in their answers, rather than say them. For the latter, speech recognition technology will be needed. Ideally, too, an intelligent system will compare learner recordings with the audio models and provide feedback in such a way that the learner is guided towards a closer reproduction of the model.

6             Scaffolding and feedback

Most flashcard systems are basically low-stakes, practice self-testing. Research (see, for example, Dunlosky et al’s metastudy ‘Improving Students’ Learning With Effective Learning Techniques: Promising Directions From Cognitive and Educational Psychology’) suggests that, as a learning strategy, practice testing has high utility – indeed, of higher utility than other strategies like keyword mnemonics or highlighting. However, an element of tutoring is likely to enhance practice testing, and, for this, scaffolding and feedback will be needed. If, for example, a learner is unable to produce a correct answer, they will probably benefit from being guided towards it through hints, in the same way as a teacher would elicit in a classroom. Likewise, feedback on why an answer is wrong (as opposed to simply being told that you are wrong), followed by encouragement to try again, is likely to enhance learning. Such feedback might, for example, point out that there is perhaps a spelling problem in the learner’s attempted answer, that the attempted answer is in the wrong part of speech, or that it is semantically close to the correct answer but does not collocate with other words in the text. The incorporation of intelligent feedback of this kind will require a number of NLP tools, since it will never be possible for a human item-writer to anticipate all the possible incorrect answers. A current example of intelligent feedback of this kind can be found in the Oxford English Vocabulary Trainer app.
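
As a very rough illustration of what machine-generated scaffolding might look like, the sketch below produces a sequence of hints of increasing strength rather than revealing the answer outright. It is a crude stand-in for the kind of elicitation a teacher does, and nothing like as sophisticated as the NLP-driven feedback described above.

```python
def hint_sequence(answer):
    """Scaffolded hints, from weakest to strongest: first the length of the word,
    then progressively more of its initial letters."""
    hints = [f"The word has {len(answer)} letters."]
    for n in range(1, len(answer)):
        hints.append(f"It starts with '{answer[:n]}'.")
    return hints

for hint in hint_sequence("collocation")[:3]:
    print(hint)
# The word has 11 letters.
# It starts with 'c'.
# It starts with 'co'.
```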

7             Content

At the very least, a decent vocabulary app will need good definitions and translations (how many different languages?), and these will need to be tagged to the senses of the target items. These will need to be supplemented with all the other information that you find in a good learner’s dictionary: syntactic patterns, collocations, cognates, an indication of frequency, etc. The only way of getting this kind of high-quality content is by paying to license it from a company with expertise in lexicography. It doesn’t come cheap.

There will also need to be example sentences, both to illustrate meaning / use and for deployment in tasks. Dictionary databases can provide some of these, but they cannot be relied on as a source. This is because the example sentences in dictionaries have been selected and edited to accompany the other information provided in the dictionary, and not as items in practice exercises, which have rather different requirements. Once more, the solution doesn’t come cheap: experienced item writers will be needed.

Dictionaries describe and illustrate how words are typically used. But examples of typical usage tend to be as dull as they are forgettable. Learning is likely to be enhanced if examples are cognitively salient: weird examples with odd collocations, for example. Another thing for the item writers to think about.

A further challenge for an app which is not level-specific is that both the definitions and example sentences need to be level-specific. An A1 / A2 learner will need the kind of content that is found in, say, the Oxford Essential dictionary; B2 learners and above will need content from, say, the OALD.

8             Artwork and design

It’s easy enough to find artwork or photos of concrete nouns, but try to find or commission a pair of pictures that differentiate, for example, the adjectives ‘wild’ and ‘dangerous’ … What kind of pictures might illustrate simple verbs like ‘learn’ or ‘remember’? Will such illustrations be clear enough when squeezed into a part of a phone screen? Animations or very short video clips might provide a solution in some cases, but these are more expensive to produce and video files are much heavier.

With a few notable exceptions, such as the British Council’s MyWordBook 2, design in vocabulary apps has been largely forgotten.

9             Importable and personalisable lists

Many learners will want to use a vocabulary app in association with other course material (e.g. the word lists that accompany their coursebooks). Teachers, however, will inevitably want to edit these lists, deleting some items, adding others. Learners will want to do the same. This is a huge headache for app designers. If new items are going to be added to word lists, how will the definitions, example sentences and illustrations be generated? Will the database contain audio recordings of these words? How will these items be added to the practice tasks (if these include task types that go beyond simple double-sided flashcards)? NLP tools are not yet good enough to trawl a large corpus in order to select (and possibly edit) sentences that illustrate the right meaning and which are appropriate for interactive practice exercises. We can personalise the speed of learning and even the types of learning tasks, so long as the target language is predetermined. But as soon as we allow for personalisation of content, we run into difficulties.

10          Gamification

Maintaining motivation to use a vocabulary app is not easy. Gamification may help. Measuring progress against objectives will be a start. Stars and badges and leaderboards may help some users. Rewards may help others. But gamification features need to be built into the heart of the system, into the design and selection of tasks, rather than simply tacked on as an afterthought. They need to be trialled and tweaked, so analytics will be needed.

11          Teacher support

Although the use of vocabulary flashcards is beginning to catch on with English language teachers, teachers need help with ways to incorporate them in the work they do with their students. What can teachers do in class to encourage use of the app? In what ways does app use require teachers to change their approach to vocabulary work in the classroom? Reporting functions can help teachers know about the progress their students are making and provide very detailed information about words that are causing problems. But, as anyone involved in platform-based course materials knows, teachers need a lot of help.

12          And, of course, …

Apps need to be usable with different operating systems. Ideally, they should be (partially) usable offline. Loading times need to be short. They need to be easy and intuitive to use.

It’s unlikely that I’ll be seeing a vocabulary app with all of these features any time soon. Or, possibly, ever. The cost of developing something that could do all this would be extremely high, and there is no indication that there is a market that would be ready to pay the sort of prices that would be needed to cover the costs of development and turn a profit. We need to bear in mind, too, the fact that vocabulary apps can only ever assist in the initial acquisition of vocabulary: apps alone can’t solve the vocabulary learning problem (despite the silly claims of some app developers). The need for meaningful communicative use, extensive reading and listening, will not go away because a learner has been using an app. So, how far can we go in developing better and better vocabulary apps before users decide that a cheap / free app, with all its shortcomings, is actually good enough?

I posted a follow up to this post in October 2016.

MosaLingua (with the obligatory capital letter in the middle) is a vocabulary app, available for iOS and Android. There are packages for a number of languages, and the English variations include general English, business English, vocabulary for TOEFL and vocabulary for TOEIC. The company follows the freemium model, with free ‘Lite’ versions and fuller content selling for €4.99. I tried the ‘Lite’ general English app, opting for French as my first language. Since the app is translation-based, you need to have one of the language pairings that are on offer (the other languages are currently Italian, Spanish, Portuguese and German).

The app I looked at is basically a phrase book with spaced repetition. Even though this particular app was general English, it appeared to be geared towards the casual business traveller. It uses the same algorithm as Anki, and users are taken through a sequence of (1) listening to an audio recording of the target item (word or phrase) along with the possibility of comparing a recording of yourself with the recording provided, (2) standard bilingual flashcard practice, (3) a practice stage where you are given the word or phrase in your own language and you have to unscramble words or letters to form the equivalent in English, and (4) a self-evaluation stage where users select from one of four options (“review”, “hard”, “good”, “perfect”) where the choice made will influence the re-presentation of the item within the spaced repetition.

In addition to these words and phrases, there are a number of dialogues where you (1) listen to the dialogue (‘without worrying about understanding everything’), (2) are re-exposed to the dialogue with English subtitles, (3) see it again with subtitles in your own language, (4) practise it with standard flashcards.

The developers seem to be proud of their Mosa Learning Method®: they’ve registered this as a trademark. At its heart is spaced repetition. This is supplemented by what they refer to as ‘Active Recall’, the notion that things are better memorised if the learner has to make some sort of cognitive effort, however minimal, in recalling the target items. The principle is, at least to me, unquestionable, but the realisation (unjumbling words or letters) becomes rather repetitive and, ultimately, tedious. Then, there is what they call ‘metacognition’. Again, this is informed by research, even if the realisation (self-evaluation of learning difficulty into four levels) is extremely limited. Then there is the Pareto principle – the 80-20 rule. I couldn’t understand the explanation of what this has to do with the trademarked method. Here’s the MosaLingua explanation – figure it out for yourself:

Did you know that the 100 most common words in English account for half of the written corpus?

Evidently, you shouldn’t quit after learning only 100 words. Instead, you should concentrate on the most frequently used words and you’ll make spectacular progress. What’s more, globish (global English) has shown that it’s possible to express yourself using only 1500 well-chosen words (which would take less than 3 months with only 10 minutes per day with MosaLingua). Once you’ve acquired this base, MosaLingua proposes specialized vocabulary suited to your needs (the application has over 3000 words).

Finally, there’s some stuff about motivation and learner psychology. This boils down to: ‘That’s why we offer free learning help via email, presenting the Web’s best resources, as well as tips through bonus material or the learning community on the MosaLingua blog. We’ll give you all the tools you need to develop your own personalized learning method that is adapted to your needs.’ Some of these tips are not at all bad, but there’s precious little in the way of gamification or other forms of easy motivation.

In short, it’s all reasonably respectable, despite the predilection for sciency language in the marketing blurb. But what really differentiates this product from Anki, as the founder, Samuel Michelot, points out, is the content. MosaLingua has lists of vocabulary and phrases that were created by professors. The word ‘professors’ set my alarm bells ringing, and I wasn’t overly reassured when all I could find out about these ‘professors’ was the information about the MosaLingua team.

Despite what some people claim, content is, actually, rather important when it comes to language learning. I’ll leave you with some examples of MosaLingua content (one dialogue and a selection of words / phrases organised by level) and you can make up your own mind.

Dialogue

Hi there, have a seat. What seems to be the problem?

I haven’t been feeling well since this morning. I have a very bad headache and I feel sick.

Do you feel tired? Have you had cold sweats?

Yes, I’m very tired and have had cold sweats. I have been feeling like that since this morning.

Have you been out in the sun?

Yes, this morning I was at the beach with my friends for a couple hours.

OK, it’s nothing serious. It’s just a bad case of sunstroke. You must drink lots of water and rest. I’ll prescribe you something for the headache and some after sun lotion.

Great, thank you, doctor. Bye.

You’re welcome. Bye.

Level 1: could you help me, I would like a …, I need to …, I don’t know, it’s okay, I (don’t) agree, do you speak English, to drink, to sleep, bank, I’m going to call the police

Level 2: I’m French, cheers, can you please repeat that, excuse me how can I get to …, map, turn left, corner, far (from), distance, thief, can you tell me where I can find …

Level 3: what does … mean, I’m learning English, excuse my English, famous, there, here, until, block, from, to turn, street corner, bar, nightclub, I have to be at the airport tomorrow morning

Level 4: OK, I’m thirty (years old), I love this country, how do you say …, what is it, it’s a bit like …, it’s a sort of …, it’s as small / big as …, is it far, where are we, where are we going, welcome, thanks but I can’t, how long have you been here, is this your first trip to England, take care, district / neighbourhood, in front (of)

Level 5: of course, can I ask you a question, you speak very well, I can’t find the way, David this is Julia, we meet at last, I would love to, where do you want to go, maybe another day, I’ll miss you, leave me alone, don’t touch me, what’s you email

Level 6: I’m here on a business trip, I came with some friends, where are the nightclubs, I feel like going to a bar, I can pick you up at your house, let’s go to see a movie, we had a lot of fun, come again, thanks for the invitation

Adaptive learning providers make much of their ability to provide learners with personalised feedback and to provide teachers with dashboard feedback on the performance of both individuals and groups. All well and good, but my interest here is in the automated feedback that software could provide on very specific learning tasks. Scott Thornbury, in a recent talk, ‘Ed Tech: The Mouse that Roared?’, listed six ‘problems’ of language acquisition that educational technology for language learning needs to address. One of these he framed as follows: ‘The feedback problem, i.e. how does the learner get optimal feedback at the point of need?’, and suggested that technological applications ‘have some way to go.’ He was referring, not to the kind of feedback that dashboards can provide, but to the kind of feedback that characterises a good language teacher: corrective feedback (CF) – the way that teachers respond to learner utterances (typically those containing errors, but not necessarily restricted to these) in what Ellis and Shintani call ‘form-focused episodes’[1]. These responses may include a direct indication that there is an error, a reformulation, a request for repetition, a request for clarification, an echo with questioning intonation, etc. Basically, they are correction techniques.

These days, there isn’t really any debate about the value of CF. There is a clear research consensus that it can aid language acquisition. Discussing learning in more general terms, Hattie[2] claims that ‘the most powerful single influence enhancing achievement is feedback’. The debate now centres around the kind of feedback, and when it is given. Interestingly, evidence[3] has been found that CF is more effective in the learning of discrete items (e.g. some grammatical structures) than in communicative activities. Since it is precisely this kind of approach to language learning that we are more likely to find in adaptive learning programs, it is worth exploring further.

What do we know about CF in the learning of discrete items? First of all, it works better when it is explicit than when it is implicit (Li, 2010), although this needs to be nuanced. In immediate post-tests, explicit CF is better than implicit variations. But over a longer period of time, implicit CF provides better results. Secondly, formative feedback (as opposed to right / wrong testing-style feedback) strengthens retention of the learning items: this typically involves the learner repairing their error, rather than simply noticing that an error has been made. This is part of what cognitive scientists[4] sometimes describe as the ‘generation effect’. Whilst learners may benefit from formative feedback without repairing their errors, Ellis and Shintani (2014: 273) argue that the repair may result in ‘deeper processing’ and, therefore, assist learning. Thirdly, there is evidence that some delay in receiving feedback aids subsequent recall, especially over the longer term. Ellis and Shintani (2014: 276) suggest that immediate CF may ‘benefit the development of learners’ procedural knowledge’, while delayed CF is ‘perhaps more likely to foster metalinguistic understanding’. You can read a useful summary of a meta-analysis of feedback effects in online learning here, or you can buy the whole article here.

I have yet to see an online language learning program which can do CF well, but I think it’s a matter of time before things improve significantly. First of all, at the moment, feedback is usually immediate, or almost immediate. This is unlikely to change, for a number of reasons – foremost among them being the pride that ed tech takes in providing immediate feedback, and the fact that online learning is increasingly being conceptualised and consumed in bite-sized chunks, something you do on your phone between doing other things. What will change in better programs, however, is that feedback will become more formative. As things stand, tasks are usually of a very closed variety, with drag-and-drop being one of the most popular. Only one answer is possible and feedback is usually of the right / wrong-and-here’s-the-correct-answer kind. But tasks of this kind are limited in their value, and, at some point, tasks are needed where more than one answer is possible.

Here’s an example of a translation task from Duolingo, where a simple sentence could be translated into English in quite a large number of ways.

Decontextualised as it is, the sentence could be translated in the way that I have done it, although it’s unlikely. The feedback, however, is of relatively little help to the learner, who would benefit from guidance of some sort. The simple reason that Duolingo doesn’t offer useful feedback is that the programme is static. It has been programmed to accept certain answers (e.g. in this case both the present simple and the present continuous are acceptable), but everything else will be rejected. Why? Because it would take too long and cost too much to anticipate and enter in all the possible answers. Why doesn’t it offer formative feedback? Because in order to do so, it would need to identify the kind of error that has been made. If we can identify the kind of error, we can make a reasonable guess about the cause of the error, and select appropriate CF … this is what good teachers do all the time.

Analysing the kind of error that has been made is the first step in providing appropriate CF, and it can be done, with increasing accuracy, by current technology, but it requires a lot of computing. Let’s take spelling as a simple place to start. If you enter ‘I am makeing a basket for my mother’ in the Duolingo translation above, the program tells you ‘Nice try … there’s a typo in your answer’. Given the configuration of keyboards, it is highly unlikely that this is a typo. It’s a simple spelling mistake and teachers recognise it as such because they see it so often. For software to achieve the same insight, it would need, as a start, to trawl a large English dictionary database and a large tagged database of learner English. The process is quite complicated, but it’s perfectly do-able, and learners could be provided with CF in the form of a ‘spelling hint’.
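
Here, for illustration only, is a minimal sketch of that first step. The tiny DICTIONARY is a stand-in for the large lexical database a real system would license, and the similarity cutoff is an arbitrary choice; the logic simply says that an unknown word which is very close to a known one is probably a misspelling rather than a typo or a completely unknown item.

```python
import difflib

# Stand-in for a full dictionary database
DICTIONARY = {"make", "makes", "making", "made", "basket", "mother"}

def spelling_hint(word, vocabulary=DICTIONARY, cutoff=0.8):
    """Return a 'spelling hint' if the word is close to a known word."""
    if word.lower() in vocabulary:
        return None                                  # no problem to report
    close = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    if close:
        return f"'{word}' looks like a misspelling of '{close[0]}'."
    return f"'{word}' isn't in the dictionary."

print(spelling_hint("makeing"))   # 'makeing' looks like a misspelling of 'making'.
```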

Rather more difficult is the error illustrated in my first screen shot. What’s the cause of this ‘error’? Teachers know immediately that this is probably a classic confusion of ‘do’ and ‘make’. They know that the French verb ‘faire’ can be translated into English as ‘make’ or ‘do’ (among other possibilities), and the error is a common language transfer problem. Software could do the same thing. It would need a large corpus (to establish that ‘make’ collocates with ‘a basket’ more often than ‘do’), a good bilingualised dictionary (plenty of these now exist), and a tagged database of learner English. Again, appropriate automated feedback could be provided in the form of some sort of indication that ‘faire’ is only sometimes translated as ‘make’.
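
Again, a toy sketch of the logic, with invented collocation counts standing in for what a real corpus query would return: pick whichever verb co-occurs most often with the noun, and trigger transfer-aware feedback when the learner’s choice differs.

```python
# Hypothetical frequencies of the kind a corpus query would return
VERB_NOUN_COUNTS = {
    ("make", "basket"): 412,
    ("do", "basket"): 3,
}

def preferred_verb(noun, candidates=("make", "do"), counts=VERB_NOUN_COUNTS):
    """Choose the verb that collocates most frequently with the noun."""
    return max(candidates, key=lambda verb: counts.get((verb, noun), 0))

learner_verb = "do"
if learner_verb != preferred_verb("basket"):
    print("Hint: 'faire' is only sometimes translated as 'do'; "
          "with 'basket', English usually prefers 'make'.")
```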

These are both relatively simple examples, but it’s easy to think of others that are much more difficult to analyse automatically. Duolingo rejects ‘I am making one basket for my mother’: it’s not very plausible, but it’s not wrong. Teachers know why learners do this (again, it’s probably a transfer problem) and know how to respond (perhaps by saying something like ‘Only one?’). Duolingo also rejects ‘I making a basket for my mother’ (a common enough error), but is unable to provide any help beyond the correct answer. Automated CF could, however, be provided in both cases if more tools are brought into play. Multiple parsing machines (one is rarely accurate enough on its own) and semantic analysis will be needed. Both the range and the complexity of the available tools are increasing so rapidly (see here for the sort of research that Google is doing and here for an insight into current applications of this research in language learning) that Duolingo-style right / wrong feedback will very soon seem positively antediluvian.

One further development is worth mentioning here, and it concerns feedback and gamification. Teachers know from the way that most learners respond to written CF that they are usually much more interested in knowing what they got right or wrong, rather than the reasons for this. Most students are more likely to spend more time looking at the score at the bottom of a corrected piece of written work than at the laborious annotations of the teacher throughout the text. Getting students to pay close attention to the feedback we provide is not easy. Online language learning systems with gamification elements, like Duolingo, typically reward learners for getting things right, and getting things right in the fewest attempts possible. They encourage learners to look for the shortest or cheapest route to finding the correct answers: learning becomes a sexed-up form of test. If, however, the automated feedback is good, this sort of gamification encourages the wrong sort of learning behaviour. Gamification designers will need to shift their attention away from the current concern with right / wrong, and towards ways of motivating learners to look at and respond to feedback. It’s tricky, because you want to encourage learners to take more risks (and reward them for doing so), but it makes no sense to penalise them for getting things right. The probable solution is to have a dual points system: one set of points for getting things right, another for employing positive learning strategies.
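
A dual points system of this kind is easy enough to sketch; the hard part is the design decision about which behaviours count as positive learning strategies and how heavily to reward them. The values below are arbitrary, purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class DualScore:
    """Two separate tallies: accuracy points for correct answers, strategy points
    for behaviours worth encouraging (reading the feedback, retrying after an error)."""
    accuracy: int = 0
    strategy: int = 0

    def record_answer(self, correct, opened_feedback=False, retried_after_feedback=False):
        if correct:
            self.accuracy += 10
        if opened_feedback:
            self.strategy += 5       # rewarded whether or not the answer was right
        if retried_after_feedback:
            self.strategy += 5

score = DualScore()
score.record_answer(correct=False, opened_feedback=True, retried_after_feedback=True)
score.record_answer(correct=True)
print(score)                         # DualScore(accuracy=10, strategy=10)
```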

The provision of automated ‘optimal feedback at the point of need’ may not be quite there yet, but it seems we’re on the way for some tasks in discrete-item learning. There will probably always be some teachers who can outperform computers in providing appropriate feedback, in the same way that a few top chess players can beat ‘Deep Blue’ and its scions. But the rest of us had better watch our backs: in the provision of some kinds of feedback, computers are catching up with us fast.

[1] Ellis, R. & N. Shintani (2014) Exploring Language Pedagogy through Second Language Acquisition Research. Abingdon: Routledge p. 249

[2] Hattie, J. (2009) Visible Learning. Abingdon: Routledge p.12

[3] Li, S. (2010) ‘The effectiveness of corrective feedback in SLA: a meta-analysis’ Language Learning 60 / 2: 309-365

[4] Brown, P.C., Roediger, H.L. & McDaniel, M.A. (2014) Make It Stick. Cambridge, Mass.: Belknap Press