Posts Tagged ‘Memrise’

Digital flashcard systems like Memrise and Quizlet remain among the most popular language learning apps. Their focus is on the deliberate learning of vocabulary, an approach described by Paul Nation (Nation, 2005) as ‘one of the least efficient ways of developing learners’ vocabulary knowledge but nonetheless […] an important part of a well-balanced vocabulary programme’. The deliberate teaching of vocabulary also features prominently in most platform-based language courses.

For both vocabulary apps and bigger courses, the lexical items need to be organised into sets for the purposes of both presentation and practice. A common way of doing this, especially at lower levels, is to group the items into semantic clusters (sets with a classifying superordinate, like body part, and a collection of example hyponyms, like arm, leg, head, chest, etc.).

The problem, as Keith Folse puts it, is that such clusters ‘are not only unhelpful, they actually hinder vocabulary retention’ (Folse, 2004: 52). Evidence for this claim may be found in Higa (1963), Tinkham (1993, 1997), Waring (1997), Erten & Tekin (2008) and Barcroft (2015), to cite just some of the better-known studies. The results, says Folse, ‘are clear and, I think, very conclusive’. The explanation that is usually given draws on interference theory: semantic similarity may lead to confusion (e.g. when learners mix up days of the week, colour words or adjectives to describe personality).

It appears, then, to be long past time to get rid of semantic clusters in language teaching. Well … not so fast. First of all, although most of the research sides with Folse, not all of it does. Nakata and Suzuki (2019) in their survey of more recent research found that results were more mixed. They found one study which suggested that there was no significant difference in learning outcomes between presenting words in semantic clusters and semantically unrelated groups (Ishii, 2015). And they found four studies (Hashemi & Gowdasiaei, 2005; Hoshino, 2010; Schneider, Healy, & Bourne, 1998, 2002) where semantic clusters had a positive effect on learning.

Nakata and Suzuki (2019) offer three reasons why semantic clustering might facilitate vocabulary learning: it (1) ‘reflects how vocabulary is stored in the mental lexicon, (2) introduces desirable difficulty, and (3) leads to extra attention, effort, or engagement from learners’. Finkbeiner and Nicol (2003) make a similar point: ‘although learning semantically related words appears to take longer, it is possible that words learned under these conditions are learned better for the purpose of actual language use (e.g., the retrieval of vocabulary during production and comprehension). That is, the very difficulty associated with learning the new labels may make them easier to process once they are learned’. Both pairs of researchers cited in this paragraph conclude that semantic clusters are best avoided, but their discussion of the possible benefits of this clustering is a recognition that the research (for reasons which I will come on to) cannot lead to categorical conclusions.

The problem, as so often with pedagogical research, is the gap between research conditions and real-world classrooms. Before looking at this in a little more detail, one relatively uncontentious observation can be made. Even those scholars who advise against semantic clustering (e.g. Papathanasiou, 2009) acknowledge that the situation is complicated by other factors, especially the level of proficiency of the learner and whether or not one or more of the hyponyms are known to the learner. At higher levels (when it is more likely that one or more of the hyponyms are already, even partially, known), semantic clustering is not a problem. I would add that, on the whole, at higher levels the deliberate learning of vocabulary is even less efficient than at lower levels and should be an increasingly small part of a well-balanced vocabulary programme.

So, why is there a problem drawing practical conclusions from the research? In order to have any scientific validity at all, researchers need to control a large number of variables. They need, for example, to be sure that learners do not already know any of the items that are being presented. The only practical way of doing this is to present sets of invented words, and this is what most of the research does (Sarioğlu, 2018). These artificial words solve one problem, but create others, the most significant of which is item difficulty. Many factors impact on item difficulty, and these include word frequency (obviously a problem with invented words), word length, pronounceability and the familiarity and length of the corresponding item in L1. None of the studies which support the abandonment of semantic clusters have controlled all of these variables (Nakata and Suzuki, 2019). Indeed, it would be practically impossible to do so. Learning pseudo-words is a very different proposition to learning real words, which a learner may subsequently encounter or want to use.

Take, for example, the days of the week. It’s quite common for learners to muddle up Tuesday and Thursday. The reason for this is not just semantic similarity (Tuesday and Monday are less frequently confused). They are also very similar in terms of both spelling and pronunciation. They are ‘synforms’ (see Laufer, 2009), which, like semantic clusters, can hinder learning of new items. But, now imagine a French-speaking learner of Spanish studying the days of the week. It is much less likely that martes and jueves will be muddled, because of their similarity to the French words mardi and jeudi. There would appear to be no good reason not to teach the complete set of days of the week to a learner like this. All other things being equal, it is probably a good idea to avoid semantic clusters, but all other things are very rarely equal.

Again, in an attempt to control for variables, researchers typically present the target items in isolation (in bilingual pairings). But, again, the real world does not normally conform to this condition. Leo Sellivan (2014) suggests that semantic clusters (e.g. colours) are taught as part of collocations. He gives the examples of red dress, green grass and black coffee, and points out that the alliterative patterns can serve as mnemonic devices which will facilitate learning. The suggestion is, I think, a very good one, but, more generally, it’s worth noting that the presentation of lexical items in both digital flashcards and platform courses is rarely context-free. Contexts will inevitably impact on learning and may well obviate the risks of semantic clustering.

Finally, this kind of research typically gives participants very restricted time to memorize the target words (Sarioğlu, 2018) and they are tested in very controlled recall tasks. In the case of language platform courses, practice of target items is usually spread out over a much longer period of time, with a variety of exposure opportunities (in controlled practice tasks, exposure in texts, personalisation tasks, revision exercises, etc.) both within and across learning units. In this light, it is not unreasonable to argue that laboratory-type research offers only limited insights into what should happen in the real world of language learning and teaching. The choice of learning items, the way they are presented and practised, and the variety of activities in the well-balanced vocabulary programme are probably all more significant than the question of whether items are organised into semantic clusters.

Although semantic clusters are quite common in language learning materials, much more common are thematic clusters (i.e. groups of words which are topically related, but include a variety of parts of speech; see below). Researchers, it seems, have no problem with this way of organising lexical sets. By way of conclusion, here’s an extract from a recent book:

‘Introducing new words together that are similar in meaning (synonyms), such as scared and frightened, or forms (synforms), like contain and maintain, can be confusing, and students are less likely to remember them. This problem is known as ‘interference’. One way to avoid this is to choose words that are around the same theme, but which include a mix of different parts of speech. For example, if you want to focus on vocabulary to talk about feelings, instead of picking lots of adjectives (happy, sad, angry, scared, frightened, nervous, etc.) include some verbs (feel, enjoy, complain) and some nouns (fun, feelings, nerves). This also encourages students to use a variety of structures with the vocabulary.’ (Hughes et al., 2019: 25)



Barcroft, J. 2015. Lexical Input Processing and Vocabulary Learning. Amsterdam: John Benjamins

Erten, I.H., & Tekin, M. 2008. Effects on vocabulary acquisition of presenting new words in semantic sets versus semantically-unrelated sets. System, 36 (3), 407-422

Finkbeiner, M. & Nicol, J. 2003. Semantic category effects in second language word learning. Applied Psycholinguistics 24 (2003), 369–383

Folse, K. S. 2004. Vocabulary Myths. Ann Arbor: University of Michigan Press

Hashemi, M.R., & Gowdasiaei, F. 2005. An attribute-treatment interaction study: Lexical-set versus semantically-unrelated vocabulary instruction. RELC Journal, 36 (3), 341-361

Higa, M. 1963. Interference effects of intralist word relationships in verbal learning. Journal of Verbal Learning and Verbal Behavior, 2, 170-175

Hoshino, Y. 2010. The categorical facilitation effects on L2 vocabulary learning in a classroom setting. RELC Journal, 41, 301–312

Hughes, S. H., Mauchline, F. & Moore, J. 2019. ETpedia Vocabulary. Shoreham-by-Sea: Pavilion Publishing and Media

Ishii, T. 2015. Semantic connection or visual connection: Investigating the true source of confusion. Language Teaching Research, 19, 712–722

Laufer, B. 2009. The concept of ‘synforms’ (similar lexical forms) in vocabulary acquisition. Language and Education, 2 (2): 113 – 132

Nakata, T. & Suzuki, Y. 2019. Effects Of Massing And Spacing On The Learning Of Semantically Related And Unrelated Words. Studies in Second Language Acquisition 41 (2), 287 – 311

Nation, P. 2005. Teaching Vocabulary. Asian EFL Journal.

Papathanasiou, E. 2009. An investigation of two ways of presenting vocabulary. ELT Journal 63 (4), 313 – 322

Sarioğlu, M. 2018. A Matter of Controversy: Teaching New L2 Words in Semantic Sets or Unrelated Sets. Journal of Higher Education and Science Vol 8 / 1: 172 – 183

Schneider, V. I., Healy, A. F., & Bourne, L. E. 1998. Contextual interference effects in foreign language vocabulary acquisition and retention. In Healy, A. F. & Bourne, L. E. (Eds.), Foreign language learning: Psycholinguistic studies on training and retention (pp. 77–90). Mahwah, NJ: Erlbaum

Schneider, V. I., Healy, A. F., & Bourne, L. E. 2002. What is learned under difficult conditions is hard to forget: Contextual interference effects in foreign vocabulary acquisition, retention, and transfer. Journal of Memory and Language, 46, 419–440

Sellivan, L. 2014. Horizontal alternatives to vertical lists. Blog post:

Tinkham, T. 1993. The effect of semantic clustering on the learning of second language vocabulary. System 21 (3), 371-380.

Tinkham, T. 1997. The effects of semantic and thematic clustering on the learning of a second language vocabulary. Second Language Research, 13 (2),138-163

Waring, R. 1997. The negative effects of learning words in semantic sets: a replication. System, 25 (2), 261 – 274


A personalized language learning programme that is worth its name needs to offer a wide variety of paths to accommodate the varying interests, priorities, levels and preferred approaches to learning of the users of the programme. For this to be possible, a huge quantity of learning material is needed (Iwata et al., 2011: 1): the preparation and curation of this material is extremely time-consuming and expensive (despite the pittance that is paid to writers and editors). It’s not surprising, then, that a growing amount of research is being devoted to the exploration of ways of automatically generating language learning material. One area that has attracted a lot of attention is the learning of vocabulary.

Many simple vocabulary learning tasks are relatively easy to generate automatically. These include matching tasks of various kinds, such as the matching of words or phrases to meanings (either in English or the L1), pictures or collocations, as in many flashcard apps. Doing it well is rather harder: the definitions or translations have to be good and appropriate for learners of the level, and the pictures need to be appropriate too. If, as is often the case, the lexical items have come from a text or form part of a group of some kind, sense disambiguation software will be needed to ensure that the right meaning is being practised. Anyone who has used flashcard apps knows that the major problem is usually the quality of the content (whether it has been automatically generated or written by someone).

A further challenge is the generation of distractors. In the example here (from Memrise), the distractors have been so badly generated as to render the task more or less a complete waste of time. Distractors must, in some way, be viable alternatives (Smith et al., 2010) but still clearly wrong. That means they should normally be the same part of speech and true cognates should be avoided. Research into the automatic generation of distractors is well-advanced (see, for instance, Kumar et al., 2015) with Smith et al. (2010), for example, using a very large corpus and various functions of Sketch Engine (the most well-known corpus query tool) to find collocates and other distractors. Their TEDDCLOG (Testing English with Data-Driven CLOze Generation) system produced distractors that were deemed acceptable 91% of the time. Whilst impressive, there is still a long way to go before human editing / rewriting is no longer needed.
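To make the ‘same part of speech’ requirement concrete, here is a minimal sketch in Python. It is not how TEDDCLOG or Memrise work: the tiny `LEXICON` and the function name `pick_distractors` are invented for illustration, and a real system would draw its candidates from a large tagged corpus. The sketch only enforces the part-of-speech constraint; it makes no attempt to check that a distractor is ‘clearly wrong’ in a given sentence, which is the hard part.

```python
import random

# Hypothetical mini-lexicon mapping each word to its part of speech.
# A real system would take this information from a large tagged corpus.
LEXICON = {
    "novel": "noun", "poem": "noun", "letter": "noun", "diary": "noun",
    "write": "verb", "read": "verb", "quickly": "adverb", "long": "adjective",
}

def pick_distractors(target, lexicon, k=3, seed=None):
    """Return k words that share the target's part of speech, excluding the target."""
    pos = lexicon[target]
    candidates = sorted(w for w, p in lexicon.items() if p == pos and w != target)
    if len(candidates) < k:
        raise ValueError("not enough same-part-of-speech candidates")
    # A seeded generator keeps the selection repeatable when a seed is given.
    return random.Random(seed).sample(candidates, k)
```

Calling `pick_distractors("novel", LEXICON)` returns the three other nouns in some order. Anything beyond this (checking collocation, level, or cognate status) needs corpus evidence that a word list alone cannot supply.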

One area that has attracted attention is, of course, testing, and particularly test tasks such as those in TOEFL (see image). Susanti et al. (2015, 2017) were able, given a target word, to automatically generate a reading passage from web sources along with questions of the TOEFL kind. However, only about half of them were considered good enough to be used in actual tests. Again, that is some way off avoiding human intervention altogether, but the automatically generated texts and questions can greatly facilitate the work of human item writers.

[Image: TOEFL task]


Other tools that might be useful include the University of Nottingham AWL (Academic Word List) Gapmaker. This allows users to type or paste in a text, from which items from the AWL are extracted and replaced with gaps. See the example below. It would, presumably, not be too difficult to combine this approach with automatic distractor generation and to create multiple choice tasks.
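The basic gap-making idea can be sketched in a few lines of Python. This is an illustration only, not the Nottingham tool: `SAMPLE_WORDS` is a made-up stand-in for the Academic Word List, `make_gaps` is a hypothetical name, and the sketch matches exact word forms only, whereas a usable tool would also need to handle inflected and derived forms.

```python
import re

# Hypothetical stand-in for the Academic Word List; the real list is far longer.
SAMPLE_WORDS = {"analyse", "concept", "data", "research"}

def make_gaps(text, word_list, gap="______"):
    """Replace each listed word in the text with a gap.

    Returns the gapped text and the removed words, in order of appearance.
    """
    answers = []

    def replace(match):
        word = match.group(0)
        if word.lower() in word_list:
            answers.append(word)
            return gap
        return word

    # Visit every run of letters and let replace() decide whether to gap it.
    gapped = re.sub(r"[A-Za-z]+", replace, text)
    return gapped, answers
```

Given the sentence ‘The research uses new data.’, this returns the gapped sentence together with the answer list `["research", "data"]`, which could then be fed to a distractor generator to build multiple choice items.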


There are a number of applications that offer the possibility of generating cloze tasks from texts selected by the user (learner or teacher). These have not always been designed with the language learner in mind but one that was is the Android app, WordGap (Knoop & Wilske, 2013). It is described by its developers as a tool that ‘provides highly individualized exercises to support contextualized mobile vocabulary learning …. It matches the interests of the learner and increases the motivation to learn’. It may well do all that, but then again, perhaps not. As Knoop & Wilske acknowledge, it is only appropriate for adult, advanced learners and its value as a learning task is questionable. The target item that has been automatically selected is ‘novel’, a word that features in the list Oxford 2000 Keywords (as do all three distractors), and therefore ought to be well below the level of the users. Some people might find this fun, but, in terms of learning, they would probably be better off using an app that made instant look-up of words in the text possible.

More interesting, in my view, is TEDDCLOG (Smith et al., 2010), a system that, given a target learning item (here the focus is on collocations), trawls a large corpus to find the best sentence that illustrates it. ‘Good sentences’ were defined as those which are short (but not too short, or there is not enough useful context), begin with a capital letter and end with a full stop, have a maximum of two commas, and otherwise contain only the 26 lowercase letters. A good sentence must be at a lexical and grammatical level that an intermediate level learner of English could be expected to understand. It must be well-formed and without too much superfluous material. All others were rejected. TEDDCLOG uses Sketch Engine’s GDEX function (Good Dictionary Example Extractor, Kilgarriff et al., 2008) to do this.
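The surface criteria in that definition are simple enough to express as a filter. The Python sketch below is my own illustration, not GDEX or TEDDCLOG code: the name `is_candidate` and the word-count limits of 5 and 15 are assumptions, since the source only says ‘short (but not too short)’. The harder criteria (learner level, well-formedness, superfluous material) are not attempted here.

```python
import re

# Shape check: one capital letter, then only lowercase letters, spaces and
# commas, then a final full stop.
SHAPE = re.compile(r"[A-Z][a-z ,]*\.")

def is_candidate(sentence, min_words=5, max_words=15):
    """Apply the surface criteria only; the word limits are assumed values."""
    if not SHAPE.fullmatch(sentence):
        return False
    if sentence.count(",") > 2:
        return False
    return min_words <= len(sentence.split()) <= max_words
```

A sentence such as ‘She made a strong impression on the committee.’ passes, while anything containing digits, symbols, a second capital letter or more than two commas is rejected. A filter like this throws away a great many perfectly good sentences, which is acceptable only when the corpus is very large.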

My own interest in this area came about as a result of my work in the development of the Oxford Vocabulary Trainer. The app offers the possibility of studying both pre-determined lexical items (e.g. the vocabulary list of a coursebook that the learner is using) and free choice (any item could be activated and sent to a learning queue). In both cases, practice takes the form of sentences with the target item gapped. There is a range of hints and help options available to the learner, and feedback is both automatic and formative (i.e. if the supplied answer is not correct, hints are given to push the learner to do better on a second attempt). Leveraging some fairly heavy technology, we were able to achieve a fair amount of success in the automation of intelligent feedback, but what had, at first sight, seemed a lesser challenge – the generation of suitable ‘carrier sentences’ – proved more difficult.

The sentences which ‘carry’ the gap should, ideally, be authentic: invented examples often ‘do not replicate the phraseology and collocational preferences of naturally-occurring text’ (Smith et al., 2010). The technology of corpus search tools should allow us to do a better job than human item writers. For that to be the case, we need not only good search tools but a good corpus … and some are better than others for the purposes of language learning. As Fenogenova & Kuzmenko (2016) discovered when using different corpora to automatically generate multiple choice vocabulary exercises, the British Academic Written English corpus (BAWE) was almost 50% more useful than the British National Corpus (BNC). In the development of the Oxford Vocabulary Trainer, we thought we had the best corpus we could get our hands on – the tagged corpus used for the production of the Oxford suite of dictionaries. We could, in addition and when necessary, turn to other corpora, including the BAWE and the BNC. Our requirements for acceptable carrier sentences were similar to those of Smith et al (2010), but were considerably more stringent.

To cut quite a long story short, we learnt fairly quickly that we simply couldn’t automate the generation of carrier sentences with sufficient consistency or reliability. As with some of the other examples discussed in this post, we were able to use the technology to help the writers in their work. We also learnt (rather belatedly, it has to be admitted) that we were trying to find technological solutions to problems that we hadn’t adequately analysed at the start. We hadn’t, for example, given sufficient thought to learner differences, especially the role of L1 (and other languages) in learning English. We hadn’t thought enough about the ‘messiness’ of either language or language learning. It’s possible, given enough resources, that we could have found ways of improving the algorithms, of leveraging other tools, or of deploying additional databases (especially learner corpora) in our quest for a personalised vocabulary learning system. But, in the end, it became clear to me that we were only nibbling at the problem of vocabulary learning. Deliberate learning of vocabulary may be an important part of acquiring a language, but it remains only a relatively small part. Technology may be able to help us in a variety of ways (and much more so in testing than learning), but the dreams of the data scientists (who wrote much of the research cited here) are likely to be short-lived. Experienced writers and editors of learning materials will be needed for the foreseeable future. And truly personalized vocabulary learning, fully supported by technology, will not be happening any time soon.



Fenogenova, A. & Kuzmenko, E. 2016. Automatic Generation of Lexical Exercises Available online at

Iwata, T., Goto, T., Kojiri, T., Watanabe, T. & T. Yamada. 2011. ‘Automatic Generation of English Cloze Questions Based on Machine Learning’. NTT Technical Review Vol. 9 No. 10 Oct. 2011

Kilgarriff, A. et al. 2008. ‘GDEX: Automatically Finding Good Dictionary Examples in a Corpus.’ In E. Bernal and J. DeCesaris (eds.), Proceedings of the XIII EURALEX International Congress: Barcelona, 15-19 July 2008. Barcelona: l’Institut Universitari de Lingüística Aplicada (IULA) de la Universitat Pompeu Fabra, 425–432.

Knoop, S. & Wilske, S. 2013. ‘WordGap – Automatic generation of gap-filling vocabulary exercises for mobile learning’. Proceedings of the second workshop on NLP for computer-assisted language learning at NODALIDA 2013. NEALT Proceedings Series 17 / Linköping Electronic Conference Proceedings 86: 39–47. Available online at

Kumar, G., Banchs, R.E. & D’Haro, L.F. 2015. ‘RevUP: Automatic Gap-Fill Question Generation from Educational Texts’. Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, 2015, pp. 154–161, Denver, Colorado, June 4, Association for Computational Linguistics

Smith, S., Avinesh, P.V.S. & Kilgarriff, A. 2010. ‘Gap-fill tests for Language Learners: Corpus-Driven Item Generation’. Proceedings of ICON-2010: 8th International Conference on Natural Language Processing, Macmillan Publishers, India. Available online at

Susanti, Y., Iida, R. & Tokunaga, T. 2015. ‘Automatic Generation of English Vocabulary Tests’. Proceedings of 7th International Conference on Computer Supported Education. Available online

Susanti, Y., Tokunaga, T., Nishikawa, H. & H. Obari 2017. ‘Evaluation of automatically generated English vocabulary questions’ Research and Practice in Technology Enhanced Learning 12 / 11


I have been putting in a lot of time studying German vocabulary with Memrise lately, but this is not a review of the Memrise app. For that, I recommend you read Marek Kiczkowiak’s second post on this app. Like me, he’s largely positive, although I am less enthusiastic about Memrise’s USP, the use of mnemonics. It’s not that mnemonics don’t work – there’s a lot of evidence that they do: it’s just that there is little or no evidence that they’re worth the investment of time.

Time … as I say, I have been putting in the hours. Every day, for over a month, averaging a couple of hours a day, it’s enough to get me very near the top of the leader board (which I keep a very close eye on) and it means that I am doing more work than 99% of other users. And, yes, my German is improving.

Putting in the time is the sine qua non of any language learning and a well-designed app must motivate users to do this. Relevant content will be crucial, as will satisfactory design, both visual and interactive. But here I’d like to focus on the two other key elements: task design / variety and gamification.

Memrise offers a limited range of task types: presentation cards (with word, phrase or sentence with translation and audio recording), multiple choice (target item with four choices), unscrambling letters or words, and dictation (see below).


As Marek writes, it does get a bit repetitive after a while (although less so than thumbing through a pack of cardboard flashcards). The real problem, though, is that there are only so many things an app designer can do with standard flashcards, if they are to contribute to learning. True, there could be a few more game-like tasks (as with Quizlet), races against the clock as you pop word balloons or something of the sort, but, while these might, just might, help with motivation, these games rarely, if ever, contribute much to learning.

What’s more, you’ll get fed up with the games sooner or later if you’re putting in serious study hours. Even if Memrise were to double the number of activity types, I’d have got bored with them by now, in the same way I got bored with the Quizlet games. Bear in mind, too, that I’ve only done a month: I have at least another two months to go before I finish the level I’m working on. There’s another issue with ‘fun’ activities / games which I’ll come on to later.

The options for task variety in vocabulary / memory apps are therefore limited. Let’s look at gamification. Memrise has leader boards (weekly, monthly, ‘all time’), streak badges, daily goals, email reminders and (in the laptop and premium versions) a variety of graphs that allow you to analyse your study patterns. Your degree of mastery of learning items is represented by a flower that grows leaves, blooms and withers. None of this is especially original or different from similar apps.

The trouble with all of this is that it can only work for a certain time and, for some people, never. There’s always going to be someone like me who can put in a couple of hours a day more than you can. Or someone, in my case, like ‘Nguyenduyha’, who must be doing about four hours a day, and who, I know, is out of my league. I can’t compete and the realisation slowly dawns that my life would be immeasurably sadder if I tried to.

Having said that, I have tried to compete and the way to do so is by putting in the time on the ‘speed review’. This is the closest that Memrise comes to a game. One hundred items are flashed up with four multiple choices and these are against the clock. The quicker you are, the more points you get, and if you’re too slow, or you make a mistake, you lose a life. That’s how you gain lots of points with Memrise. The problem is that, at best, this task only promotes receptive knowledge of the items, which is not what I need by this stage. At worst, it serves no useful learning function at all because I have learnt ways of doing this well which do not really involve me processing meaning at all. As Marek says in his post (in reference to Quizlet), ‘I had the feeling that sometimes I was paying more attention to ‘winning’ the game and scoring points, rather than to the words on the screen.’ In my case, it is not just a feeling: it’s an absolute certainty.


Sadly, the gamification is working against me. The more time I spend on the U-Bahn doing Memrise, the less time I spend reading the free German-language newspapers, the less time I spend eavesdropping on conversations. Two hours a day is all I have time for in my German study, and Memrise is eating it all up. I know that there are other, and better, ways of learning. In order to do what I know I should be doing, I need to ignore the gamification. For those more reasonable students who can regularly do their fifteen minutes a day, day in – day out, the points and leader boards serve no real function at all.

Cheating at gamification, or gaming the system, is common in app-land. A few years ago, Memrise had to take down their leader board when they realised that cheating was taking place. There’s an inexorable logic to this: gamification is an attempt to motivate by rewarding through points, rather than the reward coming from the learning experience. The logic of the game overtakes itself. Is ‘Nguyenduyha’ cheating, or do they simply have nothing else to do all day? Am I cheating by finding time to do pointless ‘speed reviews’ that earn me lots of points?

For users like myself, then, gamification design needs to be a delicate balancing act. For others, it may be largely an irrelevance. I’ve been working recently on a general model of vocabulary app design that looks at two very different kinds of user. On the one hand, there are the self-motivated learners like myself or the millions of others who have chosen to use self-study apps. On the other, there are the millions of students in schools and colleges, studying English among other subjects, some of whom are now being told to use the vocabulary apps that are beginning to appear packaged with their coursebooks (or other learning material). We’ve never found entirely satisfactory ways of making these students do their homework, and the fact that this homework is now digital will change nothing (except, perhaps, in the very, very short term). The incorporation of games and gamification is unlikely to change much either: there will always be something more interesting and motivating (and unconnected with language learning) elsewhere.

Teachers and college principals may like the idea of gamification (without having really experienced it themselves) for their students. But more important for most of them is likely to be the teacher dashboard: the means by which they can check that their students are putting the time in. Likewise, they will see the utility of automated email reminders that a student is not working hard enough to meet their learning objectives, more and more regular tests that contribute to overall course evaluation, comparisons with college, regional or national benchmarks. Technology won’t solve the motivation issue, but it does offer efficient means of control.