Automated generation of vocabulary learning materials

Posted: March 24, 2018 in Personalization, research

A personalized language learning programme that is worthy of the name needs to offer a wide variety of paths to accommodate the varying interests, priorities, levels and preferred learning approaches of its users. For this to be possible, a huge quantity of learning material is needed (Iwata et al., 2011: 1): the preparation and curation of this material is extremely time-consuming and expensive (despite the pittance that is paid to writers and editors). It’s not surprising, then, that a growing amount of research is being devoted to the exploration of ways of automatically generating language learning material. One area that has attracted a lot of attention is the learning of vocabulary.

[Image: Memrise screenshot]

Many vocabulary learning tasks are relatively simple to generate automatically. These include matching tasks of various kinds, such as the matching of words or phrases to meanings (either in English or the L1), to pictures, or to collocations, as in many flashcard apps. Doing it well is rather harder: the definitions or translations have to be accurate and appropriate for learners at the level in question, and the pictures need to be apt. If, as is often the case, the lexical items come from a text or form part of a group of some kind, sense disambiguation software will be needed to ensure that the right meaning is being practised. Anyone who has used flashcard apps knows that the major problem is usually the quality of the content (whether it has been automatically generated or written by someone).
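The mechanics of such a matching task are trivial to sketch; everything hard lives in the content. Here is a minimal illustration in Python (the mini-glossary is invented for the example):

```python
import random

def make_matching_task(glossary, n_items=4, seed=0):
    """Build a word-to-definition matching task from a {word: definition} dict.

    Returns the selected words in one list and their definitions shuffled
    in another, so the learner has to pair them up.
    """
    rng = random.Random(seed)  # fixed seed so the task is reproducible
    words = rng.sample(sorted(glossary), k=min(n_items, len(glossary)))
    definitions = [glossary[w] for w in words]
    rng.shuffle(definitions)
    return words, definitions

# Invented mini-glossary, purely for illustration
glossary = {
    "novel": "a long written story",
    "trawl": "to search through a large amount of material",
    "gap": "an empty space where something is missing",
    "corpus": "a large collection of texts",
}

words, definitions = make_matching_task(glossary)
```

As the paragraph above says, the code is the easy bit: the quality of the glosses, and picking the right sense for each word, is where the real work (and expense) sits.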

A further challenge is the generation of distractors. In the example here (from Memrise), the distractors have been so badly generated as to render the task more or less a complete waste of time. Distractors must, in some way, be viable alternatives (Smith et al., 2010) but still clearly wrong. That means they should normally be the same part of speech as the target item, and true cognates should be avoided. Research into the automatic generation of distractors is well advanced (see, for instance, Kumar et al., 2015), with Smith et al. (2010), for example, using a very large corpus and various functions of Sketch Engine (the best-known corpus query tool) to find collocates and other distractors. Their TEDDCLOG (Testing English with Data-Driven CLOze Generation) system produced distractors that were deemed acceptable 91% of the time. Impressive as that is, there is still a long way to go before human editing / rewriting is no longer needed.
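Those two constraints (same part of speech, nothing formally too close to the target) can be sketched very simply. The tagged word list and the crude prefix-based similarity check below are my own inventions for illustration; real systems such as TEDDCLOG rely on corpus statistics rather than anything this naive:

```python
def plausible_distractors(target, pos, candidates, n=3):
    """Pick distractors that share the target's part of speech but are
    not so similar in form that they look like cognates or derivatives.

    `candidates` is a list of (word, pos) pairs, e.g. from a tagged word list.
    """
    def too_similar(a, b):
        # crude form-similarity check: shared four-letter prefix
        return a[:4] == b[:4]

    picked = []
    for word, word_pos in candidates:
        if word == target or word_pos != pos:
            continue  # distractors should match the target's POS
        if too_similar(word, target):
            continue  # avoid near-relatives of the target ('novelist' for 'novel')
        picked.append(word)
        if len(picked) == n:
            break
    return picked

# Invented tagged word list, purely for illustration
candidates = [
    ("novelist", "noun"), ("story", "noun"), ("quickly", "adverb"),
    ("journal", "noun"), ("essay", "noun"), ("write", "verb"),
]
distractors = plausible_distractors("novel", "noun", candidates)
# 'novelist' is rejected as too close in form; 'quickly' and 'write' as the wrong POS
```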

Another area that has attracted attention is, of course, testing, in particular task types such as those used in TOEFL (see image). Susanti et al. (2015, 2017) were able, given a target word, to automatically generate a reading passage from web sources, along with questions of the TOEFL kind. However, only about half of these were considered good enough to be used in actual tests. Again, that is some way off avoiding human intervention altogether, but the automatically generated texts and questions can greatly facilitate the work of human item writers.

[Image: TOEFL task]


Other tools that might be useful include the University of Nottingham AWL (Academic Word List) Gapmaker. This allows users to type or paste in a text, from which items from the AWL are extracted and replaced with gaps. See the example below. It would, presumably, not be too difficult to combine this approach with automatic distractor generation and so create multiple choice tasks.
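The gap-making mechanism itself is straightforward. A sketch, using a five-word stand-in for the full AWL:

```python
import re

# A handful of AWL items, standing in for the full list
AWL_SAMPLE = {"analyse", "concept", "data", "research", "theory"}

def gap_awl_items(text, awl=AWL_SAMPLE):
    """Replace any AWL item in the text with a numbered gap,
    returning the gapped text and the answer key."""
    answers = []

    def replace(match):
        word = match.group(0)
        if word.lower() in awl:
            answers.append(word)
            return f"({len(answers)}) ________"
        return word

    gapped = re.sub(r"[A-Za-z]+", replace, text)
    return gapped, answers

text = "The research used new data to test the theory."
gapped, answers = gap_awl_items(text)
# gapped: "The (1) ________ used new (2) ________ to test the (3) ________."
```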


[Image: WordGap screenshot]

There are a number of applications that offer the possibility of generating cloze tasks from texts selected by the user (learner or teacher). These have not always been designed with the language learner in mind, but one that was is the Android app WordGap (Knoop & Wilske, 2013). It is described by its developers as a tool that ‘provides highly individualized exercises to support contextualized mobile vocabulary learning …. It matches the interests of the learner and increases the motivation to learn’. It may well do all that, but then again, perhaps not. As Knoop & Wilske acknowledge, it is only appropriate for adult, advanced learners, and its value as a learning task is questionable. The target item that has been automatically selected here is ‘novel’, a word that features in the Oxford 2000 Keywords list (as do all three distractors), and therefore ought to be well below the level of the users. Some people might find this fun but, in terms of learning, they would probably be better off using an app that made instant look-up of words in the text possible.

More interesting, in my view, is TEDDCLOG (Smith et al., 2010), a system that, given a target learning item (here the focus is on collocations), trawls a large corpus to find the best sentence that illustrates it. ‘Good sentences’ were defined as those which are short (but not too short, or there is not enough useful context), begin with a capital letter and end with a full stop, have a maximum of two commas, and otherwise contain only the 26 lowercase letters. A sentence must also be at a lexical and grammatical level that an intermediate learner of English could be expected to understand, and it must be well-formed and without too much superfluous material. All others were rejected. TEDDCLOG uses Sketch Engine’s GDEX function (Good Dictionary Example Extractor, Kilgarriff et al., 2008) to do this.
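The surface criteria translate almost directly into code. A minimal sketch (the length thresholds are my invention, and GDEX proper weighs many more features than these):

```python
def is_good_carrier_sentence(sentence, min_len=40, max_len=90):
    """Apply the surface criteria described above: not too short or too long,
    a sentence-initial capital, a final full stop, at most two commas, and
    otherwise only lowercase letters and spaces."""
    if not (min_len <= len(sentence) <= max_len):
        return False
    if not (sentence[0].isupper() and sentence.endswith(".")):
        return False
    if sentence.count(",") > 2:
        return False
    body = sentence[1:-1].replace(",", "")
    return all(ch.islower() or ch == " " for ch in body)

candidates = [
    "She spent the whole afternoon trawling the archives for old letters.",
    "WOW!!! best book EVER",
    "He left.",  # well-formed, but far too short to give useful context
]
good = [s for s in candidates if is_good_carrier_sentence(s)]
```

Filters like this are cheap to run over millions of corpus sentences; the hard part, as the next criterion in the list above makes clear, is judging whether the surviving sentences are at a level an intermediate learner can actually understand.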

My own interest in this area came about as a result of my work on the development of the Oxford Vocabulary Trainer. The app offers the possibility of studying both pre-determined lexical items (e.g. the vocabulary list of a coursebook that the learner is using) and freely chosen ones (any item can be activated and sent to a learning queue). In both cases, practice takes the form of sentences with the target item gapped. A range of hints and help options is available to the learner, and feedback is both automatic and formative (i.e. if the supplied answer is not correct, hints are given to push the learner to do better on a second attempt). Leveraging some fairly heavy technology, we were able to achieve a fair amount of success in the automation of intelligent feedback, but what had at first sight seemed a lesser challenge, the generation of suitable ‘carrier sentences’, proved more difficult.

The sentences which ‘carry’ the gap should, ideally, be authentic: invented examples often ‘do not replicate the phraseology and collocational preferences of naturally-occurring text’ (Smith et al., 2010). The technology of corpus search tools should allow us to do a better job than human item writers. For that to be the case, we need not only good search tools but a good corpus … and some are better than others for the purposes of language learning. As Fenogenova & Kuzmenko (2016) discovered when using different corpora to automatically generate multiple choice vocabulary exercises, the British Academic Written English corpus (BAWE) was almost 50% more useful than the British National Corpus (BNC). In the development of the Oxford Vocabulary Trainer, we thought we had the best corpus we could get our hands on – the tagged corpus used for the production of the Oxford suite of dictionaries. We could, in addition and when necessary, turn to other corpora, including the BAWE and the BNC. Our requirements for acceptable carrier sentences were similar to those of Smith et al (2010), but were considerably more stringent.

To cut quite a long story short, we learnt fairly quickly that we simply couldn’t automate the generation of carrier sentences with sufficient consistency or reliability. As with some of the other examples discussed in this post, we were able to use the technology to help the writers in their work. We also learnt (rather belatedly, it has to be admitted) that we were trying to find technological solutions to problems that we hadn’t adequately analysed at the start. We hadn’t, for example, given sufficient thought to learner differences, especially the role of L1 (and other languages) in learning English. We hadn’t thought enough about the ‘messiness’ of either language or language learning. It’s possible, given enough resources, that we could have found ways of improving the algorithms, of leveraging other tools, or of deploying additional databases (especially learner corpora) in our quest for a personalised vocabulary learning system. But, in the end, it became clear to me that we were only nibbling at the problem of vocabulary learning. Deliberate learning of vocabulary may be an important part of acquiring a language, but it remains only a relatively small part. Technology may be able to help us in a variety of ways (and much more so in testing than learning), but the dreams of the data scientists (who wrote much of the research cited here) are likely to be short-lived. Experienced writers and editors of learning materials will be needed for the foreseeable future. And truly personalized vocabulary learning, fully supported by technology, will not be happening any time soon.



Fenogenova, A. & Kuzmenko, E. 2016. ‘Automatic Generation of Lexical Exercises’.

Iwata, T., Goto, T., Kojiri, T., Watanabe, T. & T. Yamada. 2011. ‘Automatic Generation of English Cloze Questions Based on Machine Learning’. NTT Technical Review Vol. 9 No. 10 Oct. 2011

Kilgarriff, A. et al. 2008. ‘GDEX: Automatically Finding Good Dictionary Examples in a Corpus.’ In E. Bernal and J. DeCesaris (eds.), Proceedings of the XIII EURALEX International Congress: Barcelona, 15-19 July 2008. Barcelona: l’Institut Universitari de Lingüística Aplicada (IULA) de la Universitat Pompeu Fabra, 425–432.

Knoop, S. & Wilske, S. 2013. ‘WordGap – Automatic generation of gap-filling vocabulary exercises for mobile learning’. Proceedings of the second workshop on NLP for computer-assisted language learning at NODALIDA 2013. NEALT Proceedings Series 17 / Linköping Electronic Conference Proceedings 86: 39–47.

Kumar, G., Banchs, R.E. & D’Haro, L.F. 2015. ‘RevUP: Automatic Gap-Fill Question Generation from Educational Texts’. Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, 2015, pp. 154–161, Denver, Colorado, June 4, Association for Computational Linguistics

Smith, S., Avinesh, P.V.S. & Kilgarriff, A. 2010. ‘Gap-fill Tests for Language Learners: Corpus-Driven Item Generation’. Proceedings of ICON-2010: 8th International Conference on Natural Language Processing, Macmillan Publishers, India.

Susanti, Y., Iida, R. & Tokunaga, T. 2015. ‘Automatic Generation of English Vocabulary Tests’. Proceedings of the 7th International Conference on Computer Supported Education.

Susanti, Y., Tokunaga, T., Nishikawa, H. & H. Obari. 2017. ‘Evaluation of automatically generated English vocabulary questions’. Research and Practice in Technology Enhanced Learning 12 / 11.


  1. eflnotes says:

    hi Philip

    might it be useful to consider the difference between learning a new word and retrieving words partially/fully learned? i.e. between initial comprehension and long-term retention?

    a recent study has found benefits of reduced contextual information

    (glad to see it was only your namesake that has been trending on twitter : )

    • philipjkerr says:

      Hi Mura
      Yes, I am sure you’re right. I conflated a lot of things in this post. The whole business of the value of contextual information is complex, to say the least.
      Shame about Philip Kerr. I’ve enjoyed quite a few of his books.

  2. gotanda says:

    Hi Philip,

    I shared this on and received a question there.

    “Interesting read. Do you know more about the approach they choose to prodict the carrier sentences/flash carxs/distracts?

    … I mean which computational linguistic model?”

    Reading around the typos (thumbs on phone I guess) can you share anything about the approaches you tried for automating the production of carrier sentences?


    • philipjkerr says:

      Hi Ted
      It’s not the typos that throw me – it’s the question! Everyone doing this kind of thing (I think) uses fairly similar approaches, although the corpora may well vary, and the approaches are described in some detail in the research I cite. The approach in the company I was working with used a fairly standard combination of n-gram analysis with semantic analysis (using a very large tagged multilingual dictionary corpus). But I’m no expert on the details (my role was essentially to evaluate the results of what the programmers came up with).
      Best wishes

  3. Matt Byrski says:

    Hi there,
    I’m just an ordinary English teacher who stumbled upon this post when browsing. A lot of this may inevitably go over my head but it seems comfortably reassuring that machines will not be able to take over just yet.
    I wanted to ask for advice… I use Quizlet to produce ‘personalised’ sets of flashcards for the groups I teach. We identify the words the group (by asking individuals, though) want to remember from a lesson, and I feed them into a Quizlet set with a definition and a gapped sentence (or sentences) which I consider would help illustrate the context in which we originally came across the word. Quizlet boasts a number of functionalities which are claimed to help deliberate vocabulary learning. One such functionality is known as ‘Learn’. It produces a multiple-choice task using other definitions from the set as distractors. I gather from your article that this may not be very effective. Additionally, I think I spoil this activity even more because I, as a rule, leave in the first letters of the words I gap in the definitions/sentences for the terms my groups are trying to learn. The rationale behind this is (and this is how I explain it to my students) that, in a few months’ time (after the course ends), when they look at the set again, they will probably be able to come up with a number of words which have a similar meaning (in some cases at least), or what I could now call (having read your post) personalised distractors (source: the ‘corpus’ of words they seem to know already). And this is why I leave the first letters in the gaps: I want them to be able to eliminate these distractors and try to ‘guess’ the target vocabulary. This, however, may affect the effectiveness of some automated functionalities provided by Quizlet.
    So, here is the question I have been asking myself for some time now. Should I leave the first letters in the gaps or should I gap the words completely? I am not an academic; I try to read up as much as I can (this blog for example) but I don’t even know if there is any validity to what I am trying to do. Would you leave the first letters or not?
    Much obliged,

    • philipjkerr says:

      Hi Matt
      I haven’t played around much with the ‘Learn’ feature on Quizlet, but here are a few thoughts.
      The automatic generation of the distractors doesn’t work terribly well with Quizlet because of (1) its random nature and (2) the fact that, in most cases, I imagine, the set that it is selecting from will be too small. But this may not be very important … I’ll come to that in a second. Deleting first letters (or any other letters) from the target item ought to increase the ‘involvement load’ and should, therefore, increase learning effectiveness. At least, I certainly don’t think it should detract from the usefulness of the task. But again it may not really make much difference …
      I think that Quizlet have realised that the most significant factor impacting on how much vocabulary students will learn from using flashcards is simply the number of exposures to the learning item that they get. The addition of this ‘Learn’ feature isn’t going to change much of anything, but, by offering a bit more variety in terms of task type, there’s a reasonable hope that learners will be more motivated to spend more time on the app.
      So – would I do what you’re doing? Sometimes, yes, sometimes, no. I think that variety is the key. We should also remember that this kind of deliberate study of vocabulary only takes the learners so far. There comes a point when they need to use the items (and not just by tapping on them or typing them out) – communicatively and meaningfully, if there is to be any real chance of the items entering their active lexicon.
      Does that make sense?

      • Matt Byrski says:

        Thank you, Philip. Yes, it does make sense. In future, I will probably leave the first letters when gapping definitions or example sentences when the phrase/word might be of the kind that will be difficult for the students to recall. As I use Quizlet fairly regularly at the beginning/end of my lessons to revise, this may also be a good test of what my students find hard to guess.
        Regarding the communicative and meaningful use, I have been working on a number of activities that aim to exploit quizlet sets to that end (games, role-play, communication gap activities etc.).
        So, once again, thank you for the advice; I will take it on board.
        As the original post is about automated vocabulary practice, I would like to share the link below with you. I met the founder of this project at an IATEFL Poland conference and it seemed to be something that may combine AI (Artificial Intelligence) with algorithm-based tools. Here’s a link to their promotional video:

  4. philipjkerr says:

    Thanks, Matt.
    As for the link …. the experiences I described in the blogpost come from the work I did with Alphary! And the guy you met at IATEFL Poland is my friend, Daniel Gorin, who first got me involved in this world of flashcards.

  5. […] This shouldn’t surprise anyone. Automation of some of these tasks is extremely difficult (see my post about the automated generation of vocabulary learning materials). Perhaps impossible … but how much error is […]
