NB This is an edited version of the original review.

Words & Monsters is a new vocabulary app that has caught my attention. There are three reasons for this. Firstly, because it’s free. Secondly, because I was led to believe (falsely, as it turns out) that two of the people behind it are Charles Browne and Brent Culligan, eminently respectable linguists, who were also behind the development of the New General Service List (NGSL), based on data from the Cambridge English Corpus. And thirdly, because a lot of thought, effort and investment have clearly gone into the gamification of Words & Monsters (WAM). It’s to the last of these that I’ll turn my attention first.

WAM teaches vocabulary in the context of a battle between a player’s avatar and a variety of monsters. If users can correctly match a set of target items to definitions or translations in the available time, they ‘defeat’ the monster and accumulate points. The more points you have, the higher you advance through a series of levels and ranks. There are bonuses for meeting daily and weekly goals, there are leaderboards, and trophies and medals can be won. In addition to points, players also win ‘crystals’ after successful battles, and these crystals can be used to buy accessories which change the appearance of the avatar and give the player added ‘powers’. I was never able to fully understand precisely how these ‘powers’ affected the number of points I could win in battle. It remained as baffling to me as the whole system of values with Pokemon cards, which is presumably a large part of the inspiration here. Perhaps others, more used to games like Pokemon, would find it all much more transparent.

The system of rewards is all rather complicated, but perhaps this doesn’t matter too much. In fact, it might be the case that working out how reward systems work is part of what motivates people to play games. But there is another aspect to this: the app’s developers refer in their bumf to research by Howard-Jones and Jay (2016), which suggests that when rewards are uncertain, more dopamine is released in the mid-brain and this may lead to reinforcement of learning, and, possibly, enhancement of declarative memory function. Possibly … but Howard-Jones and Jay point out that ‘the science required to inform the manipulation of reward schedules for educational benefit is very incomplete.’ So, WAM’s developers may be jumping the gun a little and overstating the applicability of the neuroscientific research, but they’re not alone in that!

If you don’t understand a reward system, it’s certain that the rewards are uncertain. But WAM takes this further in at least two ways. Firstly, when you win a ‘battle’, you have to click on a plain treasure bag to collect your crystals, and you don’t know whether you’ll get zero, one, two or three crystals. You are given a semblance of agency, but, essentially, the whole thing is random. Secondly, when you want to convert your crystals into accessories for your avatar, random selection determines which accessory you receive, even though, again, there is a semblance of agency. Different accessories have different power values. This extended use of what the developers call ‘the thrill of uncertain rewards’ is certainly interesting, but whether it is effective is another matter. After quite some time spent ‘studying’, my own reaction to getting no crystals, or an avatar accessory that I didn’t want, was primarily frustration, rather than motivation to carry on. I have no idea how typical my reaction (more ‘treadmill’ than ‘thrill’) might be.

Unsurprisingly, for an app that has so obviously thought carefully about gamification, players are encouraged to interact with each other. As part of the early promotion, WAM is running, from 15 November to 19 December, a free ‘team challenge tournament’, allowing teams of up to 8 players to compete against each other. Ingeniously, it would appear to allow teams and players of varying levels of English to play together, with the app’s algorithms determining each individual’s level of lexical knowledge and therefore the items that will be presented / tested. Social interaction is known to be an important component of successful games (Dehghanzadeh et al., 2019), but for vocabulary apps there’s a huge challenge. In order to learn vocabulary from an app, learners need to put in time – on a regular basis. Team challenge tournaments may help with initial on-boarding of players, but, in the end, learning from a vocabulary app is inevitably and largely a solitary pursuit. Over time, social interaction is unlikely to be maintained, and it is, in any case, of a very limited nature. The other features of successful games – playful freedom and intrinsically motivating tasks (Driver, 2012) – are also absent from vocabulary apps. Playful freedom is mostly incompatible with points, badges and leaderboards. And flashcard tasks, however intrinsically motivating they may be at the outset, will always become repetitive after a while. In the end, what’s left, for those users who hang around long enough, is the reward system.

It’s also worth noting that this free challenge is of limited duration: it is a marketing device attempting to push you towards the non-free use of the app, once the initial promotion is over.

Gamified motivation tools are only of value, of course, if they motivate learners to spend their time doing things that are of clear learning value. To evaluate the learning potential of WAM, then, we need to look at the content (the ‘learning objects’) and the learning tasks that supposedly lead to acquisition of these items.

When you first use WAM, you need to play for about 20 minutes, at which point algorithms determine ‘how many words [you] know and [you can] see scores for English tests such as; TOEFL, TOEIC, IELTS, EIKEN, Kyotsu Shiken, CEFR, SAT and GRE’. The developers claim that these scores correlate pretty highly with actual test scores: ‘they are about as accurate as the tests themselves’, they say. If Browne and Culligan had been behind the app, I would have been tempted to accept the claim – with reservations: after all, it still allows for one item out of 5 to be wrongly identified. But, what is this CEFR test score that is referred to? There is no CEFR test, although many tests are correlated with CEFR. The two tools that I am most familiar with which allocate CEFR levels to individual words – Cambridge’s English Vocabulary Profile and Pearson’s Global Scale of English – often conflict in their results. I suspect that ‘CEFR’ was just thrown into the list of tests as an attempt to broaden the app’s appeal.

English target words are presented and practised with their translation ‘equivalents’ in Japanese. For the moment, Japanese is the only language available, which means the app is of little use to learners who don’t know any Japanese. It’s now well-known that bilingual pairings are more effective in deliberate language learning than using definitions in the same language as the target items. This becomes immediately apparent when, for example, a word like ‘something’ is defined (by WAM) as ‘a thing not known or specified’ and ‘anything’ as ‘a thing of whatever kind’. But although I’m in no position to judge the Japanese translations, there are reasons why I would want to check the spreadsheet before recommending the app. ‘Lady’ is defined as ‘polite word for a woman’; ‘missus’ is defined as ‘wife’; and ‘aye’ is defined as ‘yes’. All of these definitions are, at best, problematic; at worst, they are misleading. Are the Japanese translations more helpful? I wonder … Perhaps these are simply words that do not lend themselves to flashcard treatment?

Because I tested into the app at C1 level, I was not able to evaluate the selection of words at lower levels. A pity. Instead, I was presented with words like ‘ablution’, ‘abrade’, ‘anode’, and ‘auspice’. The app claims to be suitable ‘for both second-language learners and native speakers’. For lower levels of the former, this may be true (but without looking at the lexical spreadsheets, I can’t tell). But for higher levels, however much fun this may be for some people, it seems unlikely that you’ll learn very much of any value. Outside of words in, say, the top 8000 frequency band, it is practically impossible to differentiate the ‘surrender value’ of words in any meaningful way. Deliberate learning of vocabulary only makes sense with high frequency words that you have a chance of encountering elsewhere. You’d be better off reading, extensively, rather than learning random words from an app: words which (for reasons I’ll come on to) you probably won’t actually learn anyway.

With very few exceptions, the learning objects in WAM are single words, rather than phrases, even when the item is of little or no value outside its use in a phrase. ‘Betide’ is defined as ‘to happen to; befall’ but this doesn’t tell a learner much that is useful. It’s practically only ever used following ‘woe’ (but what does ‘woe’ mean?!). Learning items can be checked in the ‘study guide’, which will show that ‘betide’ typically follows ‘woe’, but unless you choose to refer to the study guide (and there’s no reason, in a case like this, that you would know that you need to check things out more fully), you’ll be none the wiser. In other words, checking the study guide is unlikely to betide you. ‘Wee’, as another example, is treated as two items: (1) meaning ‘very small’ as in ‘wee baby’, and (2) meaning ‘very early in the morning’ as in ‘in the wee hours’. For the latter, ‘wee’ can only collocate with ‘in the’ and ‘hours’, so it makes little sense to present it as a single word. This is also an example of how, in some cases, different meanings of particular words are treated as separate learning objects, even when the two meanings are very close and, in my view, are hardly worth learning separately. Examples include ‘czar’ and ‘assonance’. Sometimes, different members of the same word family are treated as separate learning objects (e.g. ‘adulterate’ and ‘adulteration’ or ‘dolor’ and ‘dolorous’); with other words (e.g. ‘effulgence’), only one grammatical form appears to be given. I could not begin to figure out any rationale behind any of this.

All in all, then, there are reasons to be a little skeptical about some of the content. Up to level B2 – which, in my view, is the highest level at which it makes sense to use vocabulary flashcards – it may be of value, so long as your first language is Japanese. But given the claim that it can help you prepare for the ‘CEFR test’, I have to wonder …

The learning tasks require players to match target items to translations / definitions (in both directions), with the target item sometimes in written form, sometimes spoken. Users do not, as far as I can tell, ever have to produce the target item: they only have to select. The learning relies on spaced repetition, but there is no generative effect (known to enhance memorisation). When I was experimenting, there were a few words that I did not know, but I was usually able to get the correct answer by eliminating the distractors (a choice of one from three gives players a reasonable chance of guessing correctly). WAM does not teach users how to produce words; its focus is on receptive knowledge (of a limited kind). I learn, for example, what a word like ‘aye’ or ‘missus’ kind of means, but I learn nothing about how to use it appropriately. Contrary to the claims in WAM’s bumf (that ‘all senses and dimensions of each word are fully acquired’), reading and listening comprehension speeds may be improved, but appropriate and accurate use of these words in speaking and writing is much less likely to follow. Does WAM really ‘strengthen and expand the foundation levels of cognition that support all higher level thinking’, as is claimed?
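To put a rough number on that ‘reasonable chance’, here is a minimal sketch of the arithmetic. It assumes each question offers three options (as I observed) and that a player guesses evenly among whatever options remain once any distractors have been ruled out; the function names are mine, purely for illustration, and have nothing to do with how WAM itself is built.

```python
from fractions import Fraction


def guess_probability(options: int = 3, eliminated: int = 0) -> Fraction:
    """Chance of a correct answer when guessing evenly among the options
    left after ruling out `eliminated` distractors (illustrative only)."""
    remaining = options - eliminated
    if eliminated < 0 or remaining < 1:
        raise ValueError("the correct option, at least, must remain")
    return Fraction(1, remaining)


def expected_correct(unknown_items: int, options: int = 3, eliminated: int = 0) -> Fraction:
    """Expected number of unknown items answered correctly by guessing alone."""
    return unknown_items * guess_probability(options, eliminated)


if __name__ == "__main__":
    # Ten words the player does not know, with 0, 1 or 2 distractors ruled out.
    for eliminated in range(3):
        p = guess_probability(3, eliminated)
        hits = float(expected_correct(10, 3, eliminated))
        print(f"{eliminated} ruled out: chance {p}, roughly {hits:.1f} of 10 'correct'")
```

On these assumptions, a player who knows none of the words still gets about a third of them right, and half of them if a single distractor can be eliminated, so a correct selection is weak evidence that a word has actually been learnt.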

Perhaps it’s unfair to mention some of the more dubious claims of WAM’s promotional material, but here is a small selection, anyway: ‘WAM unleashes the full potential of natural motivation’. ‘WAM promotes Flow by carefully managing the ratio of unknown words. Your mind moves freely in the channel below frustration and above boredom’.

WAM is certainly an interesting project, but, like all the vocabulary apps I have ever looked at, there have to be trade-offs between optimal task design and what will fit on a mobile screen, between freedoms and flexibility for the user and the requirements of gamified points systems, between the amount of linguistic information that is desirable and the amount that spaced repetition can deal with, between attempting to make the app suitable for the greatest number of potential users and making it especially appropriate for particular kinds of users. Design considerations are always a mix of the pedagogical and the practical / commercial. And, of course, the financial. And, like most edtech products, the claims for its efficacy need to be treated with a bucket of salt.


Dehghanzadeh, H., Fardanesh, H., Hatami, J., Talaee, E. & Noroozi, O. (2019) Using gamification to support learning English as a second language: a systematic review, Computer Assisted Language Learning, DOI: 10.1080/09588221.2019.1648298

Driver, P. (2012) The Irony of Gamification. In English Digital Magazine 3, British Council Portugal, pp. 21 – 24 http://digitaldebris.info/digital-debris/2011/12/31/the-irony-of-gamification-written-for-ied-magazine.html

Howard-Jones, P. & Jay, T. (2016) Reward, learning and games. Current Opinion in Behavioral Sciences, 10: 65 – 72

  1. Guy Cihi says:

    Hello Mr. Kerr,
    A friend of mine sent me a link to your article. Wow, thank you very much for your extensive review of WAM. I will keep your observations in mind as we move forward in improving what we have and adding the speaking and pronunciation practice tasks. Vocabulary and pronunciation… the Prince Charming and Cinderella of language education together at last and free to the world.
    If you have not seen it, this short video covers the adaptive aspects of WAM.
    This review paper covers the underpinnings and clarifies the objectives we are seeking to achieve with WAM.
    I wish you could have experienced a group battle together with other players before writing your review. Playing alone is nothing like the multi-faceted experience and motivation that comes with a multiplayer experience. If you are interested, I can add you to a group so you can experience group battles. I would be very interested to hear your impressions and thoughts about it.
    Kind Regards,

    • philipjkerr says:

      Hi Guy, Thanks for the links. The V-Check diagnostic tool is clearly an important part of WAM. You claim that ‘Based on the latest research, V-Check is the most accurate vocabulary test available’. Would you mind letting me have links to that research? I’m intrigued because different vocabulary tests measure different things (with clear differences between recognition and productive knowledge), so I’m not clear how one test can be the ‘most accurate’. Thanks, Philip

      • Guy says:

        To my knowledge, V-Check is only Computer Adaptive Test of vocabulary employing IRT item difficulty analysis to determine each respondent’s vocabulary composition ie, an inventory of specific words known. Two other popular vocabulary CATs I am aware of; the VLT and VST, operate on the false assumption that item frequency is meaningfully correlated to item difficulty – it is not.

        Is there another vocabulary test you are thinking about? Perhaps I have overlooked something.

        This link offers several research and review papers that you may find helpful.


  2. philipjkerr says:

    Thanks for the links to the various papers.
    I think there is a significant difference between the claim that ‘V-Check is the most accurate vocabulary test available’ and your clarification that ‘V-Check is only Computer Adaptive Test of vocabulary employing IRT item difficulty analysis to determine each respondent’s vocabulary composition’. Research into computer-adaptive tests is in its infancy, as far as I understand. I have found the work of Benjamin Kremmel, beginning with his doctoral thesis on the ‘Development and initial validation of a diagnostic computer-adaptive profiler of vocabulary knowledge’, very interesting.
    IRT may well be a more reliable approach than simple frequency counts, but, as one of the papers by Brent Culligan states, it is ‘a probabilistic model’: the actionable data it produces are means, but not a reliable, precise number of ‘unknown words’ for any given individual learner. Even less can it precisely identify what those words might be. It can only be probabilistic, given individual learner differences, compounded by the difficulties of defining what exactly a ‘word’ is and what we mean when we say someone ‘knows’ it.
    In one of the WAM documents, you give the example of an intermediate learner with ‘a total vocabulary size of 3,900 words’ who is ‘typically missing 236 words from the essential first 2,000 most frequent words of English’. You then go on to say that ‘Words & Monsters will teach those 236 words first’. But it is theoretically and practically impossible to know precisely what those words are for any given individual. In one of your own papers, you explain that you establish the words that need to be learnt by comparing an individual’s ‘lexit of ability’ against a frequency-ranked database of items in a particular corpus, allowing you to ‘determine which high-frequency words the learner likely knows and does not know in each of the sub domains’. The key word is ‘likely’. However, elsewhere, you suggest a degree of certainty that is not warranted. This appears to me to be an overstated claim.
    If we then consider what precisely these 236 words might be, we run into other difficulties. Most lists of high frequency words do not differentiate the different meanings of these words (and since they are high frequency words, they are typically polysemous); some lists do not even differentiate parts of speech of the same combinations of letters. So, the question is: do we teach words or word senses? Do we teach the associated syntactic and collocational patterns (without which learners can’t really be said to ‘know’ the items)? And is it possible to do these things with a flashcard system? Again, my impression is that you are making claims that are unwarranted, that you are offering a veneer of science to provide simple solutions (your solutions) to complicated issues that are not amenable to simple solutions.
    One more example … One of the papers you link to (McClean, Hogg & Rush, 2013) shows that digital flashcard use can lead to gains in lexical knowledge, as measured by Nation’s Vocabulary Size Test (VST). The flashcards they investigated used your company’s WordEngine. Their finding was that weekly use of these flashcards ‘may increase vocabulary sizes’. The key word is ‘may’. The authors are careful with their findings: it is ‘plausible’, they say, that the gains in vocabulary knowledge were due to use of WordEngine. Interestingly, they also note from another study that 40% of students disliked using WordEngine (Agawa, Black & Herriman, 2011). You also link to this paper. These researchers also looked at WordEngine for TOEIC students. Their conclusion was ‘What remains to be demonstrated is under which conditions and to what extent Word Engine can be said to objectively lead to increases in TOEIC scores’. This is all rather different from your claim in a ‘Words & Monsters’ document that ‘The conclusion in all cases is that Lexxica’s digital flashcards significantly increase average test scores higher than the other approaches that were tested.’ The research findings you quote are far more hedged and much less clear-cut than you suggest. They do not, in any case, refer to ‘Words & Monsters’, but to an earlier product, and even if the technology at the back of the product is the same or similar, other aspects are very different, meaning that we could not apply any findings about one to the other.
    I think you have made very selective use of the research that you provide in support of your product. In some cases, you have chosen to ignore parts of the research or you seem to have misrepresented it.

  3. Guy Cihi says:

    Mr. Kerr,
    A teacher wrote to me today and mentioned that they had seen your review of Words & Monsters. That is unfortunate. I was hoping to avoid this conversation because you are a truculent person. Sadly, this cannot be avoided because there are likely going to be more misunderstandings caused by your cursory and inaccurate review.

    Lexxica is a data-based services provider. All user progress data is retained on our servers for 18 months following the last online interaction, at which time the data is permanently deleted. It should not surprise you that it was easy for me to locate your Words & Monsters player account and see that you spent less than 6 minutes with the game.

    If you had spent even just 20 minutes playing Words & Monsters, you would have experienced how VCheck adapts to each new player’s ability and you would have been able to experience all of the features that the game freely provides; well, that is if you made some small effort to look around and kick the tires.

    The VCheck adaptive test process takes about 15 minutes to complete. Since you spent less than 6 minutes playing Words & Monsters, and no time whatsoever with the free teacher’s LMS program that accompanies Words & Monsters, it is clear that you do not have enough experience to write a serious review of Words & Monsters.

    Under the circumstances, I ask that you remove the review from your blog.

    I am,
    Guy Cihi
