Posts Tagged ‘flashcards’

Digital flashcard systems like Memrise and Quizlet remain among the most popular language learning apps. Their focus is on the deliberate learning of vocabulary, an approach described by Paul Nation (Nation, 2005) as ‘one of the least efficient ways of developing learners’ vocabulary knowledge but nonetheless […] an important part of a well-balanced vocabulary programme’. The deliberate teaching of vocabulary also features prominently in most platform-based language courses.

For both vocabulary apps and bigger courses, the lexical items need to be organised into sets for the purposes of both presentation and practice. A common way of doing this, especially at lower levels, is to group the items into semantic clusters (sets with a classifying superordinate, like body part, and a collection of example hyponyms, like arm, leg, head, chest, etc.).

The problem, as Keith Folse puts it, is that such clusters ‘are not only unhelpful, they actually hinder vocabulary retention’ (Folse, 2004: 52). Evidence for this claim may be found in Higa (1963), Tinkham (1993, 1997), Waring (1997), Erten & Tekin (2008) and Barcroft (2015), to cite just some of the more well-known studies. The results, says Folse, ‘are clear and, I think, very conclusive’. The explanation that is usually given draws on interference theory: semantic similarity may lead to confusion (e.g. when learners mix up days of the week, colour words or adjectives to describe personality).

It appears, then, to be long past time to get rid of semantic clusters in language teaching. Well … not so fast. First of all, although most of the research sides with Folse, not all of it does. Nakata and Suzuki (2019), in their survey of more recent research, found that results were more mixed. They found one study which suggested that there was no significant difference in learning outcomes between presenting words in semantic clusters and in semantically unrelated groups (Ishii, 2015). And they found four studies (Hashemi & Gowdasiaei, 2005; Hoshino, 2010; Schneider, Healy, & Bourne, 1998, 2002) where semantic clusters had a positive effect on learning.

Nakata and Suzuki (2019) offer three reasons why semantic clustering might facilitate vocabulary learning: it (1) ‘reflects how vocabulary is stored in the mental lexicon, (2) introduces desirable difficulty, and (3) leads to extra attention, effort, or engagement from learners’. Finkbeiner and Nicol (2003) make a similar point: ‘although learning semantically related words appears to take longer, it is possible that words learned under these conditions are learned better for the purpose of actual language use (e.g., the retrieval of vocabulary during production and comprehension). That is, the very difficulty associated with learning the new labels may make them easier to process once they are learned’. Both pairs of researchers cited in this paragraph conclude that semantic clusters are best avoided, but their discussion of the possible benefits of this clustering is a recognition that the research (for reasons which I will come on to) cannot lead to categorical conclusions.

The problem, as so often with pedagogical research, is the gap between research conditions and real-world classrooms. Before looking at this in a little more detail, one relatively uncontentious observation can be made. Even those scholars who advise against semantic clustering (e.g. Papathanasiou, 2009) acknowledge that the situation is complicated by other factors, especially the level of proficiency of the learner and whether or not one or more of the hyponyms are known to the learner. At higher levels (when it is more likely that one or more of the hyponyms are already, even partially, known), semantic clustering is not a problem. I would add that, at higher levels, the deliberate learning of vocabulary is, on the whole, even less efficient than at lower levels and should be an increasingly small part of a well-balanced vocabulary programme.

So, why is there a problem drawing practical conclusions from the research? In order to have any scientific validity at all, researchers need to control a large number of variables. They need, for example, to be sure that learners do not already know any of the items that are being presented. The only practical way of doing this is to present sets of invented words, and this is what most of the research does (Sarioğlu, 2018). These artificial words solve one problem, but create others, the most significant of which is item difficulty. Many factors impact on item difficulty, and these include word frequency (obviously a problem with invented words), word length, pronounceability and the familiarity and length of the corresponding item in L1. None of the studies which support the abandonment of semantic clusters have controlled all of these variables (Nakata and Suzuki, 2019). Indeed, it would be practically impossible to do so. Learning pseudo-words is a very different proposition to learning real words, which a learner may subsequently encounter or want to use.

Take, for example, the days of the week. It’s quite common for learners to muddle up Tuesday and Thursday. The reason for this is not just semantic similarity (Tuesday and Monday are less frequently confused). They are also very similar in terms of both spelling and pronunciation. They are ‘synforms’ (see Laufer, 2009), which, like semantic clusters, can hinder learning of new items. But, now imagine a French-speaking learner of Spanish studying the days of the week. It is much less likely that martes and jueves will be muddled, because of their similarity to the French words mardi and jeudi. There would appear to be no good reason not to teach the complete set of days of the week to a learner like this. All other things being equal, it is probably a good idea to avoid semantic clusters, but all other things are very rarely equal.

Again, in an attempt to control for variables, researchers typically present the target items in isolation (in bilingual pairings). But, again, the real world does not normally conform to this condition. Leo Selivan (2014) suggests that semantic clusters (e.g. colours) be taught as part of collocations. He gives the examples of red dress, green grass and black coffee, and points out that the alliterative patterns can serve as mnemonic devices which will facilitate learning. The suggestion is, I think, a very good one, but, more generally, it’s worth noting that the presentation of lexical items in both digital flashcards and platform courses is rarely context-free. Contexts will inevitably impact on learning and may well obviate the risks of semantic clustering.

Finally, this kind of research typically gives participants very restricted time to memorise the target words (Sarioğlu, 2018) and tests them in very controlled recall tasks. In the case of language platform courses, practice of target items is usually spread out over a much longer period of time, with a variety of exposure opportunities (in controlled practice tasks, exposure in texts, personalisation tasks, revision exercises, etc.) both within and across learning units. In this light, it is not unreasonable to argue that laboratory-type research offers only limited insights into what should happen in the real world of language learning and teaching. The choice of learning items, the way they are presented and practised, and the variety of activities in the well-balanced vocabulary programme are probably all more significant than the question of whether items are organised into semantic clusters.

Although semantic clusters are quite common in language learning materials, much more common are thematic clusters (i.e. groups of words which are topically related, but include a variety of parts of speech; see below). Researchers, it seems, have no problem with this way of organising lexical sets. By way of conclusion, here’s an extract from a recent book:

‘Introducing new words together that are similar in meaning (synonyms), such as scared and frightened, or forms (synforms), like contain and maintain, can be confusing, and students are less likely to remember them. This problem is known as ‘interference’. One way to avoid this is to choose words that are around the same theme, but which include a mix of different parts of speech. For example, if you want to focus on vocabulary to talk about feelings, instead of picking lots of adjectives (happy, sad, angry, scared, frightened, nervous, etc.) include some verbs (feel, enjoy, complain) and some nouns (fun, feelings, nerves). This also encourages students to use a variety of structures with the vocabulary.’ (Hughes et al., 2019: 25)

 

References

Barcroft, J. 2015. Lexical Input Processing and Vocabulary Learning. Amsterdam: John Benjamins

Erten, I.H., & Tekin, M. 2008. Effects on vocabulary acquisition of presenting new words in semantic sets versus semantically-unrelated sets. System, 36 (3), 407-422

Finkbeiner, M. & Nicol, J. 2003. Semantic category effects in second language word learning. Applied Psycholinguistics 24 (2003), 369–383

Folse, K. S. 2004. Vocabulary Myths. Ann Arbor: University of Michigan Press

Hashemi, M.R., & Gowdasiaei, F. 2005. An attribute-treatment interaction study: Lexical-set versus semantically-unrelated vocabulary instruction. RELC Journal, 36 (3), 341-361

Higa, M. 1963. Interference effects of intralist word relationships in verbal learning. Journal of Verbal Learning and Verbal Behavior, 2, 170-175

Hoshino, Y. 2010. The categorical facilitation effects on L2 vocabulary learning in a classroom setting. RELC Journal, 41, 301–312

Hughes, S. H., Mauchline, F. & Moore, J. 2019. ETpedia Vocabulary. Shoreham-by-Sea: Pavilion Publishing and Media

Ishii, T. 2015. Semantic connection or visual connection: Investigating the true source of confusion. Language Teaching Research, 19, 712–722

Laufer, B. 2009. The concept of ‘synforms’ (similar lexical forms) in vocabulary acquisition. Language and Education, 2 (2): 113 – 132

Nakata, T. & Suzuki, Y. 2019. Effects Of Massing And Spacing On The Learning Of Semantically Related And Unrelated Words. Studies in Second Language Acquisition 41 (2), 287 – 311

Nation, P. 2005. Teaching Vocabulary. Asian EFL Journal. http://www.asian-efl-journal.com/sept_05_pn.pdf

Papathanasiou, E. 2009. An investigation of two ways of presenting vocabulary. ELT Journal 63 (4), 313 – 322

Sarioğlu, M. 2018. A Matter of Controversy: Teaching New L2 Words in Semantic Sets or Unrelated Sets. Journal of Higher Education and Science Vol 8 / 1: 172 – 183

Schneider, V. I., Healy, A. F., & Bourne, L. E. 1998. Contextual interference effects in foreign language vocabulary acquisition and retention. In Healy, A. F. & Bourne, L. E. (Eds.), Foreign language learning: Psycholinguistic studies on training and retention (pp. 77–90). Mahwah, NJ: Erlbaum

Schneider, V. I., Healy, A. F., & Bourne, L. E. 2002. What is learned under difficult conditions is hard to forget: Contextual interference effects in foreign vocabulary acquisition, retention, and transfer. Journal of Memory and Language, 46, 419–440

Selivan, L. 2014. Horizontal alternatives to vertical lists. Blog post: http://leoxicon.blogspot.com/2014/03/horizontal-alternatives-to-vertical.html

Tinkham, T. 1993. The effect of semantic clustering on the learning of second language vocabulary. System 21 (3), 371-380.

Tinkham, T. 1997. The effects of semantic and thematic clustering on the learning of a second language vocabulary. Second Language Research, 13 (2),138-163

Waring, R. 1997. The negative effects of learning words in semantic sets: a replication. System, 25 (2), 261 – 274


At a recent ELT conference, a plenary presentation entitled ‘Getting it right with edtech’ (sponsored by a vendor of – increasingly digital – ELT products) began with the speaker suggesting that technology was basically neutral, that what you do with educational technology matters far more than the nature of the technology itself. The idea that technology is a ‘neutral tool’ has a long pedigree and often accompanies exhortations to embrace edtech in one form or another (see for example Fox, 2001). It is an idea that is supported by no less a luminary than Chomsky, who, in a 2012 video entitled ‘The Purpose of Education’ (Chomsky, 2012), said that:

As far as […] technology […] and education is concerned, technology is basically neutral. It’s kind of like a hammer. I mean, […] the hammer doesn’t care whether you use it to build a house or whether a torturer uses it to crush somebody’s skull; a hammer can do either. The same with the modern technology; say, the Internet, and so on.

Although hammers are not usually classic examples of educational technology, they are worthy of a short discussion. Hammers come in all shapes and sizes and when you choose one, you need to consider its head weight (usually between 16 and 20 ounces), the length of the handle, the shape of the grip, etc. Appropriate specifications for particular hammering tasks have been calculated in great detail. The data on which these specifications are based come from an analysis of the hand size and upper body strength of the typical user. The typical user is a man, and the typical hammer has been designed for a man. The average male hand length is 177.9 mm; that of the average woman is 10 mm shorter (Wang & Cai, 2017). Women typically have about half the upper body strength of men (Miller et al., 1993). It’s possible, but not easy, to find hammers designed for women (they are referred to as ‘Ladies hammers’ on Amazon). They have a much lighter head weight, a shorter handle length, and many come in pink or floral designs. Hammers, in other words, are far from neutral: they are highly gendered.

Moving closer to educational purposes and ways in which we might ‘get it right with edtech’, it is useful to look at the smart phone. The average screen size of these devices has risen in recent years, and is now 5.5 inches, with the market for 6-inch screens growing fast. Why is this an issue? Well, as Caroline Criado Perez (2019: 159) notes, ‘while we’re all admittedly impressed by the size of your screen, it’s a slightly different matter when it comes to fitting into half the population’s hands. The average man can fairly comfortably use his device one-handed – but the average woman’s hand is not much bigger than the handset itself’. This is despite the fact that women are more likely to own an iPhone than men.

It is not, of course, just technological artefacts that are gendered. Voice-recognition software is also very biased. One researcher (Tatman, 2017) has found that Google’s speech recognition tool is 13% more accurate for men than it is for women. There are also significant biases for race and social class. The reason lies in the dataset that the tool is trained on: the algorithms may be gender- and socio-culturally-neutral, but the dataset is not. It would not be difficult to redress this bias by training the tool on a different dataset.

The same bias can be found in automatic translation software. Because corpora such as the BNC or COCA have twice as many male pronouns as female ones (as a result of the kinds of text that are selected for the corpora), translation software reflects the bias. With Google Translate, a sentence in a language with a gender-neutral pronoun, such as ‘S/he is a doctor’ is rendered into English as ‘He is a doctor’. Meanwhile, ‘S/he is a nurse’ is translated as ‘She is a nurse’ (Criado Perez, 2019: 166).

Datasets, then, are often very far from neutral. Algorithms are not necessarily any more neutral than the datasets, and Cathy O’Neil’s best-seller ‘Weapons of Math Destruction’ catalogues the many, many ways in which algorithms, posing as neutral mathematical tools, can increase racial, social and gender inequalities.

It would not be hard to provide many more examples, but the selection above is probably enough. Technology, as Langdon Winner (Winner, 1980) observed almost forty years ago, is ‘deeply interwoven in the conditions of modern politics’. Technology cannot be neutral: it has politics.

So far, I have focused primarily on the non-neutrality of technology in terms of gender (and, in passing, race and class). Before returning to broader societal issues, I would like to make a relatively brief mention of another kind of non-neutrality: the pedagogic. Language learning materials necessarily contain content of some kind: texts, topics, the choice of values or role models, language examples, and so on. These cannot be value-free. In the early days of educational computer software, one researcher (Biraimah, 1993) found that it was ‘at least, if not more, biased than the printed page it may one day replace’. My own impression is that this remains true today.

Equally interesting to my mind is the fact that all educational technologies, ranging from the writing slate to the blackboard (see Buzbee, 2014), from the overhead projector to the interactive whiteboard, always privilege a particular kind of teaching (and learning). ‘Technologies are inherently biased because they are built to accomplish certain very specific goals which means that some technologies are good for some tasks while not so good for other tasks’ (Zhao et al., 2004: 25). Digital flashcards, for example, inevitably encourage a focus on rote learning. Contemporary LMSs have impressive multi-functionality (i.e. they often could be used in a very wide variety of ways), but, in practice, most teachers use them in very conservative ways (Laanpere et al., 2004). This may be a result of teacher and institutional preferences, but it is almost certainly due, at least in part, to the way that LMSs are designed. They are usually ‘based on traditional approaches to instruction dating from the nineteenth century: presentation and assessment [and] this can be seen in the selection of features which are most accessible in the interface, and easiest to use’ (Lane, 2009).

The argument that educational technology is neutral because it could be put to many different uses, good or bad, is problematic because the likelihood of one particular use is usually much greater than another. There is, however, another way of looking at technological neutrality, and that is to look at its origins. Elsewhere on this blog, in post after post, I have given examples of the ways in which educational technology has been developed, marketed and sold primarily for commercial purposes. Educational values, if indeed there are any, are often an afterthought. The research literature in this area is rich and growing: Stephen Ball, Larry Cuban, Neil Selwyn, Joel Spring, Audrey Watters, etc.

Rather than revisit old ground here, this is an opportunity to look at a slightly different origin of educational technology: the US military. The close connection of the early history of the internet and the Advanced Research Projects Agency (now DARPA) of the United States Department of Defense is fairly well-known. Much less well-known are the very close connections between the US military and educational technologies, which are catalogued in the recently reissued ‘The Classroom Arsenal’ by Douglas D. Noble.

Following the twin shocks of the Soviet Sputnik 1 (in 1957) and Yuri Gagarin (in 1961), the United States launched a massive programme of investment in the development of high-tech weaponry. This included ‘computer systems design, time-sharing, graphics displays, conversational programming languages, heuristic problem-solving, artificial intelligence, and cognitive science’ (Noble, 1991: 55), all of which are now crucial components in educational technology. But it also quickly became clear that more sophisticated weapons required much better trained operators, hence the US military’s huge (and continuing) interest in training. Early interest focused on teaching machines and programmed instruction (branches of the US military were by far the biggest purchasers of programmed instruction products). It was essential that training was effective and efficient, and this led to a wide interest in the mathematical modelling of learning and instruction.

What was then called computer-based education (CBE) was developed as a response to military needs. The first experiments in computer-based training took place at the Systems Research Laboratory of the Air Force’s RAND Corporation think tank (Noble, 1991: 73). Research and development in this area accelerated in the 1960s and 1970s and CBE (which has morphed into the platforms of today) ‘assumed particular forms because of the historical, contingent, military contexts for which and within which it was developed’ (Noble, 1991: 83). It is possible to imagine computer-based education having developed in very different directions. Between the 1960s and 1980s, for example, the PLATO (Programmed Logic for Automatic Teaching Operations) project at the University of Illinois focused heavily on computer-mediated social interaction (forums, message boards, email, chat rooms and multi-player games). PLATO was also significantly funded by a variety of US military agencies, but proved to be of much less interest to the generals than the work taking place in other laboratories. As Noble observes, ‘some technologies get developed while others do not, and those that do are shaped by particular interests and by the historical and political circumstances surrounding their development’ (Noble, 1991: 4).

According to Noble, however, the influence of the military reached far beyond the development of particular technologies. Alongside the investment in technologies, the military were the prime movers in a campaign to promote computer literacy in schools.

Computer literacy was an ideological campaign rather than an educational initiative – a campaign designed, at bottom, to render people ‘comfortable’ with the ‘inevitable’ new technologies. Its basic intent was to win the reluctant acquiescence of an entire population in a brave new world sculpted in silicon.

The computer campaign also succeeded in getting people in front of that screen and used to having computers around; it made people ‘computer-friendly’, just as computers were being rendered ‘user-friendly’. It also managed to distract the population, suddenly propelled by the urgency of learning about computers, from learning about other things, such as how computers were being used to erode the quality of their working lives, or why they, supposedly the citizens of a democracy, had no say in technological decisions that were determining the shape of their own futures.

Third, it made possible the successful introduction of millions of computers into schools, factories and offices, even homes, with minimal resistance. The nation’s public schools have by now spent over two billion dollars on over a million and a half computers, and this trend still shows no signs of abating. At this time, schools continue to spend one-fifth as much on computers, software, training and staffing as they do on all books and other instructional materials combined. Yet the impact of this enormous expenditure is a stockpile of often idle machines, typically used for quite unimaginative educational applications. Furthermore, the accumulated results of three decades of research on the effectiveness of computer-based instruction remain ‘inconclusive and often contradictory’. (Noble, 1991: x – xi)

Rather than being neutral in any way, it seems more reasonable to argue, along with (I think) most contemporary researchers, that edtech is profoundly value-laden because it has the potential to (i) influence certain values in students; (ii) change educational values in [various] ways; and (iii) change national values (Omotoyinbo & Omotoyinbo, 2016: 173). Most importantly, the growth in the use of educational technology has been accompanied by a change in the way that education itself is viewed: ‘as a tool, a sophisticated supply system of human cognitive resources, in the service of a computerized, technology-driven economy’ (Noble, 1991: 1). These two trends are inextricably linked.

References

Biraimah, K. 1993. The non-neutrality of educational computer software. Computers and Education 20 / 4: 283 – 290

Buzbee, L. 2014. Blackboard: A Personal History of the Classroom. Minneapolis: Graywolf Press

Chomsky, N. 2012. The Purpose of Education (video). Learning Without Frontiers Conference. https://www.youtube.com/watch?v=DdNAUJWJN08

Criado Perez, C. 2019. Invisible Women. London: Chatto & Windus

Fox, R. 2001. Technological neutrality and practice in higher education. In A. Herrmann and M. M. Kulski (Eds), Expanding Horizons in Teaching and Learning. Proceedings of the 10th Annual Teaching Learning Forum, 7-9 February 2001. Perth: Curtin University of Technology. http://clt.curtin.edu.au/events/conferences/tlf/tlf2001/fox.html

Laanpere, M., Poldoja, H. & Kikkas, K. 2004. The second thoughts about pedagogical neutrality of LMS. Proceedings of IEEE International Conference on Advanced Learning Technologies, 2004. https://ieeexplore.ieee.org/abstract/document/1357664

Lane, L. 2009. Insidious pedagogy: How course management systems impact teaching. First Monday, 14(10). https://firstmonday.org/ojs/index.php/fm/article/view/2530/2303

Miller, A.E., MacDougall, J.D., Tarnopolsky, M. A. & Sale, D.G. 1993. ‘Gender differences in strength and muscle fiber characteristics’ European Journal of Applied Physiology and Occupational Physiology. 66(3): 254-62 https://www.ncbi.nlm.nih.gov/pubmed/8477683

Noble, D. D. 1991. The Classroom Arsenal. Abingdon, Oxon.: Routledge

Omotoyinbo, D. W. & Omotoyinbo, F. R. 2016. Educational Technology and Value Neutrality. Societal Studies, 8 / 2: 163 – 179 https://www3.mruni.eu/ojs/societal-studies/article/view/4652/4276

O’Neil, C. 2016. Weapons of Math Destruction. London: Penguin

Sundström, P. Interpreting the Notion that Technology is Value Neutral. Medicine, Health Care and Philosophy 1, 1998: 42-44

Tatman, R. 2017. ‘Gender and Dialect Bias in YouTube’s Automatic Captions’ Proceedings of the First Workshop on Ethics in Natural Language Processing, pp. 53–59 http://www.ethicsinnlp.org/workshop/pdf/EthNLP06.pdf

Wang, C. & Cai, D. 2017. ‘Hand tool handle design based on hand measurements’ MATEC Web of Conferences 119, 01044 (2017) https://www.matec-conferences.org/articles/matecconf/pdf/2017/33/matecconf_imeti2017_01044.pdf

Winner, L. 1980. Do Artifacts have Politics? Daedalus 109 / 1: 121 – 136

Zhao, Y, Alvarez-Torres, M. J., Smith, B. & Tan, H. S. 2004. The Non-neutrality of Technology: a Theoretical Analysis and Empirical Study of Computer Mediated Communication Technologies. Journal of Educational Computing Research 30 (1 &2): 23 – 55

A personalized language learning programme that is worthy of the name needs to offer a wide variety of paths to accommodate the varying interests, priorities, levels and preferred approaches to learning of the users of the programme. For this to be possible, a huge quantity of learning material is needed (Iwata et al., 2011: 1): the preparation and curation of this material is extremely time-consuming and expensive (despite the pittance that is paid to writers and editors). It’s not surprising, then, that a growing amount of research is being devoted to the exploration of ways of automatically generating language learning material. One area that has attracted a lot of attention is the learning of vocabulary.

[Memrise screenshot]

Many vocabulary learning tasks are relatively simple to generate automatically. These include matching tasks of various kinds, such as the matching of words or phrases to meanings (either in English or the L1), pictures or collocations, as in many flashcard apps. Doing it well is rather harder: the definitions or translations have to be good and appropriate for learners of the level, and the pictures need to be appropriate. If, as is often the case, the lexical items have come from a text or form part of a group of some kind, sense disambiguation software will be needed to ensure that the right meaning is being practised. Anyone who has used flashcard apps knows that the major problem is usually the quality of the content (whether it has been automatically generated or written by someone).

A further challenge is the generation of distractors. In the example here (from Memrise), the distractors have been so badly generated as to render the task more or less a complete waste of time. Distractors must, in some way, be viable alternatives (Smith et al., 2010) but still clearly wrong. That means they should normally be the same part of speech as the target item, and true cognates should be avoided. Research into the automatic generation of distractors is well advanced (see, for instance, Kumar et al., 2015), with Smith et al. (2010), for example, using a very large corpus and various functions of Sketch Engine (the best-known corpus query tool) to find collocates and other distractors. Their TEDDCLOG (Testing English with Data-Driven CLOze Generation) system produced distractors that were deemed acceptable 91% of the time. Whilst impressive, there is still a long way to go before human editing / rewriting is no longer needed.
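
To make those constraints concrete, here is a minimal sketch of the kind of filtering a distractor generator might do. It is not the TEDDCLOG or RevUP algorithm: the part-of-speech, cognate and frequency heuristics follow the discussion above, and the lexicon, frequency ranks and cognate list are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    word: str
    pos: str        # part of speech, e.g. 'noun', 'verb'
    freq_rank: int  # corpus frequency rank (lower = more frequent)

def pick_distractors(target, lexicon, cognates, n=3, max_rank_gap=2000):
    """Choose plausible-but-wrong options for a multiple-choice item.

    Heuristics from the discussion above: same part of speech as the target,
    no true cognates of the learner's L1, and a corpus frequency reasonably
    close to that of the target item.
    """
    candidates = [
        e for e in lexicon
        if e.word != target.word
        and e.pos == target.pos                    # same part of speech
        and e.word not in cognates                 # avoid true cognates
        and abs(e.freq_rank - target.freq_rank) <= max_rank_gap
    ]
    # Prefer the candidates closest in frequency to the target item
    candidates.sort(key=lambda e: abs(e.freq_rank - target.freq_rank))
    return [e.word for e in candidates[:n]]

# Invented, illustrative lexicon
lexicon = [
    Entry('novel', 'noun', 1800), Entry('journey', 'noun', 1900),
    Entry('poem', 'noun', 2400), Entry('verdict', 'noun', 5200),
    Entry('wander', 'verb', 4100), Entry('library', 'noun', 1500),
]
print(pick_distractors(Entry('story', 'noun', 1700), lexicon, cognates={'poem'}))
# -> ['novel', 'journey', 'library']
```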

One area that has attracted attention is, of course, testing, including tasks of the kind found in TOEFL (see the example below). Susanti et al. (2015, 2017) were able, given a target word, to automatically generate a reading passage from web sources along with questions of the TOEFL kind. However, only about half of these were considered good enough to be used in actual tests. Again, that is some way off avoiding human intervention altogether, but the automatically generated texts and questions can greatly facilitate the work of human item writers.

[TOEFL task example]

 

Other tools that might be useful include the University of Nottingham AWL (Academic Word List) Gapmaker. This allows users to type or paste in a text, from which items from the AWL are extracted and replaced with gaps. See the example below. It would, presumably, not be too difficult to combine this approach with automatic distractor generation and so create multiple-choice tasks.

[Example output from the Nottingham AWL Gapmaker]
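
As a rough illustration of the gap-making step (this is not the Nottingham tool itself, and the word list here is a tiny invented stand-in for the real AWL), something along these lines would do the job:

```python
import re

# A tiny, invented stand-in for the Academic Word List (the real AWL has 570 word families)
AWL_SAMPLE = {'analyse', 'concept', 'data', 'research', 'significant', 'theory'}

def make_gaps(text, wordlist):
    """Replace word-list items in the text with numbered gaps; return gapped text and key."""
    key = []

    def gap(match):
        word = match.group(0)
        if word.lower() in wordlist:
            key.append(word)
            return '({}) ________'.format(len(key))
        return word

    return re.sub(r"[A-Za-z']+", gap, text), key

text = 'Recent research suggests the concept is significant, but the data are incomplete.'
gapped, key = make_gaps(text, AWL_SAMPLE)
print(gapped)   # Recent (1) ________ suggests the (2) ________ is (3) ________, ...
print(key)      # ['research', 'concept', 'significant', 'data']
```

Feeding the answer key into a distractor routine like the one sketched earlier would then yield simple multiple-choice items.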

[WordGap screenshot]

There are a number of applications that offer the possibility of generating cloze tasks from texts selected by the user (learner or teacher). These have not always been designed with the language learner in mind, but one that was is the Android app WordGap (Knoop & Wilske, 2013). Its developers describe it as a tool that ‘provides highly individualized exercises to support contextualized mobile vocabulary learning …. It matches the interests of the learner and increases the motivation to learn’. It may well do all that, but then again, perhaps not. As Knoop & Wilske acknowledge, it is only appropriate for adult, advanced learners, and its value as a learning task is questionable. The target item that has been automatically selected in the example is ‘novel’, a word that features in the Oxford 2000 Keywords list (as do all three distractors), and therefore ought to be well below the level of the users. Some people might find this fun, but, in terms of learning, they would probably be better off using an app that made instant look-up of words in the text possible.

More interesting, in my view, is TEDDCLOG (Smith et al., 2010), a system that, given a target learning item (here the focus is on collocations), trawls a large corpus to find the best sentence that illustrates it. ‘Good sentences’ were defined as those which are short (but not too short, or there is not enough useful context), begin with a capital letter and end with a full stop, have a maximum of two commas, and otherwise contain only the 26 lowercase letters. They must be at a lexical and grammatical level that an intermediate level learner of English could be expected to understand, and they must be well-formed and without too much superfluous material. All others were rejected. TEDDCLOG uses Sketch Engine’s GDEX function (Good Dictionary Example Extractor, Kilgarriff et al., 2008) to do this.
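
Those criteria translate fairly directly into code. Here is a minimal sketch of that kind of carrier-sentence filter; it is not the actual GDEX configuration, and the length limits and the ‘known vocabulary’ check are my own stand-ins for the level requirement.

```python
def is_good_carrier(sentence, target, known_words, min_words=6, max_words=20):
    """Rough filter in the spirit of the criteria described above."""
    words = sentence.rstrip('.').replace(',', '').split()
    return (
        target in (w.lower() for w in words)                   # contains the target item
        and min_words <= len(words) <= max_words               # short, but not too short
        and sentence[:1].isupper()                             # begins with a capital letter
        and sentence.endswith('.')                             # ends with a full stop
        and sentence.count(',') <= 2                           # no more than two commas
        and all(c.isalpha() or c in ' ,.' for c in sentence)   # only letters, spaces, commas, full stop
        and all(w.lower() in known_words or w.lower() == target
                for w in words)                                 # within the learner's level
    )

# Invented mini word list standing in for an intermediate learner's known vocabulary
known = {'he', 'made', 'a', 'serious', 'to', 'leave', 'his', 'job'}
print(is_good_carrier('He made a serious decision to leave his job.', 'decision', known))  # True
print(is_good_carrier('Decision!', 'decision', known))                                     # False
```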

My own interest in this area came about as a result of my work in the development of the Oxford Vocabulary Trainer. The app offers the possibility of studying both pre-determined lexical items (e.g. the vocabulary list of a coursebook that the learner is using) and free choice (any item can be activated and sent to a learning queue). In both cases, practice takes the form of sentences with the target item gapped. There is a range of hints and help options available to the learner, and feedback is both automatic and formative (i.e. if the supplied answer is not correct, hints are given to push the learner to do better on a second attempt). Leveraging some fairly heavy technology, we were able to achieve a fair amount of success in the automation of intelligent feedback, but what had, at first sight, seemed a lesser challenge – the generation of suitable ‘carrier sentences’ – proved more difficult.

The sentences which ‘carry’ the gap should, ideally, be authentic: invented examples often ‘do not replicate the phraseology and collocational preferences of naturally-occurring text’ (Smith et al., 2010). The technology of corpus search tools should allow us to do a better job than human item writers. For that to be the case, we need not only good search tools but a good corpus … and some are better than others for the purposes of language learning. As Fenogenova & Kuzmenko (2016) discovered when using different corpora to automatically generate multiple choice vocabulary exercises, the British Academic Written English corpus (BAWE) was almost 50% more useful than the British National Corpus (BNC). In the development of the Oxford Vocabulary Trainer, we thought we had the best corpus we could get our hands on – the tagged corpus used for the production of the Oxford suite of dictionaries. We could, in addition and when necessary, turn to other corpora, including the BAWE and the BNC. Our requirements for acceptable carrier sentences were similar to those of Smith et al (2010), but were considerably more stringent.

To cut quite a long story short, we learnt fairly quickly that we simply couldn’t automate the generation of carrier sentences with sufficient consistency or reliability. As with some of the other examples discussed in this post, we were able to use the technology to help the writers in their work. We also learnt (rather belatedly, it has to be admitted) that we were trying to find technological solutions to problems that we hadn’t adequately analysed at the start. We hadn’t, for example, given sufficient thought to learner differences, especially the role of L1 (and other languages) in learning English. We hadn’t thought enough about the ‘messiness’ of either language or language learning. It’s possible, given enough resources, that we could have found ways of improving the algorithms, of leveraging other tools, or of deploying additional databases (especially learner corpora) in our quest for a personalised vocabulary learning system. But, in the end, it became clear to me that we were only nibbling at the problem of vocabulary learning. Deliberate learning of vocabulary may be an important part of acquiring a language, but it remains only a relatively small part. Technology may be able to help us in a variety of ways (and much more so in testing than learning), but the dreams of the data scientists (who wrote much of the research cited here) are likely to be short-lived. Experienced writers and editors of learning materials will be needed for the foreseeable future. And truly personalized vocabulary learning, fully supported by technology, will not be happening any time soon.

 

References

Fenogenova, A. & Kuzmenko, E. 2016. Automatic Generation of Lexical Exercises Available online at http://www.dialog-21.ru/media/3477/fenogenova.pdf

Iwata, T., Goto, T., Kojiri, T., Watanabe, T. & T. Yamada. 2011. ‘Automatic Generation of English Cloze Questions Based on Machine Learning’. NTT Technical Review Vol. 9 No. 10 Oct. 2011

Kilgarriff, A. et al. 2008. ‘GDEX: Automatically Finding Good Dictionary Examples in a Corpus.’ In E. Bernal and J. DeCesaris (eds.), Proceedings of the XIII EURALEX International Congress: Barcelona, 15-19 July 2008. Barcelona: l’Institut Universitari de Lingüística Aplicada (IULA) dela Universitat Pompeu Fabra, 425–432.

Knoop, S. & Wilske, S. 2013. ‘WordGap – Automatic generation of gap-filling vocabulary exercises for mobile learning’. Proceedings of the second workshop on NLP for computer-assisted language learning at NODALIDA 2013. NEALT Proceedings Series 17 / Linköping Electronic Conference Proceedings 86: 39–47. Available online at http://www.ep.liu.se/ecp/086/004/ecp13086004.pdf

Kumar, G., Banchs, R.E. & D’Haro, L.F. 2015. ‘RevUP: Automatic Gap-Fill Question Generation from Educational Texts’. Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, 2015, pp. 154–161, Denver, Colorado, June 4, Association for Computational Linguistics

Smith, S., Avinesh, P.V.S. & Kilgarriff, A. 2010. ‘Gap-fill tests for Language Learners: Corpus-Driven Item Generation’. Proceedings of ICON-2010: 8th International Conference on Natural Language Processing, Macmillan Publishers, India. Available online at https://curve.coventry.ac.uk/open/file/2b755b39-a0fa-4171-b5ae-5d39568874e5/1/smithcomb2.pdf

Susanti, Y., Iida, R. & Tokunaga, T. 2015. ‘Automatic Generation of English Vocabulary Tests’. Proceedings of 7th International Conference on Computer Supported Education. Available online https://pdfs.semanticscholar.org/aead/415c1e07803756902b859e8b6e47ce312d96.pdf

Susanti, Y., Tokunaga, T., Nishikawa, H. & H. Obari 2017. ‘Evaluation of automatically generated English vocabulary questions’ Research and Practice in Technology Enhanced Learning 12 / 11

 

Every now and then, someone recommends that I take a look at a flashcard app. It’s often interesting to see what developers have done with design, gamification and UX features, but the content is almost invariably awful. Most recently, I was encouraged to look at Word Pash. The screenshots below are from their promotional video.

[Screenshots from the Word Pash promotional video]

The content problems are immediately apparent: an apparently random selection of target items, an apparently random mix of high and low frequency items, unidiomatic language examples, along with definitions and distractors that are less frequent than the target item. I don’t know if these are representative of the rest of the content. The examples seem to come from ‘Stage 1 Level 3’, whatever that means. (My confidence in the product was also damaged by the fact that the Word Pash website includes one testimonial from a certain ‘Janet Reed – Proud Mom’, whose son ‘was able to increase his score and qualify for academic scholarships at major universities’ after using the app. The picture accompanying ‘Janet Reed’ is a free stock image from Pexels and ‘Janet Reed’ is presumably fictional.)

According to the website, ‘WordPash is a free-to-play mobile app game for everyone in the global audience whether you are a 3rd grader or PhD, wordbuff or a student studying for their SATs, foreign student or international business person, you will become addicted to this fast paced word game’. On the basis of the promotional video, the app couldn’t be less appropriate for English language learners. It seems unlikely that it would help anyone improve their ACT or SAT test scores. The suggestion that the vocabulary development needs of 9-year-olds and doctoral students are comparable is pure chutzpah.

The deliberate study of more or less random words may be entertaining, but it’s unlikely to lead to very much in practical terms. For general purposes, the deliberate learning of the highest frequency words, up to about a frequency ranking of #7500, makes sense, because there’s a reasonably high probability that you’ll come across these items again before you’ve forgotten them. Beyond that frequency level, the value of the acquisition of an additional 1000 words tails off very quickly. Adding 1000 words from frequency ranking #8000 to #9000 is likely to result in an increase in lexical understanding of general purpose texts of about 0.2%. When we get to frequency ranks #19,000 to #20,000, the gain in understanding decreases to 0.01%[1]. In other words, deliberate vocabulary learning needs to be targeted. The data is relatively recent, but the principle goes back to at least the middle of the last century when Michael West argued that a principled approach to vocabulary development should be driven by a comparison of the usefulness of a word and its ‘learning cost’[2]. Three hundred years before that, Comenius had articulated something very similar: ‘in compiling vocabularies, my […] concern was to select the words in most frequent use’[3].

I’ll return to ‘general purposes’ later in this post, but, for now, we should remember that very few language learners actually study a language for general purposes. Globally, the vast majority of English language learners study English in an academic (school) context and their immediate needs are usually exam-specific. For them, general purpose frequency lists are unlikely to be adequate. If they are studying with a coursebook and are going to be tested on the lexical content of that book, they will need to use the wordlist that matches the book. Increasingly, publishers make such lists available and content producers for vocabulary apps like Quizlet and Memrise often use them. Many examinations, both national and international, also have accompanying wordlists. Examples of such lists produced by examination boards include the Cambridge English young learners’ exams (Starters, Movers and Flyers) and Cambridge English Preliminary. Other exams do not have official word lists, but reasonably reliable lists have been produced by third parties. Examples include Cambridge First, IELTS and SAT. There are, in addition, well-researched wordlists for academic English, including the Academic Word List (AWL)  and the Academic Vocabulary List  (AVL). All of these make sensible starting points for deliberate vocabulary learning.

When we turn to other, out-of-school learners, the number of reasons for studying English is huge. Different learners have different lexical needs, and working with a general purpose frequency list may be, at least in part, a waste of time. EFL and ESL learners are likely to have very different needs, as will EFL and ESP learners, as will older and younger learners, learners in different parts of the world, learners who will find themselves in English-speaking countries and those who won’t, etc., etc. For some of these demographics, specialised corpora (from which frequency-based wordlists can be drawn) exist. For most learners, though, the ideal list simply does not exist. Either it will have to be created (requiring a significant amount of time and expertise[4]) or an available best-fit will have to suffice. Paul Nation, in his recent ‘Making and Using Word Lists for Language Learning and Testing’ (John Benjamins, 2016), includes a useful chapter on critiquing wordlists. For anyone interested in better understanding the issues surrounding the development and use of wordlists, three good articles are freely available online. These are:

Lessard-Clouston, M. 2012 / 2013. ‘Word Lists for Vocabulary Learning and Teaching’ The CATESOL Journal 24.1: 287- 304

Lessard-Clouston, M. 2016. ‘Word lists and vocabulary teaching: options and suggestions’ Cornerstone ESL Conference 2016

Sorell, C. J. 2013. A study of issues and techniques for creating core vocabulary lists for English as an International Language. Doctoral thesis.

But, back to ‘general purposes’ …. Frequency lists are the obvious starting point for preparing a wordlist for deliberate learning, but they are very problematic. Frequency rankings depend on the corpus on which they are based and, since these are different, rankings vary from one list to another. Even drawing on just one corpus, rankings can be a little strange. In the British National Corpus, for example, ‘May’ (the month) is about twice as frequent as ‘August’[5], but we would be foolish to infer from this that the learning of ‘May’ should be prioritised over the learning of ‘August’. An even more striking example from the same corpus is the fact that ‘he’ is about twice as frequent as ‘she’[6]: should, therefore, ‘he’ be learnt before ‘she’?

List compilers have to make a number of judgement calls in their work. There is not space here to consider these in detail, but two particularly tricky questions concerning the way that words are chosen may be mentioned: Is a verb like ‘list’, with two different and unrelated meanings, one word or two? Should inflected forms be considered as separate words? The judgements are not usually informed by considerations of learners’ needs. Learners will probably best approach vocabulary development by building their store of word senses: attempting to learn all the meanings and related forms of any given word is unlikely to be either useful or successful.

Frequency lists, in other words, are not statements of scientific ‘fact’: they are interpretative documents. They have been compiled for descriptive purposes, not as ways of structuring vocabulary learning, and it cannot be assumed they will necessarily be appropriate for a purpose for which they were not designed.

A further major problem concerns the corpus on which the frequency list is based. Large databases, such as the British National Corpus or the Corpus of Contemporary American English, are collections of language used by native speakers in certain parts of the world, usually of a restricted social class. As such, they are of relatively little value to learners who will be using English in contexts that are not covered by the corpus. A context where English is a lingua franca is one such example.

A different kind of corpus is the Cambridge Learner Corpus (CLC), a collection of exam scripts produced by candidates in Cambridge exams. This has led to the development of the English Vocabulary Profile (EVP) , where word senses are tagged as corresponding to particular levels in the Common European Framework scale. At first glance, this looks like a good alternative to frequency lists based on native-speaker corpora. But closer consideration reveals many problems. The design of examination tasks inevitably results in the production of language of a very different kind from that produced in other contexts. Many high frequency words simply do not appear in the CLC because it is unlikely that a candidate would use them in an exam. Other items are very frequent in this corpus just because they are likely to be produced in examination tasks. Unsurprisingly, frequency rankings in EVP do not correlate very well with frequency rankings from other corpora. The EVP, then, like other frequency lists, can only serve, at best, as a rough guide for the drawing up of target item vocabulary lists in general purpose apps or coursebooks[7].

There is no easy solution to the problems involved in devising suitable lexical content for the ‘global audience’. Tagging words to levels (i.e. grouping them into frequency bands) will always be problematic, unless very specific user groups are identified. Writers, like myself, of general purpose English language teaching materials are justifiably irritated by some publishers’ insistence on allocating words to levels with numerical values. The policy, taken to extremes (as is increasingly the case), has little to recommend it in linguistic terms. But it’s still a whole lot better than the aleatory content of apps like Word Pash.

[1] See Nation, I.S.P. 2013. Learning Vocabulary in Another Language 2nd edition. (Cambridge: Cambridge University Press) p. 21 for statistical tables. See also Nation, P. & R. Waring 1997. ‘Vocabulary size, text coverage and word lists’ in Schmitt & McCarthy (eds.) 1997. Vocabulary: Description, Acquisition and Pedagogy. (Cambridge: Cambridge University Press) pp. 6 -19

[2] See Kelly, L.G. 1969. 25 Centuries of Language Teaching. (Rowley, Mass.: Newbury House) p.206 for a discussion of West’s ideas.

[3] Kelly, L.G. 1969. 25 Centuries of Language Teaching. (Rowley, Mass.: Newbury House) p. 184

[4] See Timmis, I. 2015. Corpus Linguistics for ELT (Abingdon: Routledge) for practical advice on doing this.

[5] Nation, I.S.P. 2016. Making and Using Word Lists for Language Learning and Testing. (Amsterdam: John Benjamins) p.58

[6] Taylor, J.R. 2012. The Mental Corpus. (Oxford: Oxford University Press) p.151

[7] For a detailed critique of the limitations of using the CLC as a guide to syllabus design and textbook development, see Swan, M. 2014. ‘A Review of English Profile Studies’ ELTJ 68/1: 89-96

In December last year, I posted a wish list for vocabulary (flashcard) apps. At the time, I hadn’t read a couple of key research texts on the subject. It’s time for an update.

First off, there’s an article called ‘Intentional Vocabulary Learning Using Digital Flashcards’ by Hsiu-Ting Hung. It’s available online here. Given the lack of empirical research into the use of digital flashcards, it’s an important article and well worth a read. Its basic conclusion is that digital flashcards are more effective as a learning tool than printed word lists. No great surprises there, but of more interest, perhaps, are the recommendations that (1) ‘students should be educated about the effective use of flashcards (e.g. the amount and timing of practice), and this can be implemented through explicit strategy instruction in regular language courses or additional study skills workshops ‘ (Hung, 2015: 111), and (2) that digital flashcards can be usefully ‘repurposed for collaborative learning tasks’ (Hung, ibid.).

However, what really grabbed my attention was an article by Tatsuya Nakata. Nakata’s research is of particular interest to anyone interested in vocabulary learning, but especially so to those with an interest in digital possibilities. A number of his research articles can be freely accessed via his page at ResearchGate, but the one I am interested in is called ‘Computer-assisted second language vocabulary learning in a paired-associate paradigm: a critical investigation of flashcard software’. Don’t let the title put you off. It’s a review of a pile of web-based flashcard programs: since the article is already five years old, many of the programs have either changed or disappeared, but the critical approach he takes is more or less as valid now as it was then (whether we’re talking about web-based stuff or apps).

Nakata divides his evaluation criteria into two broad groups.

Flashcard creation and editing

(1) Flashcard creation: Can learners create their own flashcards?

(2) Multilingual support: Can the target words and their translations be created in any language?

(3) Multi-word units: Can flashcards be created for multi-word units as well as single words?

(4) Types of information: Can various kinds of information be added to flashcards besides the word meanings (e.g. parts of speech, contexts, or audios)?

(5) Support for data entry: Does the software support data entry by automatically supplying information about lexical items such as meaning, parts of speech, contexts, or frequency information from an internal database or external resources?

(6) Flashcard set: Does the software allow learners to create their own sets of flashcards?

Learning

(1) Presentation mode: Does the software have a presentation mode, where new items are introduced and learners familiarise themselves with them?

(2) Retrieval mode: Does the software have a retrieval mode, which asks learners to recall or choose the L2 word form or its meaning?

(3) Receptive recall: Does the software ask learners to produce the meanings of target words?

(4) Receptive recognition: Does the software ask learners to choose the meanings of target words?

(5) Productive recall: Does the software ask learners to produce the target word forms corresponding to the meanings provided?

(6) Productive recognition: Does the software ask learners to choose the target word forms corresponding to the meanings provided?

(7) Increasing retrieval effort: For a given item, does the software arrange exercises in the order of increasing difficulty?

(8) Generative use: Does the software encourage generative use of words, where learners encounter or use previously met words in novel contexts?

(9) Block size: Can the number of words studied in one learning session be controlled and altered?

(10) Adaptive sequencing: Does the software change the sequencing of items based on learners’ previous performance on individual items?

(11) Expanded rehearsal: Does the software help implement expanded rehearsal, where the intervals between study trials are gradually increased as learning proceeds? (Nakata, T. (2011): ‘Computer-assisted second language vocabulary learning in a paired-associate paradigm: a critical investigation of flashcard software’ Computer Assisted Language Learning, 24:1, 17-38)
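
To make criteria (10) and (11) a little more concrete, here is a minimal sketch of one way expanded rehearsal and adaptive sequencing might be implemented in a flashcard program. The interval multiplier and the starting interval are illustrative choices of my own, not values proposed by Nakata.

```python
from datetime import date, timedelta

def next_interval(interval_days, correct, multiplier=2.0, first_interval=1):
    """Expanded rehearsal: stretch the gap after a success, shrink it back after a failure."""
    if not correct:
        return first_interval   # adaptive sequencing: missed items come back soon
    return max(first_interval, round(interval_days * multiplier))

# Illustrative review history for a single flashcard
interval = 1
due = date(2019, 1, 1)
for correct in (True, True, False, True, True):
    due += timedelta(days=interval)
    interval = next_interval(interval, correct)
    print(due, 'correct' if correct else 'wrong', '-> next review in', interval, 'day(s)')
```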

It’s a rather different list from my own (there’s nothing I would disagree with here), because mine is more general and his is exclusively oriented towards learning principles. Nakata makes the point towards the end of the article that it would ‘be useful to investigate learners’ reactions to computer-based flashcards to examine whether they accept flashcard programs developed according to learning principles’ (p. 34). It’s far from clear, he points out, that conformity to learning principles is at the top of learners’ agendas. More than just users’ feelings about computer-based flashcards in general, a key concern will be the fact that there are ‘large individual differences in learners’ perceptions of [any flashcard] program’ (Nakata, T. 2008. ‘English vocabulary learning with word lists, word cards and computers: implications from cognitive psychology research for optimal spaced learning’ ReCALL 20(1), p. 18).

I was trying to make a similar point in another post about motivation and vocabulary apps. In the end, as with any language learning material, research-driven language learning principles can only take us so far. User experience is a far more difficult creature to pin down or to make generalisations about. A user’s reactions to graphics, gamification, loading times and so on are so powerful and so subjective that learning principles will inevitably play second fiddle. That’s not to say, of course, that Nakata’s questions are not important: it’s merely to wonder whether the bigger question is truly answerable.

Nakata’s research identifies plenty of room for improvement in digital flashcards, and although the article is now quite old, not a lot has changed. Key areas to work on are (1) the provision of generative use of target words, (2) the need to increase retrieval effort, (3) the automatic provision of information about meaning, parts of speech, or contexts (in order to facilitate flashcard creation), and (4) the automatic generation of multiple-choice distractors.

In the conclusion of his study, he identifies one flashcard program which is better than all the others. Unsurprisingly, five years down the line, the software he identifies is no longer free, others have changed more rapidly in the intervening period, and who knows what will be out in front next week?