Posts Tagged ‘cloze’

Knowble, its developers claim, is a browser extension that will improve English vocabulary and reading comprehension. It also describes itself as an ‘adaptive language learning solution for publishers’. It’s currently in beta and free, and sounds right up my street, so I decided to give it a run.

Knowble reader

Users are asked to specify a first language (I chose French) and a level (A1 to C2): I chose B1, but this did not seem to impact on anything that subsequently happened. They are then offered a menu of about 30 up-to-date news items, grouped into 5 categories (world, science, business, sport, entertainment). Clicking on one article takes you to the article on the source website. There’s a good selection, including USA Today, CNN, Reuters, the Independent and the Torygraph from Britain, the Times of India, the Independent from Ireland and the Star from Canada. A large number of words are underlined: a single click brings up a translation in the extension box. Double-clicking on all other words will also bring up translations. Apart from that, there is one very short exercise (which has presumably been automatically generated) for each article.

For my trial run, I picked three articles: ‘Woman asks firefighters to help ‘stoned’ raccoon’ (from the BBC, 240 words), ‘Plastic straw and cotton bud ban proposed’ (also from the BBC, 823 words) and ‘London’s first housing market slump since 2009 weighs on UK price growth’ (from the Torygraph, 471 words).

Translations

Research suggests that the use of translations, rather than definitions, may lead to more learning gains, but the problem with Knowble is that it relies entirely on Google Translate. Google Translate is fast improving. Take the first sentence of the ‘plastic straw and cotton bud’ article, for example. It’s not a bad translation, but it gets the word ‘bid’ completely wrong, translating it as ‘offre’ (= offer), where ‘tentative’ (= attempt) is needed. So, we can still expect a few problems with Google Translate …

One of the reasons that Google Translate has improved is that it no longer treats individual words as individual lexical items. It analyses groups of words and translates chunks or phrases (see, for example, the way it translates ‘as part of’). It doesn’t do word-for-word translation. Knowble, however, have set their software to ask Google for translations of each word as individual items, so the phrase ‘as part of’ is translated ‘comme’ + ‘partie’ + ‘de’. Whilst this example is comprehensible, problems arise very quickly. ‘Cotton buds’ (‘cotons-tiges’) become ‘coton’ + ‘bourgeon’ (= botanical shoots of cotton). Phrases like ‘in time’, ‘run into’, ‘sleep it off’, ‘take its course’, ‘fire station’ or ‘going on’ (all from the stoned raccoon text) all cause problems. In addition, Knowble are not using any parsing tools, so the system does not identify parts of speech, and further translation errors inevitably appear. In the short article of 240 words, about 10% are wrongly translated. Knowble claim to be using NLP tools, but there’s no sign of it here. They’re just using Google Translate rather badly.
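The difference between word-for-word look-up and phrase-aware look-up can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the glossary is a hypothetical toy dictionary (the entry for ‘as part of’ is my own choice of French rendering, not Google Translate output), and the greedy longest-match strategy is a stand-in for whatever Google actually does, not a description of it.

```python
# Hypothetical toy glossary; entries are illustrative, not real Google Translate output.
GLOSSARY = {
    "as part of": "dans le cadre de",
    "cotton buds": "cotons-tiges",
    "as": "comme",
    "part": "partie",
    "of": "de",
    "cotton": "coton",
    "buds": "bourgeon",
}


def word_by_word(text):
    """Translate each token in isolation (the behaviour described above)."""
    return [GLOSSARY.get(word, word) for word in text.lower().split()]


def longest_match(text, max_len=3):
    """Greedily translate the longest known phrase starting at each position."""
    words = text.lower().split()
    out, i = [], 0
    while i < len(words):
        # Try the longest window first, falling back to a single word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            chunk = " ".join(words[i:i + n])
            if chunk in GLOSSARY or n == 1:
                out.append(GLOSSARY.get(chunk, chunk))
                i += n
                break
    return out


print(word_by_word("cotton buds"))
print(longest_match("cotton buds"))
```

With the toy glossary, the first call yields the two separate words and the second yields the single compound, which is the whole point: a phrase has to be looked up as a phrase before its parts are.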

Highlighted items

NLP tools of some kind are presumably being used to select the words that get underlined. Exactly how this works is unclear. On the whole, it seems that very high frequency words are ignored and that lower frequency words are underlined. Here, for example, is the list of words that were underlined in the stoned raccoon text. I’ve compared them with (1) the CEFR levels for these words in the English Profile Text Inspector, and (2) the frequency information from the Macmillan dictionary (more stars = more frequent). In the other articles, some extremely high frequency words were underlined (e.g. price, cost, year) while much lower frequency items were not.

It is, of course, extremely difficult to predict which items of vocabulary a learner will know, even if we have a fairly accurate idea of their level. Personal interests play a significant part, so, for example, some people at even a low level will have no problem with ‘cannabis’, ‘stoned’ and ‘high’, even if these are low frequency. First language, however, is a reasonably reliable indicator as cognates can be expected to be easy. A French speaker will have no problem with ‘appreciate’, ‘unique’ and ‘symptom’. A recommendation engine that can meaningfully personalize vocabulary suggestions will, at the very least, need to consider cognates.
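As a rough illustration of what ‘considering cognates’ might involve, here is a minimal Python sketch that skips words whose spelling is very close to their L1 translation. Everything in it is an assumption for illustration: the function names, the 0.8 threshold and the use of plain string similarity (via the standard library’s difflib) are mine, not anything Knowble does.

```python
import unicodedata
from difflib import SequenceMatcher


def strip_accents(word):
    """Lowercase a word and remove diacritics so 'symptôme' compares as 'symptome'."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def cognate_score(english, l1_word):
    """Spelling similarity between 0 and 1; a crude proxy for cognate status."""
    return SequenceMatcher(None, strip_accents(english), strip_accents(l1_word)).ratio()


def likely_cognate(english, l1_word, threshold=0.8):
    # The threshold is an arbitrary assumption and would need tuning on real data.
    return cognate_score(english, l1_word) >= threshold


def words_to_underline(candidates, threshold=0.8):
    """candidates maps each English word to its L1 translation; cognates are skipped."""
    return [en for en, l1 in candidates.items() if not likely_cognate(en, l1, threshold)]


print(words_to_underline({
    "appreciate": "apprécier",
    "unique": "unique",
    "symptom": "symptôme",
    "firefighters": "pompiers",
}))
```

Spelling similarity alone cannot tell true cognates from false friends, so a real system would need a curated list or translation data on top of something like this.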

In short, the selection and underlining of vocabulary items, as it currently stands in Knowble, appears to serve no clear or useful function.

Vocabulary learning

Knowble offers a very short exercise for each article. They are of three types: word completion, dictation and drag and drop (see the example). The rationale for the selection of the target items is unclear, but, in any case, these exercises are tokenistic in the extreme and are unlikely to lead to any significant learning gains. More valuable would be the possibility of exporting items into a spaced repetition flash card system.

The claim that Knowble’s ‘learning effect is proven scientifically’ seems to me to be without any foundation. If there has been any proper research, it’s not signposted anywhere. Sure, reading lots of news articles (with a look-up function – if it works reliably) can only be beneficial for language learners, but they can do that with any decent dictionary running in the background.

Similar in many ways to en.news, which I reviewed in my last post, Knowble is another example of a technology-driven product that shows little understanding of language learning.


Last month, I wrote a post about the automated generation of vocabulary learning materials. Yesterday, I got an email from Mike Elchik, inviting me to take a look at the product that his company, WeSpeke, has developed in partnership with CNN. Called en.news, it’s a very regularly updated and wide selection of video clips and texts from CNN, which are then used to ‘automatically create a pedagogically structured, leveled and game-ified English lesson’. Available at the AppStore and Google Play, as well as a desktop version, it’s free. Revenues will presumably be generated through advertising and later sales to corporate clients.

With 6.2 million dollars in funding so far, WeSpeke can leverage some state-of-the-art NLP and AI tools. Co-founder and chief technical adviser of the company is Jaime Carbonell, Director of the Language Technologies Institute at Carnegie Mellon University, described in Wikipedia as one of the gurus of machine learning. I decided to have a closer look.

[Image: en.news home page]

Users are presented with a menu of CNN content (there were 38 items from yesterday alone). These are tagged with broad categories (Politics, Opinions, Money, Technology, Entertainment, etc.) and given a level, ranging from 1 to 5, although the vast majority of the material is at the two highest levels.

[Image: en.news lesson menu]

I picked two lessons: a reading text about Mark Zuckerberg’s Congressional hearing (level 5) and a 9 minute news programme of mixed items (level 2 – illustrated above). In both cases, the lesson begins with the text. With the reading, you can click on words to bring up dictionary entries from the Collins dictionary. With the video, you can activate captions and again click on words for definitions. You can also slow down the speed. So far, so good.

There then follows a series of exercises which focus primarily on a set of words that have been automatically selected. This is where the problems began.

Level

It’s far from clear what the levels (1 – 5) refer to. The Zuckerberg text is 930 words long and is rated as B2 by one readability tool. But, using the English Profile Text Inspector, there are 19 types at C1 level, 14 at C2, and 98 which are unlisted. That suggests something substantially higher than B2. The CNN10 video is delivered at breakneck speed (as is often the case with US news shows). Yes, it can be slowed down, but that still won’t help with some passages, such as the one below:

A squirrel recently fell out of a tree in Western New York. Why would that make news? Because she bwoke her widdle leg and needed a widdle cast! Yes, there are casts for squirrels, as you can see in this video from the Orphaned Wildlife Center. A windstorm knocked the animal’s nest out of a tree, and when a woman saw that the baby squirrel was injured, she took her to a local vet. Doctors say she’s going to be just fine in a couple of weeks. Well, why ‘rodent’ she be? She’s been ‘whiskered’ away and cast in both a video and a plaster. And as long as she doesn’t get too ‘squirrelly’ before she heals, she’ll have quite a ‘tail’ to tell.

It’s hard to understand how a text like this got through the algorithms. But, as materials writers know, it is extremely hard to find authentic text that lends itself to language learning at anything below C1. On the evidence here, there is still some way to go before the process of selection can be automated. It may well be the case that CNN simply isn’t a particularly appropriate source.

Target learning items

The primary focus of these lessons is vocabulary learning, and it’s vocabulary learning of a very deliberate kind. Applied linguists are in general agreement that it makes sense for learners to approach the building of their L2 lexicon in a deliberate way (i.e. by studying individual words) for high-frequency items or items that can be identified as having a high surrender value (e.g. items from the AWL for students studying in an EMI context). Once you get to items that are less frequent than, say, the top 8,000 most frequent words, the effort expended in studying new words needs to be offset against their usefulness. Why spend a lot of time studying low frequency words when you’re unlikely to come across them again for some time … and will probably forget them before you do? Vocabulary development at higher levels is better served by extensive reading (and listening), possibly accompanied by glosses.

The target items in the Zuckerberg text were: advocacy, grilled, handicapping, sparked, diagnose, testified, hefty, imminent, deliberative and hesitant. One of these, ‘grilled’, is listed as A2 by English Vocabulary Profile, but that is with its literal, not metaphorical, meaning. Four of them are listed as C2 and the remaining five are off-list. In the CNN10 video, the target items were: strive, humble (verb), amplify, trafficked, enslaved, enacted, algae, trafficking, ink and squirrels. Of these, one is B1, two are C2 and the rest are unlisted. What is the point of studying these essentially random words? Why spend time going through a series of exercises that practise these items? Wouldn’t your time be better spent just doing some more reading? I have no idea how the automated selection of these items takes place, but it’s clear that it’s not working very well.

Practice exercises

There is plenty of variety of task-type but there are, I think, two reasons to query the claim that these lessons are ‘pedagogically structured’. The first is the nature of the practice exercises; the second is the sequencing of the exercises. I’ll restrict my observations to a selection of the tasks.

1. Users are presented with a dictionary definition and an anagrammed target item which they must unscramble. For example:

existing for the purpose of discussing or planning something     VLREDBETEIIA

If you can’t solve the problem, you can always scroll through the text to find the answer. But the problem is in the task design. Dictionary definitions have been written to help language users decode a word. They simply don’t work very well when they are used for another purpose (as prompts for encoding).

2. Users are presented with a dictionary definition for which they must choose one of four words. There are many potential problems here, not the least of which is that definitions are often more complex than the word they are defining, or they present other challenges. As an example: cause to be unpretentious for to humble. On top of that, lexicographers often need or choose to embed the target item in the definition. For example:

a hefty amount of something, especially money, is very large

an event that is imminent, especially an unpleasant one, will happen very soon

When this is the case, it makes no sense to present these definitions and ask learners to find the target item from a list of four.
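A check for this problem is easy enough to sketch. The Python below is a hypothetical filter of my own (the function names are invented for illustration, and nothing suggests en.news runs anything like it): it flags any definition that contains its own headword as a whole word, so that such items can be kept out of ‘choose the word for this definition’ tasks.

```python
import re


def definition_leaks_target(target, definition):
    """True if the headword appears as a whole word inside its own definition."""
    pattern = r"\b" + re.escape(target.lower()) + r"\b"
    return re.search(pattern, definition.lower()) is not None


def usable_for_matching(entries):
    """Keep only the entries whose definition does not give the answer away."""
    return {
        word: definition
        for word, definition in entries.items()
        if not definition_leaks_target(word, definition)
    }


entries = {
    "hefty": "a hefty amount of something, especially money, is very large",
    "imminent": "an event that is imminent, especially an unpleasant one, will happen very soon",
    "deliberative": "existing for the purpose of discussing or planning something",
}
print(sorted(usable_for_matching(entries)))
```

The sketch only catches the exact headword. Inflected or derived forms in a definition (for example ‘humbled’ in an entry for ‘humble’) would need lemmatisation, which is left out here.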

The two key pieces of content in this product – the CNN texts and the Collins dictionaries – are both less than ideal for their purposes.

3. Users are presented with a box of jumbled words which they must unscramble to form sentences that appeared in the text.

[Image: rearrange words to make sentences]

The sentences are usually long and hard to reconstruct. You can scroll through the text to find the answer, but I’m unclear what the point of this would be. The example above contains a mistake (vie instead of vice), but this was one of only two glitches I encountered.

4. Users are asked to select the word that they hear on an audio recording. For example:

squirreling     squirrel     squirreled     squirrels

Given the high level of challenge of both the text and the target items, this was a rather strange exercise to kick off the practice. The meaning has not yet been presented (in a matching / definition task), so what exactly is the point of this exercise?

5. Users are presented with gapped sentences from the text and asked to choose the correct grammatical form of the missing word. Some of these were hard (e.g. adjective order), others were very easy (e.g. some vs any). The example below struck me as plain weird for a lesson at this level.

________ have zero expectation that this Congress is going to make adequate changes. (I or Me ?)

6. At the end of both lessons, there were a small number of questions that tested your memory of the text. If, like me, you couldn’t remember all that much about the text after twenty minutes of vocabulary activities, you can scroll through the text to find the answers. This is not a task type that will develop reading skills: I am unclear what it could possibly develop.

Overall?

Using the lessons on offer here wouldn’t do a learner (as long as they already had a high level of proficiency) any harm, but it wouldn’t be the most productive use of their time, either. If a learner is motivated to read the text about Zuckerberg, rather than do lots of ‘busy’ work on a very odd set of words with gap-fills and matching tasks, they’d be better advised just to read the text again once or twice. They could use a look-up for words they want to understand and import them into a flashcard system with spaced repetition (en.news does have flashcards, but there’s no sign of spaced practice yet). What’s more, they could check out another news website and read / watch other articles on the same subject (perhaps choosing websites with a different slant to CNN) and get valuable narrow-reading practice in this way.

My guess is that the technology has driven the product here, but without answering the fundamental questions about which words it’s appropriate for individual learners to study in a deliberate way and how this is best tackled, it doesn’t take learners very far.


A personalized language learning programme that is worth its name needs to offer a wide variety of paths to accommodate the varying interests, priorities, levels and preferred approaches to learning of the users of the programme. For this to be possible, a huge quantity of learning material is needed (Iwata et al., 2011: 1): the preparation and curation of this material is extremely time-consuming and expensive (despite the pittance that is paid to writers and editors). It’s not surprising, then, that a growing amount of research is being devoted to the exploration of ways of automatically generating language learning material. One area that has attracted a lot of attention is the learning of vocabulary.

Many simple vocabulary learning tasks are relatively easy to generate automatically. These include matching tasks of various kinds, such as the matching of words or phrases to meanings (either in English or the L1), pictures or collocations, as in many flashcard apps. Doing it well is rather harder: the definitions or translations have to be good and appropriate for learners of the level, and the pictures need to be appropriate. If, as is often the case, the lexical items have come from a text or form part of a group of some kind, sense disambiguation software will be needed to ensure that the right meaning is being practised. Anyone who has used flashcard apps knows that the major problem is usually the quality of the content (whether it has been automatically generated or written by someone).

A further challenge is the generation of distractors. In the example here (from Memrise), the distractors have been so badly generated as to render the task more or less a complete waste of time. Distractors must, in some way, be viable alternatives (Smith et al., 2010) but still clearly wrong. That means they should normally be the same part of speech and true cognates should be avoided. Research into the automatic generation of distractors is well-advanced (see, for instance, Kumar et al., 2015) with Smith et al (2010), for example, using a very large corpus and various functions of Sketch Engine (the best-known corpus query tool) to find collocates and other distractors. Their TEDDCLOG (Testing English with Data-Driven CLOze Generation) system produced distractors that were deemed acceptable 91% of the time. Whilst impressive, there is still a long way to go before human editing / rewriting is no longer needed.
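The two constraints mentioned above (same part of speech, but clearly wrong) can be sketched as a simple filter. This Python example is a toy under stated assumptions: the mini-lexicon, its part-of-speech labels and its synonym sets are all hypothetical, and the approach bears no relation to the corpus-driven method of Smith et al., who drew on Sketch Engine.

```python
import random

# Hypothetical mini-lexicon: word -> (part of speech, near-synonyms that would also be correct).
LEXICON = {
    "hefty": ("adj", {"large", "substantial"}),
    "large": ("adj", {"hefty", "substantial"}),
    "substantial": ("adj", {"hefty", "large"}),
    "imminent": ("adj", set()),
    "hesitant": ("adj", set()),
    "humble": ("adj", set()),
    "testify": ("verb", set()),
    "algae": ("noun", set()),
}


def pick_distractors(target, k=3, seed=0):
    """Pick up to k words with the target's part of speech that are not synonyms of it."""
    pos, synonyms = LEXICON[target]
    pool = sorted(
        word
        for word, (word_pos, _) in LEXICON.items()
        if word != target and word_pos == pos and word not in synonyms
    )
    rng = random.Random(seed)  # seeded so the same item always gets the same distractors
    return rng.sample(pool, min(k, len(pool)))


print(pick_distractors("hefty"))
```

Excluding synonyms keeps the distractors ‘clearly wrong’. Making them genuinely plausible in context is the hard part, and it is exactly where corpus evidence such as collocates comes in.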

One area that has attracted attention is, of course, testing, and in particular some test tasks, such as those in TOEFL (see image). Susanti et al (2015, 2017) were able, given a target word, to automatically generate a reading passage from web sources along with questions of the TOEFL kind. However, only about half of them were considered good enough to be used in actual tests. Again, that is some way off avoiding human intervention altogether, but the automatically generated texts and questions can greatly facilitate the work of human item writers.

[Image: TOEFL task]


Other tools that might be useful include the University of Nottingham AWL (Academic Word List) Gapmaker. This allows users to type or paste in a text, from which items from the AWL are extracted and replaced as a gap. See the example below. It would, presumably, not be too difficult to combine this approach with automatic distractor generation and to create multiple choice tasks.

[Image: Nottingham AWL Gapmaker example]
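The basic mechanism of a gap-maker of this kind is simple to sketch in Python. What follows is my own minimal illustration, not the Nottingham tool’s code: the four-word set is a tiny sample of AWL headwords standing in for the full list, and the function name and blank marker are invented.

```python
import re

# A handful of AWL headwords for illustration; the real list is far longer.
SAMPLE_AWL = {"analyse", "approach", "research", "significant"}


def make_gaps(text, wordlist=SAMPLE_AWL, blank="______"):
    """Replace every word found in the word list with a blank; return the text and the answers."""
    answers = []

    def replace(match):
        word = match.group(0)
        if word.lower() in wordlist:
            answers.append(word)  # keep the original form as the answer key
            return blank
        return word

    gapped = re.sub(r"[A-Za-z]+", replace, text)
    return gapped, answers


gapped, answers = make_gaps("Research suggests a significant effect.")
print(gapped)
print(answers)
```

As written, only exact headword forms are gapped. Handling inflections and family members (‘approaches’, ‘analysis’) would need a lemmatiser or the full word-family lists.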

There are a number of applications that offer the possibility of generating cloze tasks from texts selected by the user (learner or teacher). These have not always been designed with the language learner in mind but one that was is the Android app, WordGap (Knoop & Wilske, 2013). It is described by its developers as a tool that ‘provides highly individualized exercises to support contextualized mobile vocabulary learning …. It matches the interests of the learner and increases the motivation to learn’. It may well do all that, but then again, perhaps not. As Knoop & Wilske acknowledge, it is only appropriate for adult, advanced learners and its value as a learning task is questionable. In the example shown, the target item that has been automatically selected is ‘novel’, a word that features in the list Oxford 2000 Keywords (as do all three distractors), and therefore ought to be well below the level of the users. Some people might find this fun, but, in terms of learning, they would probably be better off using an app that made instant look-up of words in the text possible.

More interesting, in my view, is TEDDCLOG (Smith et al., 2010), a system that, given a target learning item (here the focus is on collocations), trawls a large corpus to find the best sentence that illustrates it. ‘Good sentences’ were defined as those which were short (but not too short, or there is not enough useful context), began with a capital letter and ended with a full stop, had a maximum of two commas, and otherwise contained only the 26 lowercase letters. They had to be at a lexical and grammatical level that an intermediate level learner of English could be expected to understand, and had to be well-formed and without too much superfluous material. All others were rejected. TEDDCLOG uses Sketch Engine’s GDEX function (Good Dictionary Example Extractor, Kilgarriff et al 2008) to do this.
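The surface-level criteria in that definition translate quite directly into code. The Python sketch below is my own reading of them, not the GDEX implementation: the word-count limits are assumed values (the description above gives no numbers), and the criteria about lexical level and well-formedness are not covered at all.

```python
import re


def is_good_sentence(sentence, min_words=5, max_words=20):
    """Apply the surface criteria only; min_words and max_words are assumed values."""
    words = sentence.split()
    if not (min_words <= len(words) <= max_words):
        return False  # too short to give context, or too long
    if sentence.count(",") > 2:
        return False  # more than two commas
    # One capital letter, then only lowercase letters, spaces and commas, then a full stop.
    return re.fullmatch(r"[A-Z][a-z ,]*\.", sentence) is not None


candidates = [
    "The minister made a hefty profit last year.",
    "Made a profit.",
    "She paid 40 dollars for the old book yesterday.",
    "the minister made a hefty profit last year.",
]
for sentence in candidates:
    print(is_good_sentence(sentence), sentence)
```

One consequence of the ‘only lowercase letters’ rule is that any sentence with a proper noun after the first word is rejected, which goes some way to explaining why so few corpus sentences survive filters of this kind.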

My own interest in this area came about as a result of my work in the development of the Oxford Vocabulary Trainer. The app offers the possibility of studying both pre-determined lexical items (e.g. the vocabulary list of a coursebook that the learner is using) and free choice (any item could be activated and sent to a learning queue). In both cases, practice takes the form of sentences with the target item gapped. There are a range of hints and help options available to the learner, and feedback is both automatic and formative (i.e. if the supplied answer is not correct, hints are given to push the learner to do better on a second attempt). Leveraging some fairly heavy technology, we were able to achieve a fair amount of success in the automation of intelligent feedback, but what had, at first sight, seemed a lesser challenge – the generation of suitable ‘carrier sentences’ – proved more difficult.

The sentences which ‘carry’ the gap should, ideally, be authentic: invented examples often ‘do not replicate the phraseology and collocational preferences of naturally-occurring text’ (Smith et al., 2010). The technology of corpus search tools should allow us to do a better job than human item writers. For that to be the case, we need not only good search tools but a good corpus … and some are better than others for the purposes of language learning. As Fenogenova & Kuzmenko (2016) discovered when using different corpora to automatically generate multiple choice vocabulary exercises, the British Academic Written English corpus (BAWE) was almost 50% more useful than the British National Corpus (BNC). In the development of the Oxford Vocabulary Trainer, we thought we had the best corpus we could get our hands on – the tagged corpus used for the production of the Oxford suite of dictionaries. We could, in addition and when necessary, turn to other corpora, including the BAWE and the BNC. Our requirements for acceptable carrier sentences were similar to those of Smith et al (2010), but were considerably more stringent.

To cut quite a long story short, we learnt fairly quickly that we simply couldn’t automate the generation of carrier sentences with sufficient consistency or reliability. As with some of the other examples discussed in this post, we were able to use the technology to help the writers in their work. We also learnt (rather belatedly, it has to be admitted) that we were trying to find technological solutions to problems that we hadn’t adequately analysed at the start. We hadn’t, for example, given sufficient thought to learner differences, especially the role of L1 (and other languages) in learning English. We hadn’t thought enough about the ‘messiness’ of either language or language learning. It’s possible, given enough resources, that we could have found ways of improving the algorithms, of leveraging other tools, or of deploying additional databases (especially learner corpora) in our quest for a personalised vocabulary learning system. But, in the end, it became clear to me that we were only nibbling at the problem of vocabulary learning. Deliberate learning of vocabulary may be an important part of acquiring a language, but it remains only a relatively small part. Technology may be able to help us in a variety of ways (and much more so in testing than learning), but the dreams of the data scientists (who wrote much of the research cited here) are likely to be short-lived. Experienced writers and editors of learning materials will be needed for the foreseeable future. And truly personalized vocabulary learning, fully supported by technology, will not be happening any time soon.


References

Fenogenova, A. & Kuzmenko, E. 2016. ‘Automatic Generation of Lexical Exercises’. Available online at http://www.dialog-21.ru/media/3477/fenogenova.pdf

Iwata, T., Goto, T., Kojiri, T., Watanabe, T. & T. Yamada. 2011. ‘Automatic Generation of English Cloze Questions Based on Machine Learning’. NTT Technical Review Vol. 9 No. 10 Oct. 2011

Kilgarriff, A. et al. 2008. ‘GDEX: Automatically Finding Good Dictionary Examples in a Corpus’. In E. Bernal and J. DeCesaris (eds.), Proceedings of the XIII EURALEX International Congress: Barcelona, 15-19 July 2008. Barcelona: l’Institut Universitari de Lingüística Aplicada (IULA) de la Universitat Pompeu Fabra, 425–432.

Knoop, S. & Wilske, S. 2013. ‘WordGap – Automatic generation of gap-filling vocabulary exercises for mobile learning’. Proceedings of the second workshop on NLP for computer-assisted language learning at NODALIDA 2013. NEALT Proceedings Series 17 / Linköping Electronic Conference Proceedings 86: 39–47. Available online at http://www.ep.liu.se/ecp/086/004/ecp13086004.pdf

Kumar, G., Banchs, R.E. & D’Haro, L.F. 2015. ‘RevUP: Automatic Gap-Fill Question Generation from Educational Texts’. Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, 2015, pp. 154–161, Denver, Colorado, June 4, Association for Computational Linguistics

Smith, S., Avinesh, P.V.S. & Kilgarriff, A. 2010. ‘Gap-fill tests for Language Learners: Corpus-Driven Item Generation’. Proceedings of ICON-2010: 8th International Conference on Natural Language Processing, Macmillan Publishers, India. Available online at https://curve.coventry.ac.uk/open/file/2b755b39-a0fa-4171-b5ae-5d39568874e5/1/smithcomb2.pdf

Susanti, Y., Iida, R. & Tokunaga, T. 2015. ‘Automatic Generation of English Vocabulary Tests’. Proceedings of 7th International Conference on Computer Supported Education. Available online at https://pdfs.semanticscholar.org/aead/415c1e07803756902b859e8b6e47ce312d96.pdf

Susanti, Y., Tokunaga, T., Nishikawa, H. & H. Obari 2017. ‘Evaluation of automatically generated English vocabulary questions’ Research and Practice in Technology Enhanced Learning 12 / 11