Word Frequency Issues for Language Learners

Peter and Jane book 1a Play with us
One well-known and practical application of word frequency analysis, but is the word list still current?

I wanted to address some issues with the whole question of using word frequency analysis when learning languages. It is obviously a good idea to use frequency studies if they are available.

They can function as checklists to ensure that our courses (whether we are users or compilers of courses) give complete coverage of the most frequent words. I do think that a course which claims to give 2,000 words lets its users down if a noticeable percentage of, say, the top 1,000 words are still missing, while students are expected to learn words for wainscotting, walruses and woodpeckers.

One big problem with frequency dictionary analysis and word count – especially when comparing between languages or methods – is what does it mean? If we are talking about uninflected languages then the number of individual words is smaller. The original poster referred to Italian, and here it is a problem, because every single verb has umpteen forms, so is that one word or is that umpteen words?

If you use a machine to collate the frequency, then “has”, “have” and “had” will all show up as different words. Should it be three or one? In those cases where a noun has 12 separate declined forms in Czech or Polish, is that 12 words or one word? Or is it something in between, with the forms guessable out of usual paradigms being counted as the same, but the irregular parts being considered separate?
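To make the difference concrete, here is a minimal Python sketch of the two ways of counting. It is only an illustration: the tiny `LEMMAS` table and the sample sentence are my own made-up assumptions, and a real study would need a proper morphological dictionary for the language in question.

```python
from collections import Counter
import re

# Hypothetical, hand-made table mapping inflected forms to a headword.
# A real frequency study would need a full lemmatizer for the language.
LEMMAS = {"has": "have", "had": "have", "having": "have"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def count_forms(text):
    """Count every surface form separately, as a naive machine count would."""
    return Counter(tokenize(text))

def count_lemmas(text, lemmas=LEMMAS):
    """Count headwords, folding known inflected forms into one entry."""
    return Counter(lemmas.get(tok, tok) for tok in tokenize(text))

sample = "She has a book. They have a book. He had a book."
forms = count_forms(sample)
lemmas = count_lemmas(sample)

print(forms["has"], forms["have"], forms["had"])  # 1 1 1
print(lemmas["have"])                             # 3
print(len(forms), len(lemmas))                    # 8 6
```

The same text yields eight "words" by one method and six by the other, which is exactly why a headline vocabulary figure means little until you know the collation rules behind it.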

And are words like “jack” or “rose” which have so many individual meanings counted as one word or as ten or so words? If a machine does it, the count is objective, but again not true to the substance of the matter. If a human intervenes, then subjectivity enters the frame and the way one person collates it may give very different results from the way another person would do it.

That’s why I take statements like “80% of the words used are in 10,000 words” with a pinch of salt. Assuming the same rules for collation of word variation under headwords apply through the list, it is probably proper to parrot Pareto and say that 20% of the vocabulary gives 80% of the effect, but that is comparing relative numbers to relative numbers, which is probably wiser, and more likely to hold good when comparing different languages together, than comparing absolute numbers to relative numbers.
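For those who like to see the arithmetic, here is a rough Python sketch of how such a coverage figure could be worked out from a frequency count, both for an absolute cut-off (top N words) and for a relative one (top share of the vocabulary). The function names and the toy counts are my own assumptions, chosen so the numbers are easy to follow; they do not come from any real frequency study.

```python
from collections import Counter

def coverage(counts, top_n):
    """Fraction of all running words accounted for by the top_n most frequent entries."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / total

def coverage_of_share(counts, share):
    """Relative version: coverage given by the top `share` (e.g. 0.2) of the vocabulary."""
    top_n = max(1, round(len(counts) * share))
    return coverage(counts, top_n)

# Made-up toy counts: 10 distinct words, 100 running words in total.
toy = Counter({
    "the": 50, "of": 30, "and": 5, "to": 4, "in": 3,
    "cat": 3, "dog": 2, "walrus": 1, "wainscot": 1, "woodpecker": 1,
})

print(coverage(toy, 2))             # absolute cut-off: top 2 words
print(coverage_of_share(toy, 0.2))  # relative cut-off: top 20% of the vocabulary
```

Note that the result depends entirely on what was counted as one entry in `counts`, so the same text collated by forms or by headwords will give different coverage for the same absolute N.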

And then another problem is the way word frequency changes over time. An even more subtle marker of how languages change over time than the incursion of new words (which obviously don’t show up at all in the old frequency lists) is the change, not so much in the meaning as in the fashionability, of words which are there in the language the whole time. Some of these changes are glaringly obvious – nobody uses “comrade” in Russia as much as it was used 30 years ago. But in fact it applies to a greater or lesser degree to every word we use.

Frequency studies, then, ironically, need to be carried out far more frequently! I answered a question the other day on frequency lists for what used to be Serbo-Croat. Quite a lot of change has entered that language, especially as emerging ex-Yugoslav states have sought to distinguish their language from other similarly speaking states, including new letters in their languages and emphasising either regional words or regional pronunciations. There is of course a degree of artificiality about all that, but it cannot fail to influence the language really spoken by people.

However, from my research the last available frequency studies for Serbo-Croat were from the 1960s. They are not available online for free; you have to pay to get them, which I did not do. I can therefore only imagine how erroneous they must be by now. The half-life of usefulness of such a study is probably ten years, and these were done 50 years ago, in a different country constellation, under a different political regime, with a different world going on around it, with different things to do than now and different ways to live. It is high time for new frequency studies to be made in that language, but also in many other languages. I just wonder how old by now the frequency studies are which Ladybird books admirably used to create the Key Words Reading Scheme so well known and loved by children in the English-speaking world? Can anybody tell me that?

Top 30 Languages to learn for 2050

Map showing countries and autonomous subdivisi...
The Turkic linguation - to a greater or lesser extent mutually intelligible languages. However, they are often not the preferred business languages of their regions, hence only 12th place in this economic utility-based prediction.

Here are my 2050 predictions, originally shared on http://www.how-to-learn-any-language.com :

1. Chinese (all types)
2. English (all types)
3. Arabic (all dialects)
4. Russian
5. Spanish (all types)
6. Japanese
7. German
8. French
9/10. Portuguese and Korean (if there is Korean unification, Korean takes the higher slot)
11. Italian
12. Turkish and mutually intelligible forms of Turkic

Question on lexical sufficiency

Joseph Conrad
Joseph Conrad Korzeniowski - the ultimate benchmark in mastery of an acquired language is surely that of having added to its artistic literature?

Reader (and poster) Bill_Sage667 from How-To-Learn-Any-Language.com’s forum sent me the following question and kindly agreed to a public answer here:

Dunno whether u’ll be able to find the time to reply to this, 1 in a million chance lol……but I’ll write out my questions anyway lol

You said something about 15,000 words needed in order to achieve a good degree level in Russian. Are imperfective and perfective verbs considered separate words, as well as adjectives and verbs under the same lexemes (e.g. беремменость, беремменая, беремменеть, забеременнеть) when you were estimating the number of words?

And what if someday I want to attain the proficiency of an educated native speaker (might take me 20 yrs but oh well)? How many words am I supposed to know (for active and passive knowledge)? For Russian, that is. btw thanks for releasing the Gold List Method to the public for free!

Firstly, Bill, be careful about the number of ‘m’s and ‘n’s you have in those pregnancy-related words. You have too many ‘m’s and not enough ‘n’s. I’ll leave you to review that one.

You’re very welcome about the Goldlist. As I say in the section I wrote in syzygycc’s The Polyglot Project, I’m just paying forward the favours I got from so many people when I was a young learner.

In my opinion 15,000 words, as long as they are properly selected, are perfectly adequate and in the headlist you would use all the forms initially as separate forms (but not the various conjugations and case endings, only the so-called ‘dictionary forms’) and you could soon condense them on distillation.

If you use the frequency dictionary I am selling on www.oioioio.com you will be able to focus on commoner words first. Within the first 10,000 words you do get words that are already pretty specialised, which you wouldn’t use more than maybe once a month or so even if you were a native, and so it continues over the next 5,000 as well. You’ll find 15,000 enough to read the great novels comfortably and to appreciate the poetry of Akhmatova, Tsvetaeva and even Strogonova (the last of whom you will find uniquely published in this blog as a ‘page’. She is no poorer a poet than these well-known ones, only far less known.)

I would also like to draw people’s attention to something else I wrote about the 15,000 word ‘marathon’ in a thread over on the HTLAL forum:

What this Gladwell character [I’m referring to upstream discussion of someone who said you must have 10,000 hours of learning to become fully fluent, like a native – a claim almost unanimously rejected by every serious linguist and polyglot I know other than those who teach languages privately, as this idea is grist to their mill] needs to bear in mind is the Pareto rule. If it were true (which I dispute) that you need 10,000 hours to become as native (although how this deals with your accent is anyone’s guess) then you could get 80% of it in 20% of the time. That means you’d need to have 2,000 hours of study to get to 80% of native fluency. Since that’s ludicrously overcautious, I’d suggest that the 10,000 hour target for full native fluency is overcautious.

The fact is, a person could be like Konrad Korzeniowski (Joseph Conrad) and already writing ground-breaking literature in the language he or she had learned and still have a strong enough accent to provoke politely meant but annoying compliments on the quality of his language by native speakers.

In the end you just have to accept what English speakers accepted for their own language in the main long ago – that as long as it doesn’t hinder comprehension, a foreigner’s accent in English is just as valid as a “native” accent. This is easily accepted by multi-national or mega-regional languages like Spanish, Russian, Chinese, etc, but in places like Poland as there is largely only one way of speaking, the bar is raised for their own language.

So in fact that means that the same n-thousand hours done by an Englishman in Russian could have the Russians noticing very little different about the foreigner, especially if he has a bad haircut. Whereas if he has a really bad haircut and the same n-thousand hours of Czech, the reaction will probably be “he looks like one of us, but our language is difficult and so we must forgive the way he sounds, although obviously we are frank and friendly people so we will tell him to his face at regular intervals that his Czech sucks bigtime.”

Given this subjectivity, I decided long ago never to walk in anyone’s linguistic shadow, but simply to set amounts of words as targets. 15,000 words is in language learning, to my mind, what the marathon is in athletics. If you’re fit, you can do it with patience and training. And if you can do it, nobody can say you’re not fit.

There are longer races, there are tougher events. But the marathon is the ‘classic’ and the marathon runner knows that it’s really a competition against yourself and not really against the runners alongside. Even people coming in at six hours are clapped and get a medal. So should language learning be.

If this article is of interest you can look up the thread, as plenty of people have some interesting stuff to say, both about the 10,000 hours nonsense and the number of words needed. I get into a discussion with “Lingua Frankly” blogger Niall Beag (known as Cainntear) on whether the Pareto rule isn’t just a number like 10,000 with no real basis for being a law. There are also those who are ready to stand up for the honour of the number 10,000 and tell the detractors of 10,000 hours to mind their jolly manners. Excellent thread.

I’m going to add more thoughts there today.

A Question about the Russian Future by Shannon

One viewer on the YouTube channel, a lady called Shannon, sent me the following question:

Hello,

Could you please tell me the English equivalent for the Russian simple and compound future tense.

I think I’ve understood both past tenses, but the future tense is something I’m still struggling to get my head around.

Regards,
Shannon

The problem is that they are not really tenses, they are aspects of a single future tense.

Now in English we have aspects, but we don’t always use a verb to show the aspects, sometimes we use other words in the sentence.

Let’s take the example of “yest’/s’yest’”. If I say in Russian “Ya s’yem ves’ …” then the expected word afterwards might be “tort” – I will eat the whole cake.

If a Russian says “Ya budu yest’ ves’…” then the rest of the sentence that suggests itself is “den’ ” – I will eat all day.

In this case in English if you can replace “eat” with “eat up” then you know that it’s a perfective aspect. In English it’s not incorrect to say “I will eat the whole cake”, or you can also stress the perfective nature of that action (although it won’t have a very perfecting effect on the figure) by saying “I will eat the cake up”.

Contrast that with the second sentence, “I will eat all day”. You can’t say “I will eat up all day”; it becomes meaningless. You can, of course, say something like “All day long tomorrow, I’ll be eating up my fussy children’s left-overs” – in Russian this repetitive future performance of a perfective action would call for bringing in the iterative suffix. “Budu doyedyvat’” sounds a little clumsy but would give that kind of meaning, the “yv” part of that verb being the iterative suffix.

So in the case of a sentence where in English we could use a simple verb or a phrasal verb, especially a phrasal verb where the sense involves finishing something (eat up, do in, beat up, bring in, etc.), we can get a good idea of whether to use a perfective or imperfective future aspect in Russian by asking ourselves whether the phrasal verb is just as good as, if not better than, the simple verb, as in the above “eat the cake up”.

What about cases where you don’t have a phrasal verb indicating completion to hand? Well, sometimes there are aspectival pairs in English that we don’t even realise are aspectival pairs because this is almost subliminal in our language and not explicit as in Slavonic. So I could give you two sentences:

1. I will fish all day tomorrow

2. I will catch many fish tomorrow.

Which is future imperfective? That’s right, the first. Budu lovit’ ryby ves’ den’ zavtra. The second is perfective. Tomorrow I will not just fish, I will catch many fish. Poymayu mnogo ryb, zavtra.

How about this one:

1. He will speak to me about the changes this afternoon.

2. He will tell me about the changes this afternoon.

In which of these am I expressing subliminally that I’m not necessarily expecting complete information? That’s right, the first. In the second, I expect the transmission of complete facts, not just blah-blah. So speak and tell are an aspectival pair.

And sure enough, you find the same in govorit’/skazat’ in Russian. You never hear “on budet skazat’” – the closest is if you make it iterative: “on budet skazyvat’ mne raznye veshchi”, he will be telling me various things. He will, in other words, repeatedly perform the perfective action of transferring orally various complete pieces of information. He will speak to me about the changes – “on budet govorit’ so mnoy o peremenakh” – means that I’m focussing mentally on the fact that he is going through the motions of informing me, regardless of whether any actual units of meaningful information, any ‘whole story’, is transferred to me in the process. “On skazhet mne o peremenakh posle obeda”, on the other hand, means that I’m expecting to hear the whole caboodle from him after lunch.

One of the best ways to understand this is by looking at what we mean in English when we differentiate “until” and “by”. Most languages have a single word for this pair, and in Slavic it’s aspect which gives away which one is needed. Russians and Poles say “do”, Germans have “bis”, but we have two words and we can’t understand why foreigners are always muddling up “until” and “by”.

So you’ll hear Slavs saying “I need you to write the report until Thursday”. At this, you might say “what happens after that, then, does someone else take over?” This sentence in English contains no markers that getting it done before then is required – on the contrary the marker in “until” rather means just keep on going up to a certain time point, and finishing doesn’t enter into it.

So Thursday comes and you are asked for the report, and you hand in a huge 100 page opus and immediately the boss asks “Where’s the Executive Summary?” And so you say “There’s no Executive Summary – how can there be one if the report isn’t finished?” “But I asked you to write the report until Thursday!” “I did! I was writing it all the time, only taking short breaks for food and sleep. That’s why the thing is 100 pages long. But you didn’t tell me it had to be done BY Thursday!”

The boss doesn’t understand this, as to him “until” and “by” are synonyms and not markers of aspect, and promptly sacks the Employee for over-correct use of English.

So you can see from this example that if he had really meant “until”, in Russian he would have used a future imperfective, “Budete pisat’ …”. For the meaning “by” he would have used a Russian future perfective, “Napishete”.

I hope that helps you get a grip on the idea. If it has, then that is a milestone on your journey towards knowing Russian.

Who is this mystery customer?

Countries where the Russian language is spoken.
The Russian Linguation

The following review can still be read for Derek Offord’s “Using Russian – A Guide to Contemporary Usage” on Amazon.co.uk (not the American Amazon, and I really don’t understand why they don’t carry these reviews over; when I want to write for only the UK or only the US I shall forget about the internet altogether!). As it was way back in 2001, I seem to have lost the accreditation for the review along the way. At first it was under my name, but at some stage they must have had a technical blip and the older reviews became “A Customer”. But it’s mine, well enough. I don’t know if my style has changed much in ten years.

36 of 37 people found the following review helpful:
5.0 out of 5 stars
This is essential reading for those doing a Russian degree.
28 Sep 2001
By A Customer

This review is from: Using Russian: A Guide to Contemporary Usage (Paperback)

I bought Using Russian when I was browsing in a bookshop for another language, as I already speak Russian, but when I looked at a few pages it immediately appealed as an excellent update to the way the language has developed since I did my degree. Sections in the book refer to different problems that face the English speaker in particular, such as faux amis. There are also sections on homonyms and other confusing aspects and they act rather like a checklist of what you need to have got right in your head in order not to make too many ‘howlers’ in translations or in conversation.

One particular plus in this book and, as I found out, in the whole series of ‘Using’ books that this is part of, is the focus on register. If there is one thing that separates the wheat from the chaff among language students, it is the understanding and application of the idea of register, and this applies to Russian perhaps more than most European languages, as this is a language in which not only the vocabulary, but also the syntax, grammar and phonetics are all subject to complex nuances. This book was not available when I needed it. Now that it is I urge you to make use of it. It is the book about Russian that I would have liked to have written myself. If I thought there was demand for it, I’d offer to do a sister volume for Polish.

In any event it made me go out and buy the sister volumes already in existence for French, German and Spanish. They are of a similar quality to this volume; the weakest is probably the German one, and the Spanish one I would put as second favorite. It can be read cover to cover, or simply dipped into as a work of reference.

It is not material for learning the language from scratch, but would be a very useful second step after completing any of the standard self-instruction books such as the Colloquial series, the Teach Yourself series or the Linguaphone course.

Both A-level and degree level students of the language will profit from it and find it enjoyable because of its good presentation and readable style.