Word Frequency Issues for Language Learners

Peter and Jane book 1a Play with us

One well-known and practical application of word frequency analysis. But is the word list still current?

I wanted to address some issues with the whole question of using word frequency analysis when learning languages. It is obviously a good idea to use frequency studies if they are available.

They can function as checklists to ensure that our courses (whether we are users or compilers of courses) fully cover the most frequent words. I do think that a course which claims to give 2,000 words lets its users down if a noticeable percentage of, say, the top 1,000 words is still missing while students are expected to learn the words for wainscoting, walruses and woodpeckers.

One big problem with frequency dictionary analysis and word counts, especially when comparing between languages or methods, is what the count actually means. If we are talking about uninflected languages then the number of individual word forms is smaller. The original poster referred to Italian, and here it is a problem, because every single verb has umpteen forms, so is that one word or is that umpteen words?

If you use a machine to collate the frequency, then “has”, “have” and “had” will all show up as different words. Should it be three or one? In those cases where a noun has 12 separate declined forms in Czech or Polish, is that 12 words or one word? Or is it something in between, with the forms guessable out of usual paradigms being counted as the same, but the irregular parts being considered separate?
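To make the "three or one?" question concrete, here is a minimal Python sketch that counts the same text both ways: once by raw surface form and once with forms folded under a headword. The `LEMMAS` table, the `tokenize` helper and the sample sentence are my own assumptions for illustration; a real study would need a proper morphological dictionary for the language concerned.

```python
import re
from collections import Counter

# Hypothetical, hand-made lemma table for illustration only.
# A real study would need a full morphological dictionary.
LEMMAS = {"has": "have", "had": "have", "having": "have"}


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_forms(text):
    """Count every surface form separately, as a naive machine count would."""
    return Counter(tokenize(text))


def count_lemmas(text, lemmas=LEMMAS):
    """Count tokens after folding known forms under their headword."""
    return Counter(lemmas.get(token, token) for token in tokenize(text))


# Invented sample sentence, not taken from any corpus.
sample = "She has a dog. They have a cat. He had a walrus."
forms = count_forms(sample)
lemmas = count_lemmas(sample)

print(len(forms), "distinct forms;", len(lemmas), "distinct headwords")
print("have as headword:", lemmas["have"], "| has as form:", forms["has"])
```

The same twelve running words give a different vocabulary size and a different ranking depending on which rule is chosen, and the choice of what goes into the lemma table is exactly where the human judgement comes in.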

And are words like “jack” or “rose”, which have so many individual meanings, counted as one word or as ten or so words? If a machine does it, the count is objective, but again not true to the substance of the matter. If a human intervenes, then subjectivity enters the frame, and the way one person collates it may give very different results from the way another person would do it.

That’s why I take statements like “80% of the words used are in 10,000 words” with a pinch of salt. Assuming the same rules for collating word variants under headwords apply throughout the list, it is probably proper to parrot Pareto and say that 20% of the vocabulary gives 80% of the effect. But that compares relative numbers to relative numbers, which is probably wiser, and more likely to hold good when comparing different languages, than comparing absolute numbers to relative numbers.
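The difference between the two kinds of coverage claim can be sketched in a few lines of Python. The function names and the toy counts below are invented purely for illustration and are not real frequency data. One function takes an absolute cut-off (the top n entries), the other a relative one (the top share of the vocabulary).

```python
from collections import Counter


def coverage_of_top_n(counts, n):
    """Share of all running words covered by the n most frequent entries."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


def coverage_of_top_share(counts, share):
    """Same measure, but the cut-off is a fraction of the vocabulary size."""
    n = max(1, round(len(counts) * share))
    return coverage_of_top_n(counts, n)


# Invented toy counts, not real data: 10 entries, 100 running words.
toy = Counter({
    "the": 50, "of": 20, "and": 10, "walrus": 5, "wainscoting": 5,
    "woodpecker": 4, "jack": 3, "rose": 1, "comrade": 1, "umpteen": 1,
})

print(coverage_of_top_n(toy, 3))        # absolute cut-off: top 3 entries
print(coverage_of_top_share(toy, 0.2))  # relative cut-off: top 20% of entries
```

The absolute figure depends entirely on how many entries the list has, which in turn depends on the collation rules discussed above, whereas the relative figure at least scales with the size of the list being measured.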

And then another problem is the way word frequency changes over time. An even more subtle marker of language change than the incursion of new words (which obviously don’t show up at all in the old frequency lists) is the change, not so much in meaning as in fashionability, of words which have been in the language the whole time. Some of these are glaringly obvious: nobody uses “comrade” in Russia as much as it was used 30 years ago. But in fact it applies to a greater or lesser degree to every word we use.

Frequency studies, then, ironically, need to be carried out far more frequently! I answered a question the other day on frequency lists for what used to be Serbo-Croat. Quite a lot of change has entered that language, especially as emerging ex-Yugoslav states have sought to distinguish their language from other similarly speaking states, including new letters in their languages and emphasising either regional words or regional pronunciations. There is of course a degree of artificiality about all that, but it cannot fail to influence the language really spoken by people.

However, from my research, the last available frequency studies for Serbo-Croat were from the 1960s. They are not available online for free; you have to pay to get them, which I did not do. I can therefore only imagine how erroneous they must be by now. The half-life of usefulness of such a study is probably ten years, and these were done 50 years ago, in a different constellation of countries, under a different political regime, in a different world with different things to do and different ways to live. It is high time for new frequency studies to be made in that language, and in many other languages too. I just wonder how old the frequency studies are by now which Ladybird Books admirably used to create the Key Words Reading Scheme, so well known and loved by children in the English-speaking world. Can anybody tell me that?

About David J. James

52 year old accountant who loves languages, literature, history, religion, politics, internet, vlogging and blogging and lively written discussion. Conservative Christian, married to an angel, we have three kids, and live in Warsaw, Poland. I can help you with company set-up, bookkeeping, payroll, tax, audit and due diligence all over Poland and the region.

Posted on 16/03/2011, in Blog only, Gold List Methodology, Languages and Linguistics. 1 Comment.

  1. A hearty Amen to that. I’ve experienced all of these problems with frequency lists and finally threw my hands up in the air in surrender. What I have been doing is reading Russian-language newspapers and watching Russian TV. When I see a recurring word written and spoken, it goes into the goldlist. Maybe you might come up with a method to compile a frequency list as a companion to the Goldlist method. Not to put any pressure on you...

