Word Frequency Issues for Language Learners

Peter and Jane book 1a Play with us

One well-known and practical application of word frequency analysis, but is the word list still current?

I wanted to address some issues with the whole question of using word frequency analysis when learning languages. It is obviously a good idea to use frequency studies if they are available.

They can function as checklists to ensure that our courses (whether we are users or compilers of courses) give complete coverage of the most frequent words. I do think that a course which claims to give 2,000 words lets its users down if a noticeable percentage of, say, the top 1,000 words is still missing while students are expected to learn words for wainscotting, walruses and woodpeckers.

One big problem with frequency dictionary analysis and word counts, especially when comparing between languages or methods, is what a “word” actually means. If we are talking about uninflected languages then the number of individual words is smaller. The original poster referred to Italian, and here it is a problem, because every single verb has umpteen forms, so is that one word or is that umpteen words?

If you use a machine to collate the frequency, then “has”, “have” and “had” will all show up as different words. Should it be three or one? In those cases where a noun has 12 separate declined forms in Czech or Polish, is that 12 words or one word? Or is it something in between, with the forms guessable from the usual paradigms being counted as the same word, but the irregular forms being considered separate?
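To make the “three or one?” question concrete, here is a minimal Python sketch that counts the same text twice: once by surface form, as a naive machine would, and once with inflected forms folded under a headword. The `LEMMAS` table, the function names and the sample sentence are all invented for illustration; a real study would need a full morphological lexicon and would still face the judgment calls described above.

```python
import re
from collections import Counter

# Hypothetical toy lookup: inflected form -> headword.
# This only covers one verb, purely to illustrate the idea.
LEMMAS = {"has": "have", "had": "have", "having": "have"}


def tokenize(text):
    """Lower-case the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def surface_counts(text):
    """Count every distinct spelling as its own word."""
    return Counter(tokenize(text))


def lemma_counts(text, lemmas=LEMMAS):
    """Count tokens after folding known inflected forms under a headword."""
    return Counter(lemmas.get(tok, tok) for tok in tokenize(text))


sample = "She has a cat. They have a dog. We had a bird."

by_form = surface_counts(sample)
by_lemma = lemma_counts(sample)

# "has", "have" and "had" each appear once by surface form,
# but collapse into a single entry of three under the headword "have".
print(by_form["has"], by_form["have"], by_form["had"])
print(by_lemma["have"])
print(len(by_form), len(by_lemma))  # size of each "vocabulary"
```

The same text yields two different vocabulary sizes depending only on the collation rule, which is why word counts from different sources are hard to compare.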

And are words like “jack” or “rose”, which have so many individual meanings, counted as one word or as ten or so words? If a machine does it, the count is objective, but again not true to the substance of the matter. If a human intervenes, then subjectivity enters the frame, and the way one person collates the list may give very different results from the way another person would do it.

That’s why I take statements like “80% of the words used are in 10,000 words” with a pinch of salt. Assuming the same rules for collating word variants under headwords apply throughout the list, it is probably proper to parrot Pareto and say that 20% of the vocabulary gives 80% of the effect. That compares relative numbers with relative numbers, which is probably wiser, and more likely to hold good when comparing different languages, than comparing absolute numbers with relative numbers.
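The difference between an absolute claim (“the top N words cover X% of text”) and a relative one (“Y% of the vocabulary covers X% of text”) can be sketched in a few lines of Python. The counts below are made up for illustration and do not come from any real frequency study; `coverage`, `words_needed` and `toy` are hypothetical names.

```python
from collections import Counter


def coverage(counts, top_n):
    """Share of all running words accounted for by the top_n most frequent entries."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(c for _, c in counts.most_common(top_n))
    return covered / total


def words_needed(counts, target):
    """Smallest number of top-ranked entries whose combined share reaches target."""
    total = sum(counts.values())
    running = 0
    for rank, (_, c) in enumerate(counts.most_common(), start=1):
        running += c
        if running / total >= target:
            return rank
    return len(counts)


# Invented toy counts (100 running words, 8 distinct entries).
toy = Counter({
    "the": 50, "of": 20, "and": 10, "cat": 8,
    "dog": 6, "walrus": 3, "woodpecker": 2, "wainscot": 1,
})

n = words_needed(toy, 0.75)
print(n)                 # absolute: how many entries are needed
print(n / len(toy))      # relative: what fraction of the vocabulary that is
print(coverage(toy, 2))  # share of text covered by the two commonest entries
```

Both figures move if the collation rule changes, since merging or splitting forms changes both the ranking and the size of the vocabulary, so any comparison between lists only makes sense when the same rule was used throughout.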

And then another problem is the way word frequency changes over time. An even more subtle marker of language change than the incursion of new words (which obviously don’t show up at all in the old frequency lists) is the change, not so much in meaning as in fashionability, of words which have been in the language the whole time. Some cases are glaringly obvious: nobody in Russia uses “comrade” as much as it was used 30 years ago. But in fact it applies to a greater or lesser degree to every word we use.

Frequency studies, then, ironically, need to be carried out far more frequently! I answered a question the other day on frequency lists for what used to be Serbo-Croat. Quite a lot of change has entered that language, especially as the emerging ex-Yugoslav states have sought to distinguish their language from that of their similarly speaking neighbours, introducing new letters and emphasising either regional words or regional pronunciations. There is of course a degree of artificiality about all that, but it cannot fail to influence the language really spoken by people.

However, from my research the last available frequency studies for Serbo-Croat were from the 1960s. They are not available online for free; you have to pay to get them, which I did not do. I can therefore only imagine how erroneous they must be by now. The half-life of usefulness of such a study is probably ten years, and these were done 50 years ago, in a different constellation of countries, under a different political regime, with a different world going on around them, with different things to do and different ways to live than now. It is high time for new frequency studies to be made in that language, and in many other languages too. I just wonder how old by now the frequency studies are which Ladybird books admirably used to create the Key Words Reading Scheme, so well known and loved by children in the English-speaking world? Can anybody tell me that?

About David J. James

56 year old UK origin Chartered Accountant and business consultant who loves languages, literature, history, religion, politics, internet, vlogging and blogging and lively written or spoken discussion, plays backgammon and a few other board games. Walks and listens to Audible for hours a day usually, and avoids use of the car. Conservative Christian, married to an angel with advanced Multiple Sclerosis. We have three kids, two of them autistic, and we live in Warsaw, Poland. On the board of the main British-Polish charity Fundacja Sue Ryder in Poland, and involved in the Vocational Autistic School of "Nie Z Tej Bajki" in Warsaw. Member of Gideons International. Serves on two committees of the Chamber of Auditors in Poland, and on several Boards and Supervisory Boards. Has own consultancy called Quoracy.com delivering business governance and audit/valuation solutions as well as mentoring. Author of the GoldList Method for systematic optimal use of the long-term memory in learning.

Posted on 16/03/2011, in Blog only, GoldList Method, Languages and Linguistics. 1 Comment.

  1. A hearty Amen to that. I’ve experienced all of these problems with frequency lists and finally threw my hands up in the air in surrender. What I have been doing is reading Russian-language newspapers and watching Russian TV. When I see a recurring word written and spoken, it goes into the goldlist. Maybe you might come up with a method to compile a frequency list as a companion to the Goldlist method. Not to put any pressure on you...
