Comparing Corpora using Frequency Profiling – Rayson & Garside

If you want to learn a technique, it’s a good idea to go back to its source. Whilst Rayson and Garside didn’t invent the technique, they perfected it! In the last post I explained how I implemented their work; this post is all about the ins and outs of their paper, which has been cited a huge 492 times!

Rayson, P., & Garside, R. (2000, October). Comparing corpora using frequency profiling. In Proceedings of the Workshop on Comparing Corpora (pp. 1-6). Association for Computational Linguistics.

First things first, it’s a really short paper! I was able to read it (available here) in under a day, in between work! I definitely recommend you read it, as it’s really well written and doesn’t expect you to have any statistical or linguistic background.

The authors set out to solve several problems: how to discover keywords (terms that differentiate, summarise or are crucially different to another corpus), how to differentiate one corpus, author or style from another, how to differentiate word senses, and how to do all of this quickly, especially with the rise in NLP for business and commercial purposes.

They position their research in relation to the chi-square test and the difference coefficient (Yule 1944). Both have their place but lack the adaptability that the log-likelihood ratio offers.

The process is the main meat of the paper and, honestly, it’s really well explained. As an example, here’s a print screen from the paper:

[Image: print screen from the paper]
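For anyone who prefers code to a screenshot, here is a minimal Python sketch of the calculation as I understand it from the paper: build the expected frequencies for a word from the two corpus sizes, then sum the observed-times-log(observed/expected) terms and double the result. The function names `log_likelihood` and `frequency_profile`, the whitespace-split toy sentences and the choice to skip zero counts are my own assumptions for illustration, not anything the authors published.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood for one word.

    a, b: frequency of the word in corpus 1 and corpus 2
    c, d: total number of words in corpus 1 and corpus 2
    """
    # Expected frequencies if the word were spread evenly across both corpora
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    total = 0.0
    # Assumption: a zero count contributes nothing (avoids log(0))
    if a > 0:
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2 * total


def frequency_profile(tokens1, tokens2):
    """Rank every word by log-likelihood, largest difference first."""
    f1, f2 = Counter(tokens1), Counter(tokens2)
    c, d = len(tokens1), len(tokens2)
    rows = [
        (word, f1[word], f2[word], log_likelihood(f1[word], f2[word], c, d))
        for word in set(f1) | set(f2)
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    # Toy corpora, purely illustrative
    corpus1 = "the cat sat on the mat".split()
    corpus2 = "the dog sat on the log".split()
    for word, freq1, freq2, ll in frequency_profile(corpus1, corpus2):
        print(f"{word:5s} {freq1:3d} {freq2:3d} {ll:8.3f}")
```

Words at the top of the list are the ones whose relative frequencies differ most between the two corpora; a word used at the same rate in both scores zero. The sketch doesn't tell you which corpus overuses the word, so you would still compare the two relative frequencies to get the direction.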


Interestingly, they mention how the p-value is going out of fashion in sociological and psychological research, so a benefit of the LLR is that it shows the size of the effect in the output itself. However, without a hypothesis and with the expected randomness of language, the authors strongly recommend that researchers “qualitatively examine examples of the significant words highlighted by this technique.”

In the example that the authors provide, they use a part-of-speech tagger and, most interestingly, a semantic analyser from Rayson and Wilson (1997). You have no idea how interesting this idea is: it lets them convert the language into a theme or topic rather than having to just use the tokenized words. Super interesting but something for another day… hopefully there’s a Python library for it!

All in all, the log likelihood ratio is a great method for frequency profiling, with a fairly easy implementation and some nice outputs that deliver actionable insights… or something.

Next week, probably Schmitz, R. M., & Kazyak, E. (2016). Masculinities in Cyberspace: An Analysis of Portrayals of Manhood in Men’s Rights Activist Websites. Social Sciences, 5(2), 18. It’s interesting but more so in the opportunities for further research!


