How to implement a log likelihood test on corpora using Python

Yep, the stats are back this week and they are even better! I took my lunch break to implement the log likelihood ratio that is described in this fascinating paper. It took me less time to code than the blasted chi squared and runs at least 10 times as fast. Here’s how I did it!

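Groundwork first. This is what the contingency table will look like:

                     corpus 1    corpus 2
    count of word       c1          c2
    total words         t1          t2

mmm, that high res ascii

Here c1 and c2 are how often the word turns up in each corpus, and t1 and t2 are the total word counts of each corpus. From this we need to work out the expected counts e1 and e2 with the following formulas: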


e1 = t1*(c1+c2)/(t1+t2)

e2 = t2*(c1+c2)/(t1+t2)
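
To make those two formulas concrete, here is a quick sketch with made-up counts (purely illustrative, not from my corpora): a word that shows up 120 times in a 10,000-word corpus and 30 times in a 20,000-word corpus.

```python
# Illustrative counts only, not taken from the real corpora.
c1, t1 = 120, 10000   # count of the word and total words, corpus 1
c2, t2 = 30, 20000    # count of the word and total words, corpus 2

# Expected counts if the word were spread evenly across both corpora.
e1 = t1 * (c1 + c2) / (t1 + t2)
e2 = t2 * (c1 + c2) / (t1 + t2)

print(e1, e2)
```

With these numbers e1 comes out at 50 and e2 at 100, so the word is well over its expected count in the first corpus and well under it in the second. A handy sanity check is that e1 + e2 always adds up to c1 + c2.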

And finally, from this we can work out the log likelihood ratio (ll1):

ll1 = 2*((c1*math.log(c1/e1))+(c2*math.log(c2/e2)))

Yep, you saw that right, you’re doing proper maths here so you’re gonna need to import math from the standard library.
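
If you want the whole thing in one place, here is a minimal sketch of how the steps might be wrapped up. The function name `log_likelihood` and the zero-count handling are my own assumptions rather than anything from the paper, and it assumes Python 3, where `/` is true division.

```python
import math


def log_likelihood(c1, c2, t1, t2):
    """Log likelihood score for one word across two corpora.

    c1, c2: how often the word occurs in corpus 1 and corpus 2
    t1, t2: total number of words in corpus 1 and corpus 2
    """
    # Expected counts, assuming no difference between the corpora.
    e1 = t1 * (c1 + c2) / (t1 + t2)
    e2 = t2 * (c1 + c2) / (t1 + t2)

    ll = 0.0
    for observed, expected in ((c1, e1), (c2, e2)):
        # math.log(0) raises ValueError, so a zero count contributes nothing
        # (assumed convention: x*log(x) tends to 0 as x tends to 0).
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Illustrative counts only, not taken from the real corpora.
print(log_likelihood(120, 30, 10000, 20000))
```

The zero check matters in practice because plenty of words appear in one corpus and never in the other, and the bare one-liner would fall over on those.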

The results from this will look a little like this,


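We read this as a log likelihood score of about 213.7. Pretty big! Strictly speaking, the score tells us how confident we can be that the difference between the corpora is real, not how large the difference is. In this example I was comparing my dissertation corpora of r/mylittlepony and r/theredpill, hence the importance of the word male (read scientific, abstract and impersonal) in the r/theredpill corpus.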
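
To put a number like that in context, one common approach is to compare the score against chi-squared critical values for one degree of freedom. The sketch below assumes that approach; the helper name `significance` is hypothetical and the thresholds are the standard critical values rounded to two decimals.

```python
# Chi-squared critical values for 1 degree of freedom, strictest first.
CRITICAL_VALUES = [
    (15.14, "p < 0.0001"),
    (10.83, "p < 0.001"),
    (6.63, "p < 0.01"),
    (3.84, "p < 0.05"),
]


def significance(ll):
    """Return the strictest significance level the score clears."""
    for threshold, label in CRITICAL_VALUES:
        if ll >= threshold:
            return label
    return "not significant"


print(significance(213.7))
```

A score of 213.7 sails past even the strictest threshold in that list, which is why it is safe to call it big.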

Honestly, that’s it! I’ll do a full write-up of the paper, it’s got some juicy bits in it but for now this is what I was really proud of today and what I think you might find useful!

Next up, hopefully, a longer write-up of Rayson, P., & Garside, R. (2000, October). Comparing corpora using frequency profiling. In Proceedings of the Workshop on Comparing Corpora (pp. 1-6). Association for Computational Linguistics.

