Yep, the stats are back this week and they are even better! I took my lunch break to implement the log likelihood ratio that is described in this fascinating paper. It took me less time to code than the blasted chi squared and runs at least 10 times as fast. Here’s how I did it!
Groundwork first, this is what the contingency table will look like (c1 and c2 are the word's frequency in each corpus, t1 and t2 are the corpus sizes in tokens):

                      Corpus 1   Corpus 2   Total
    Frequency of word    c1         c2      c1+c2
    Corpus size          t1         t2      t1+t2
and from this we need to work out the expected frequencies e1 and e2 with the following formulas:
e1 = t1*(c1+c2)/(t1+t2)
e2 = t2*(c1+c2)/(t1+t2)
and finally from these we can work out the log likelihood ratio (ll1):
ll1 = 2*((c1*math.log(c1/e1))+(c2*math.log(c2/e2)))
Yep, you saw that right, you’re doing proper maths here so you’re gonna need to import math from the standard library.
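The formulas above roll up into one small function. This is just a sketch, not the exact code from my lunch break: the function name and the example counts are made up, and I've added a guard because math.log(0) throws an error when a word is missing from one of the corpora.

```python
import math

def log_likelihood(c1, c2, t1, t2):
    """Log likelihood ratio for a word occurring c1 times in a corpus of
    t1 tokens and c2 times in a corpus of t2 tokens (Rayson & Garside 2000)."""
    e1 = t1 * (c1 + c2) / (t1 + t2)  # expected frequency in corpus 1
    e2 = t2 * (c1 + c2) / (t1 + t2)  # expected frequency in corpus 2
    ll = 0.0
    if c1 > 0:  # math.log(0) is undefined, so skip zero counts
        ll += c1 * math.log(c1 / e1)
    if c2 > 0:
        ll += c2 * math.log(c2 / e2)
    return 2 * ll

# Hypothetical counts, just to show the call: a word appearing 200 times
# in a 100,000-token corpus vs 40 times in an 80,000-token corpus.
print(round(log_likelihood(200, 40, 100_000, 80_000), 2))
```

The bigger the number, the bigger the difference between the two corpora for that word; a result of 0 means the word is distributed exactly as you'd expect by chance.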
In my case the result came out at roughly 213.7, which we interpret as the size of the effect. Pretty big! In this example I was comparing my dissertation corpora of r/mylittlepony and r/theredpill, hence the importance of the word male (read scientific, abstract and impersonal) in the r/theredpill corpus.
Honestly, that’s it! I’ll do a full write-up of the paper (it’s got some juicy bits in it), but for now this is what I was really proud of today and what I think you might find useful!
Next up, hopefully, a longer write-up of Rayson, P., & Garside, R. (2000, October). Comparing corpora using frequency profiling. In Proceedings of the Workshop on Comparing Corpora (pp. 1–6). Association for Computational Linguistics.