The Mann-Whitney-Wilcoxon U-Test For Corpus Linguistics (Python)

I’m currently working on a counter-analysis of the hypothesis proposed in a paper I read recently, and I thought I might share how I’m making the data do my bidding.

All cards on the table: I’m using Python 2.7 on a laptop with an i7 in it, working on a corpus of 14,000 tweets pulled via a set of seed keywords linked to AAVE, plus a comparison corpus based on general Twitter usage.

Good? Good! I’ll start with the process, then cover some of the theory: why you’d use the Mann-Whitney-Wilcoxon, why it works in my case, and finally how it works!

So I started off with my corpus in CSV format. Basic, I know, but as there’s nothing else out there about the basic process, I’m going to go through all my steps!

I import the corpora within ‘with x as y:’ blocks, as is good practice, then pull the data into a reader and wrap it so it’s a list rather than a generator. Top tip right there!

Your code is slow and you should feel bad!

import csv

with open("aave_csv_2.1.csv") as aave_csv:
  reader1 = list(csv.reader(aave_csv))

From this, I need to unpack the tweets, as they currently sit in nested lists of CSV rows. I kill two birds with one stone and tokenize them with NLTK, so I’ve got the text in the right format for the analysis my hypothesis needs. This is the slowest step in terms of time, so I might recommend refactoring it into a generator function (see the sketch after the loop), but this worked for me:

import nltk

concat1 = []
for tweet in reader1:
  # each tweet is a CSV row (a list); the tweet text sits in the first column
  for token in nltk.word_tokenize(tweet[0].decode("utf-8")):
    concat1.append(token)

This yields a nice list of all tokens in all the tweets within the corpus.
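If you do want that generator refactor, a minimal sketch might look like this (untested in my pipeline, and note that random.shuffle below needs a concrete list, so you’d have to materialise it before shuffling anyway):

def tokens(reader):
  # lazily yield tokens one tweet at a time instead of building one big list
  for tweet in reader:
    for token in nltk.word_tokenize(tweet[0].decode("utf-8")):
      yield token

concat1 = list(tokens(reader1))  # shuffle needs a real list, so materialise here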

I then shuffle the lists to get the random sampling that the comparison and ranking require, so we import random and use the following:

import random

random.shuffle(concat1)  # shuffles in place; it returns None, so don't reassign
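One extra tip: if you want the ‘random’ samples to be reproducible between runs, seed the generator before shuffling. A small sketch, where the seed value is arbitrary and concat2 stands in for the comparison corpus list, built the same way as concat1:

random.seed(42)          # arbitrary seed so reruns produce the same shuffles
random.shuffle(concat1)  # AAVE corpus tokens
random.shuffle(concat2)  # comparison corpus tokens, built the same way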

Now it’s time to take the samples we need from the corpus. I’m still experimenting with the ideal size and number of samples for the test, so I made up a function to slice the concatenated list into right-sized chunks (having the same sized chunks in the samples is important). This function, again, is far from optimised, but it works, and that’s all I really care about:

def chunks(seq, size):
  # "seq" rather than "list", so we don't shadow the builtin;
  # slices seq into equal-sized, non-overlapping chunks and
  # drops any leftover tokens at the end
  temp = []
  for start in range(0, (len(seq) // size) * size, size):
    temp.append(seq[start:start + size])
  return temp
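For example (the chunk size of 500 is just a placeholder while I experiment, and concat2 is the comparison corpus token list from the same pipeline):

chunk1 = chunks(concat1, 500)  # AAVE corpus samples; 500 is a placeholder size
chunk2 = chunks(concat2, 500)  # comparison corpus samples, same size by design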

Epic! We now have our equally sized random samples (without overlap!) from the two corpora, which we can compare in terms of the frequency of our words (because of my hypothesis). So we make this fuck-ugly iterator to compare them and build up the lists of ranks:

# "word" is the term currently under test (one of the seed keywords)
samplerank1 = []
samplerank2 = []
for item1, item2 in zip(chunk1, chunk2):
  if item1.count(word) > item2.count(word):
    samplerank1.append(1)
    samplerank2.append(2)
  elif item1.count(word) < item2.count(word):
    samplerank1.append(2)
    samplerank2.append(1)
  else:
    pass  # tied samples are skipped entirely
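As an aside, an alternative I’ve seen suggested (not what I do above) is to skip the manual 1/2 ranking and hand the raw per-chunk counts straight to SciPy, which does its own ranking internally; you’d then pass counts1 and counts2 to the mannwhitneyu call below:

counts1 = [sample.count(word) for sample in chunk1]  # raw per-chunk frequencies
counts2 = [sample.count(word) for sample in chunk2]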

And then we’re finally ready to run the SciPy function I know you’ve all been waiting for!

from scipy import stats

print stats.mannwhitneyu(samplerank1, samplerank2)

and we have our U statistic and our p-value!
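To actually use the result, you can unpack it. A small sketch, where 0.05 is just the conventional cutoff (and be aware that older SciPy versions return a one-sided p-value by default, so check which yours gives you):

u_stat, p_value = stats.mannwhitneyu(samplerank1, samplerank2)
if p_value < 0.05:  # conventional significance threshold
  print "significant difference in frequency for %s" % word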

“If the implementation is hard to explain, it’s a bad idea.” (The Zen of Python)…

BUT WHY DOES IT WORK JOE?!

I hear you cry… well! It works because I’m aiming to compare two frequency distributions to find the tokens that differ to a statistically significant degree. The hypothesis I’m testing is that of Bucholtz, who claims that AAVE is used, because of its links to hyperphysical masculinity, for violence and aggression. Our test shows which terms have higher frequencies in our corpus compared to the control corpus.

Finally, how does it work? Honestly, I’m not going to try to summarise it here, but I do recommend the Wikipedia article: while it’s pretty limited in terms of its links to linguistics, it does explain the sampling and ranking process!
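That said, here’s a toy sketch of the rank arithmetic at the heart of the U statistic (made-up numbers, no ties, just to show the idea):

a = [1, 4, 6]                                     # counts from corpus A
b = [2, 3, 5]                                     # counts from corpus B
pooled = sorted(a + b)                            # rank both samples together
rank_sum_a = sum(pooled.index(x) + 1 for x in a)  # ranks are 1-based positions
u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
print u_a  # 5: the number of (a, b) pairs where a's value is the larger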


Next week? I’m going to be reading more of Kuznetsova, J. (2015), Linguistic Profiles: Going from Form to Meaning via Statistics (Vol. 53, Walter de Gruyter), as it’s a great read, about a language that interests me, and it seems to have some links to gender and gendered language use!
