Corpus Annotation

On the back of the corpus chapter that I read through here, I thought I would pick up an old project that I might explain in another post. Long story short, I wanted to build a system that takes input text and returns innuendo. I chose innuendo as a form of humour because almost anything can seemingly have its meaning twisted, so training material for the system would be plentiful.

I started out by opening Google… I’m not going to pretend that I knew all this already!

I imported my corpus (only a 5k-line sample of my 27,343 KB monster of .txt web scraping), tokenised it with nltk.word_tokenize(), tagged it with nltk.pos_tag() and, well, that's enough to return a list of (token, tag) tuples! Here's a sample:

(…('beautiful', 'JJ'), ('sight', 'NN'), ('.', '.'), ('Allan', 'NNP'), (',', ','), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('guys', 'NNS'), ('sharing', 'VBG'), ('our', 'PRP$'), ('room', 'NN'), ('was', 'VBD'), ('there', 'RB'), ('too', 'RB'), ('So', 'RB'), ('I', 'PRP'), ('knew', 'VBP'), ('who', 'WP'), ('it', 'PRP'), ('was', 'VBD'), ('that', 'IN'), ('came', 'VBD'), ('out', 'RP'), ('on', 'IN'), ('the', 'DT'), ('balcony', 'NN'), ('He', 'PRP'), ('stood', 'VBD'), ('behind', 'IN'), ('me', 'PRP'), ('and', 'CC'), ('his', 'PRP$'), ('hands', 'NNS'), ('on', 'IN'), ('my', 'PRP$'), ('shoulders', 'NNS'), ('did', 'VBD'), ("n't", 'RB'), ('make', 'VB'), ('me', 'PRP'), ('jump', 'NN'), ('.', '.'), ('He', 'PRP'), ('commented', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('sight', 'NN'), (',', ','), ('and', 'CC'), ('how', 'WRB'), ('good', 'JJ'), …)
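Since the output is just a list of (token, tag) pairs, pulling out a particular part of speech takes nothing more than a list comprehension. A minimal sketch, using a hand-typed slice of the sample above rather than the full corpus:

```python
# A hand-typed slice of the tagged sample: (token, tag) pairs
# in the shape that nltk.pos_tag() returns.
tagged = [
    ("beautiful", "JJ"), ("sight", "NN"), (".", "."),
    ("He", "PRP"), ("stood", "VBD"), ("behind", "IN"),
    ("me", "PRP"), ("and", "CC"), ("his", "PRP$"),
    ("hands", "NNS"), ("on", "IN"), ("my", "PRP$"),
    ("shoulders", "NNS"),
]

# Keep any noun tag -- NN, NNS, NNP and NNPS all start with "NN".
nouns = [token for token, tag in tagged if tag.startswith("NN")]
print(nouns)  # ['sight', 'hands', 'shoulders']
```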

Errors aside (I'll talk about those later), I just had to start digging into some insights within the corpus.

So, let's have a look at the most popular nouns. I made a counter function that increments a key's value, or creates the key with a value of 1, for each tokenised noun, then created a sorted list of tuples using this handy recipe:

import operator

sorted_nn = sorted(nn.items(), key=operator.itemgetter(1), reverse=True)
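The increment-or-create counting plus the sorted() recipe can also be done in one step with collections.Counter. A sketch on made-up tagged data, where nn stands in for the noun-count dictionary described above:

```python
from collections import Counter

# Made-up tagged data standing in for the corpus sample.
tagged = [("sight", "NN"), ("hands", "NNS"), ("sight", "NN"),
          ("balcony", "NN"), ("hands", "NNS"), ("sight", "NN")]

# Count every noun token -- equivalent to the increment-or-create dict.
nn = Counter(token for token, tag in tagged if tag.startswith("NN"))

# most_common() returns (token, count) tuples sorted in descending order,
# the same result as sorted(nn.items(), key=operator.itemgetter(1), reverse=True).
print(nn.most_common(3))  # [('sight', 3), ('hands', 2), ('balcony', 1)]
```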

and then from this, I can create a nice graph!

At least we know it's a corpus… good for our purposes!

In the meantime I'm going to see if I can work on some pattern matching against the input data that we want to smutify, but I thought this was a good demonstration of how useful nltk can be to the linguistic researcher, and how quickly it can process horrific quantities of data!

Positivity aside, let’s look at the errors that our tagger is throwing up…

Taking an unfiltered frequency distribution of the tagged tokens and pairing each with its tag, we can get a rough idea of which words are tagged wrongly most often. A little list-comprehension wizardry and we get this list of frequencies against POS tags:

('the', 4108, 'DT')
('and', 3607, 'CC')
('her', 3000, 'PRP$')
('to', 2962, 'TO')
('she', 1859, 'PRP')
('was', 1778, 'VBD')
('of', 1703, 'IN')
('a', 1700, 'DT')
('my', 1649, 'PRP$')
('he', 1508, 'PRP')
('in', 1322, 'IN')
('his', 1318, 'PRP$')
('it', 1279, 'PRP')
('as', 1163, 'RB')
('you', 967, 'PRP')
('that', 961, 'IN')
('me', 909, 'PRP')
('with', 876, 'IN')
('on', 876, 'IN')
("'s", 723, 'VBZ')…

Nothing too worrying so far; these are pretty typical words. A little more digging does show some errors, though: the 88th most common token is "('cum', 171, 'VB')", where 'cum' could be either a verb or a noun, echoing the problem of homonyms that Dr McEnery mentioned. Looking back, I should have created a frequency distribution of the token concatenated with its tag, since pos_tag() does tag dynamically depending on context (the perceptron tagger's features include the surrounding words, which is really cool).
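That (token, tag) frequency distribution is easy to retrofit: counting pairs rather than bare tokens keeps the two readings of an ambiguous word apart. A sketch with collections.Counter on hypothetical tagged data (nltk's FreqDist would accept the same pairs):

```python
from collections import Counter

# Hypothetical tagged output where "jump" appears under two tags.
tagged = [("jump", "NN"), ("jump", "VB"), ("jump", "NN"),
          ("make", "VB"), ("me", "PRP")]

# Counting (token, tag) pairs separates the noun and verb readings...
pair_freq = Counter(tagged)
print(pair_freq[("jump", "NN")])  # 2
print(pair_freq[("jump", "VB")])  # 1

# ...whereas counting tokens alone collapses them into one figure.
token_freq = Counter(token for token, tag in tagged)
print(token_freq["jump"])  # 3
```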

A bit of background on the pos_tag() function: it's based on the ML system PerceptronTagger, "as implemented by Matthew Honnibal" according to this page; just a shame that it links to a 404. The GitHub repository of the project can be found here and looks like a really readable project. I might look into a project of reading through code like this in a blog format… but that's for another day!

In conclusion, there are quick wins to be had here, with potential for insights from the frequency of parts of speech. In terms of future developments, I think looking at the patterns the text creates would be fruitful: how the nouns are modified by adjectives, what actions they are performing, and whether they are the active or passive actor in the phrase. But that's something for another time!
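The adjective-modifies-noun pattern, at least, can be pulled out by scanning adjacent tag pairs. A minimal sketch over a hand-made tagged list, looking for JJ followed by any noun tag (real input would come from nltk.pos_tag()):

```python
# Hand-made tagged sample standing in for real pos_tag() output.
tagged = [("a", "DT"), ("beautiful", "JJ"), ("sight", "NN"),
          ("the", "DT"), ("good", "JJ"), ("guys", "NNS"),
          ("he", "PRP"), ("stood", "VBD")]

# Slide a window of two over the tagged tokens and keep JJ + noun pairs.
adj_noun = [
    (w1, w2)
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
    if t1 == "JJ" and t2.startswith("NN")
]
print(adj_noun)  # [('beautiful', 'sight'), ('good', 'guys')]
```

Working out active versus passive actors would need more than adjacent pairs (verb frames or a parser), so that really is one for another time.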

Next up is still going to be Hearst, M. A., "Text Data Mining", in The Oxford Handbook of Computational Linguistics (Oxford University Press), and maybe some code analysis/reading in the meantime!

