I went into this chapter (24 in the Oxford Handbook of Computational Linguistics) to answer a question that motivated me to get the book in the first place: “How should I extract quantitative proof from a corpus?”. Unfortunately, it didn’t answer this question, but it did provide a great jumping-off point for further research.
Mitkov, R. (2005). The Oxford handbook of computational linguistics. Oxford University Press.
McEnery (whose bibliography can be found here) aims for an overview of the use of corpora in linguistics and the problems that come with it. Whilst I found little that was new to me, I realise that the target audience may be someone from a computational background coming into the field of computational linguistics. Maybe? Honestly, I don’t know; the tone and level of other chapters are pitched differently. I’d recommend this chapter if you’re considering linguistic approaches and want an overview of the pros, cons and overall processes.
I was really interested in learning about the importance of sampling: the way that McEnery breaks down the terminology of the sampling frame and argues that it should be consciously taken into account. McEnery says:
… the corpus should aim for balance and representativeness within a specific sampling frame, in order to allow a particular variety of language to be studied or modelled.
It’s an important point that I’m lucky I stumbled into fulfilling, without thinking about it, in my master’s dissertation!
Not much time is given to the question of constructing or using a corpus to satisfy a research question, but I assume that’s because McEnery is aiming to answer a different set of questions (“what is a corpus, what are the pros and cons and different varieties?”). I would have appreciated a few signposts to potential approaches for using corpora to answer a research question, but that’s a me-problem.
McEnery gives a good amount of time to exploring corpus annotation and the processes it requires. Evaluating fully automated, semi-automated and manual processes, McEnery is most positive about the potential (in terms of cost/benefit) of semi-automated systems. While not 100% accurate, they would be more consistent:
… given the same set of decision making conditions in two different parts of the corpus, the answer given is the same. This consistency for the machine derives from its impartial and unswerving application of a program.
It seems that keeping this consistency in mind during your analysis would outweigh the quirks the automated process introduces into the corpus.
In conclusion, it’s a good chapter, but it definitely fits into a handbook for use in the definitions and foundations of a study rather than guiding the implementation of a specific process.
Honestly, I’m excited to crack open a corpus of text that I scraped a year ago now, apply some NLTK tagging goodness and see if I can pull some meaning from it. Next paper should be: Hearst, M. A. Text Data Mining. In The Oxford Handbook of Computational Linguistics. Oxford University Press.