How can we query a large database and get the most relevant text documents? What methodology displays the best results and what does this tell us about the nature of our language and our existing methodologies of research? Tell me honestly that none of those questions grabs your interest and I’ll call you a liar!
Tzoukermann, Klavans & Strzalkowski, in “The Oxford Handbook of Computational Linguistics,” edited by R. Mitkov (2003).
I came to this paper thinking, tangentially, that measuring the degree to which a statement or query intersects with a corpus would be a fruitful way to test expected behaviour and sentiment against a sample, and boy was I right. Whilst the aim of the chapter is more the traditional query in -> relevant resource out, it still touches on using retrieval to test hypotheses.
Rather than summarise, I’d love to share some of the really interesting ideas that stood out for me.
The first one was the idea of over- and under-stemming. Overstemming conflates forms that are not in fact morphologically related (magnesia, magnet, magnetic). Understemming is when related words aren’t conflated, like acquisition and acquire.
These can manifest as either overgeneralising and muddying your conclusions, or undergeneralising (is that even a word?) and missing your trend. Two scary potentials that manifest in a surprising truth.
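To make the failure modes concrete, here’s a toy suffix-stripping stemmer (a deliberately crude sketch of my own, not the Porter algorithm or anything from the chapter) whose suffix list is rigged so that both errors show up on the chapter’s own examples:

```python
# Toy suffix-stripper illustrating over- and under-stemming.
# The suffix list is hypothetical and chosen to expose both failure modes.
SUFFIXES = ("esia", "etic", "ition", "et", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 chars."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Overstemming: three unrelated words all collapse to the same stem.
over = {naive_stem(w) for w in ("magnesia", "magnet", "magnetic")}

# Understemming: two related words fail to conflate.
under = {naive_stem(w) for w in ("acquisition", "acquire")}
```

Here `over` ends up as a single stem (the three words are wrongly merged), while `under` holds two distinct stems (the related pair is wrongly kept apart) — exactly the muddied conclusions and missed trends described above.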
Harman (1991) shows that stemming provides no improvement over no stemming at all, and that different stemming algorithms do not affect performance.
So.. there’s no advantage?!
Well, not necessarily: there are studies showing anywhere between a 1.3% and a 45.3% performance improvement. But I’ll admit, what I’m after is efficiency in using this for data mining, and business cases care most about accuracy, right? It’s a confusing question and one I’ll admit I don’t have an answer to; the terminology of NLP is something I want to work on, especially on the metrics side of NLP and Comp-Lang.
There’s then some analysis of indexing documents via a frequency distribution, and the impact this makes with and without stemming (spoiler alert: not much).
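For anyone who hasn’t built one, a frequency-based index is pleasingly simple. This is my own minimal sketch (the mini-corpus and doc IDs are made up): an inverted index mapping each term to the documents containing it, with raw term frequencies for ranking.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus; doc IDs and text are invented for illustration.
docs = {
    1: "magnets attract magnetic materials",
    2: "the acquisition was completed",
    3: "they acquire magnetic sensors",
}

def build_index(docs: dict) -> dict:
    """Map each term to {doc_id: term frequency} for that document."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

index = build_index(docs)

# A query term retrieves matching docs, ranked by raw term frequency.
hits = sorted(index.get("magnetic", {}).items(), key=lambda kv: -kv[1])
```

Note that without stemming, "magnets" and "magnetic" sit under separate index entries, so a query for one misses documents containing only the other — which is precisely the gap stemming is supposed to close, and apparently mostly doesn’t.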
The second interesting point is on page 539. This section covers building queries to run against the indexed document database. Again, I’m most impressed by the lack of result. The aim is to return relevant documents from a simple natural-language query. There was some initial background on how early forays into this research area were semantically based, aiming to mark up the concepts and ideas nested within the text to allow for better accuracy, but these proved too difficult to implement (take note, Google, with your schema fetish (Meusel, Bizer, & Paulheim 2015)).
The conclusion covered the impact WordNet had on the research. Coming into this paragraph I was thinking about how I would approach the problem, and using a thesaurus to get a more rounded view of the concepts in the query seemed the best bet — something I’ve done with WordNet before for work. Turns out the results were poor!
…formal evaluations revealed that in most cases the impact of this knowledge on query expansion has been negligible (Voorhees 1993)
This, along with the earlier note, was fascinating to me. The authors mention that this lack of success might be due to flawed WordNet data, or to the fact that synonyms are likely to be less popular than the original query terms and so only pick up some long-tail results. With the keyword-expansion work we do where I work (shout out to Propellernet), I’d be interested to compare the ROI we get on longer-tail phrases and whether it would be in line with Voorhees’s work.
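The expansion idea itself is easy to sketch. Here’s a minimal version of thesaurus-based query expansion, with a tiny hand-made synonym table standing in for WordNet (in practice you’d swap `SYNONYMS` for real WordNet synset lookups):

```python
# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS = {
    "buy": ["purchase", "acquire"],
    "car": ["automobile"],
}

def expand_query(query: str) -> list:
    """Return the original query terms plus any thesaurus synonyms."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

expanded = expand_query("buy car")
```

The catch Voorhees found is visible even here: the added terms ("automobile" rather than "car") tend to be rarer than what the user typed, so the expanded query mostly drags in long-tail matches rather than lifting precision on the head.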
This was the last chapter I powered through in the Handbook; it’s a good basis/reference book, but for the money it’s probably not as worthwhile as reading a good blog on the subject 😉
Next might be the results of some AAVE analysis I’m doing with the help of Twitter after being inspired/angered by last week!