Part-of-Speech Tagging (OxUni)

Application domain:	Part-of-Speech Tagging
Further specification:	66025 distinct tagged word tokens, 85171 sentences and 45 tags
Pointers:	Contact James Cussens (James.Cussens@comlab.ox.ac.uk)
Data complexity:	18 MB
Data format:	Progol

The data

Tagging words with their correct part-of-speech (singular proper noun, predeterminer, etc.) is an important precursor to further automatic natural language processing. The task is non-trivial because many (English) words have several possible part-of-speech (POS) tags. For example, "Best" can be a superlative adjective, a superlative adverb, or a proper noun, as in "Mr. Best".

The training data comprised two thirds of the Wall Street Journal corpus (cf. previous section), in which each word is tagged with its correct part-of-speech for the sentence it occurs in. The corpus contains an unknown amount of noise. From the corpus, each word was mapped to its set of possible POS tags, together with the relative frequency of each POS tag.

Progol was given positive and negative examples of when to eliminate a POS tag. Each example included the entire unambiguous left and right context (in terms of tags) of the focus word. The left context was reversed for efficiency. Background knowledge comprised predicates for each of the 45 tags, as well as grammatical constructs such as verb phrase and noun phrase.
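The preprocessing described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used for the dataset: the function names (build_lexicon, focus_contexts, elimination_examples), the tuple layout of an example, and the toy corpus are all hypothetical, and the "unambiguous" context is read here as the tags given in the corpus. The actual data is in Progol format, which this sketch does not reproduce.

```python
from collections import Counter, defaultdict


def build_lexicon(tagged_sentences):
    """Map each word to its possible tags and their relative frequencies."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    lexicon = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        lexicon[word] = {tag: n / total for tag, n in tag_counts.items()}
    return lexicon


def focus_contexts(sentence, index):
    """Return the tag contexts around the focus word at `index`.

    The left context is reversed so that the tag nearest to the focus
    word comes first, mirroring the reversal mentioned in the text.
    """
    tags = [tag for _, tag in sentence]
    left = tags[:index][::-1]
    right = tags[index + 1:]
    return left, right


def elimination_examples(sentence, lexicon):
    """Build (eliminate?, left, right, tag) tuples for ambiguous words.

    Assumption: a candidate tag that differs from the corpus tag gives a
    positive example of elimination; the corpus tag gives a negative one.
    """
    examples = []
    for i, (word, gold) in enumerate(sentence):
        candidates = lexicon.get(word, {})
        if len(candidates) < 2:
            continue  # unambiguous words yield no examples
        left, right = focus_contexts(sentence, i)
        for tag in candidates:
            examples.append((tag != gold, left, right, tag))
    return examples


# Toy corpus (hypothetical), using Penn Treebank style tags.
corpus = [
    [("Mr.", "NNP"), ("Best", "NNP"), ("left", "VBD")],
    [("the", "DT"), ("Best", "JJS"), ("offer", "NN")],
    [("the", "DT"), ("Best", "JJS"), ("price", "NN")],
]
lexicon = build_lexicon(corpus)
examples = elimination_examples(corpus[0], lexicon)
```

Reversing the left context puts the nearest neighbour of the focus word at the head of the list, so a rule that inspects only the adjacent tags can do so without traversing the whole list.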

The experiments

Work is still in progress. However, using the statistical information for each word together with a preliminary set of 798 tag elimination clauses induced by Progol, a per-word accuracy of 95.7% was achieved on the test set.
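The text does not say how the word statistics and the elimination clauses were combined. One plausible reading, sketched below purely as an assumption, is to discard every candidate tag that some rule eliminates and then pick the most frequent surviving tag. The rule no_jjs_after_nnp, the functions tag_word and per_word_accuracy, and the toy lexicon are hypothetical and are not taken from the 798 induced clauses.

```python
def tag_word(word, left, right, lexicon, rules):
    """Pick the most frequent tag that no elimination rule removes.

    `left` is the reversed left tag context, `right` the right tag context.
    If every candidate is eliminated, fall back to the full candidate set.
    Returns None for words missing from the lexicon.
    """
    candidates = lexicon.get(word, {})
    surviving = {
        tag: freq
        for tag, freq in candidates.items()
        if not any(rule(left, right, tag) for rule in rules)
    }
    pool = surviving or candidates
    return max(pool, key=pool.get) if pool else None


def per_word_accuracy(predicted, gold):
    """Fraction of words whose predicted tag equals the gold tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def no_jjs_after_nnp(left, right, tag):
    """Hypothetical rule: drop JJS when the preceding tag is NNP."""
    return tag == "JJS" and left[:1] == ["NNP"]


# Toy lexicon (hypothetical relative frequencies).
toy_lexicon = {"Best": {"JJS": 2 / 3, "NNP": 1 / 3}}
rules = [no_jjs_after_nnp]

after_title = tag_word("Best", ["NNP"], ["VBD"], toy_lexicon, rules)
after_article = tag_word("Best", ["DT"], ["NN"], toy_lexicon, rules)
```

In this toy setting, the rule overrides the frequency-based default for "Best" after a proper noun, while the more frequent tag is kept elsewhere.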

Reference

Cussens, J., Part-of-speech disambiguation using ILP. Technical report, Oxford University Computing Laboratory, 1996.
