Application domain: | Natural Language Processing |
Further specification: | data (WSJ corpus) |
Pointers: | Contact Walter Daelemans Walter.Daelemans@kub.nl |
Data complexity: | 25 MB (2.5 million words) |
Data format: | tagged corpus |
Language Technology (the design and construction of Natural Language Processing systems, e.g. Machine Translation, Natural Language Interfaces, and Intelligent Text Processors) presents a number of unique and complex problems for which ILP techniques might be successfully used. Results for these real-world problems, which are very different from standard ML benchmark problems, will provide valuable feedback to the ILP community about the possibilities and limitations of the approach in Language Technology.
The formalization of lexical and grammatical knowledge necessary for building robust NLP systems has to be done (almost from scratch) for each new language, application, and domain, and therefore constitutes a serious knowledge acquisition bottleneck. Standard data mining techniques (rule and decision tree induction, memory-based learning) have been successfully used in this area to alleviate this problem, but these techniques exhibit an important drawback: the input and learned representations are limited to a propositional language (attribute-value representations). ILP techniques may help in the acquisition of more compact and linguistically more relevant generalizations because of their ability to learn more expressive languages. We can contribute to the ILP consortium some of our databases representing core classification problems in areas ranging from phonology (phoneme recognition and transliteration) to semantics (PP-attachment), and propose to study the applicability of ILP techniques to learning these problems in comparison with alternative inductive learning techniques.
John hit Pete -> Noun Verb Noun
The target category inventory (tag set) may range from extremely simple (order 10) to extremely complex (order 1000).
Tagging is a hard problem because natural language text is massively ambiguous (how ambiguous depends, of course, on the tag set used). In the example above, hit can be either a noun or a verb; context determines that in this case it is a verb.
The correct category of a word depends on 1. its lexical probability p(cat|word), and 2. its contextual probability p(cat|context).
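As a minimal sketch of the two quantities above, both can be estimated by relative frequency from a tagged corpus. The toy sentences below are invented for illustration and are not taken from the WSJ corpus; the function names are hypothetical, and "context" is simplified here to the tag of the preceding word, which is only one of many possible definitions.

```python
from collections import Counter, defaultdict

# Toy tagged sentences, invented for illustration (not WSJ data).
TAGGED = [
    [("John", "Noun"), ("hit", "Verb"), ("Pete", "Noun")],
    [("the", "Det"), ("hit", "Noun"), ("sold", "Verb")],
    [("Pete", "Noun"), ("hit", "Verb"), ("John", "Noun")],
]

def normalize(counts):
    """Turn nested counts {key: Counter} into relative frequencies."""
    return {
        key: {cat: n / sum(cats.values()) for cat, n in cats.items()}
        for key, cats in counts.items()
    }

def lexical_probabilities(sentences):
    """Estimate p(cat | word) by relative frequency."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, cat in sentence:
            counts[word][cat] += 1
    return normalize(counts)

def contextual_probabilities(sentences):
    """Estimate p(cat | previous cat); assumes context = preceding tag."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        previous = "<s>"  # sentence-start marker (an assumption of this sketch)
        for _, cat in sentence:
            counts[previous][cat] += 1
            previous = cat
    return normalize(counts)

lexical = lexical_probabilities(TAGGED)
contextual = contextual_probabilities(TAGGED)
print(lexical["hit"])      # distribution over the tags seen for "hit"
print(contextual["Noun"])  # distribution over tags that follow a Noun
```

On this toy data the word hit is ambiguous lexically, while the contextual estimate for the position after a Noun strongly favours Verb, which is the kind of evidence a tagger combines.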
A good tagger is instrumental in a large number of language engineering applications (ranging from text-to-speech through parsing to information retrieval). However, the target tag set and training corpus differ from one application to the next, and building a tagger by hand is expensive and difficult. Therefore, robust, accurate taggers that can be learned automatically from example corpora are a commercially interesting product.
A tagger would be trained on this knowledge (and additional domain knowledge), take sequences of untagged text as input, and produce the corresponding tag strings.
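The train-then-tag workflow described above can be sketched as a simple greedy tagger. This is an illustrative baseline under stated assumptions, not the method proposed in this description: it scores each candidate tag by lexical relative frequency times a smoothed probability given the previous tag, and it falls back to a hypothetical default tag for unknown words. The training sentences are invented, not WSJ data.

```python
from collections import Counter, defaultdict

def train(sentences):
    """Collect word->tag and previous-tag->tag counts from tagged sentences."""
    lexical = defaultdict(Counter)
    contextual = defaultdict(Counter)
    for sentence in sentences:
        previous = "<s>"  # sentence-start marker (assumption of this sketch)
        for word, cat in sentence:
            lexical[word][cat] += 1
            contextual[previous][cat] += 1
            previous = cat
    return lexical, contextual

def tag(words, lexical, contextual, default="Noun"):
    """Greedy left-to-right tagging; `default` is a hypothetical fallback."""
    tags = []
    previous = "<s>"
    for word in words:
        candidates = lexical.get(word)
        if not candidates:
            best = default  # unknown word: no lexical evidence available
        else:
            total = sum(candidates.values())
            context = contextual.get(previous, Counter())
            context_total = sum(context.values())

            def score(cat):
                lex = candidates[cat] / total
                # Add-one smoothing so unseen transitions do not score zero.
                con = (context[cat] + 1) / (context_total + len(candidates))
                return lex * con

            best = max(candidates, key=score)
        tags.append(best)
        previous = best
    return tags

# Toy training data, invented for illustration.
TRAIN = [
    [("John", "Noun"), ("hit", "Verb"), ("Pete", "Noun")],
    [("the", "Det"), ("hit", "Noun"), ("sold", "Verb")],
]
lexical, contextual = train(TRAIN)
print(tag(["John", "hit", "Pete"], lexical, contextual))
print(tag(["the", "hit", "sold"], lexical, contextual))
```

In this toy setting the lexical evidence for hit is evenly split, so the previous tag decides the outcome. A greedy pass cannot revise earlier decisions; a real tagger would typically use a wider context or a search over tag sequences.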
( (S (NP BELL INDUSTRIES Inc.) (VP increased (NP its quarterly) (PP to (NP 10 cents)) (PP from (NP seven cents (NP a share))))) .)
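To show how a bracketed structure like the one above can be read programmatically, here is a minimal sketch that parses it into nested lists and recovers the word sequence. The helper names are hypothetical, and the sketch assumes that a token directly after an opening bracket is a constituent label and every other token is a word, which holds for this sample but is not guaranteed for the whole corpus.

```python
EXAMPLE = (
    "( (S (NP BELL INDUSTRIES Inc.) (VP increased (NP its quarterly) "
    "(PP to (NP 10 cents)) (PP from (NP seven cents (NP a share))))) .)"
)

def tokenize(text):
    """Split into brackets and whitespace-separated symbols."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Parse one bracketed constituent into a nested list."""
    def read(pos):
        assert tokens[pos] == "(", "constituent must start with '('"
        pos += 1
        node = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = read(pos)
                node.append(child)
            else:
                node.append(tokens[pos])
                pos += 1
        return node, pos + 1  # skip the closing bracket

    tree, _ = read(0)
    return tree

def leaves(node):
    """Collect words left to right, skipping a leading label if present."""
    has_label = bool(node) and isinstance(node[0], str)
    children = node[1:] if has_label else node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words

tree = parse(tokenize(EXAMPLE))
words = leaves(tree)
print(" ".join(words))
```

The recovered word sequence is the untagged input a tagger or parser would see, while the nested list keeps the constituent labels (S, NP, VP, PP) available as training targets.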