Natural Language Processing (KUL)

Application domain: Natural Language Processing
Further specification: data (WSJ corpus)
Pointers: Contact Walter Daelemans
Data complexity: 25 MB (2.5 million words)
Data format: tagged corpus

Language Technology (the design and construction of Natural Language Processing systems; e.g. Machine Translation, Natural Language Interfaces, Intelligent Text Processors, etc.) presents a number of unique and complex problems for which ILP techniques might be successfully used. Results for these real-world problems, which are very different from standard ML benchmark problems, will provide valuable feedback to the ILP community about the possibilities and limitations of the approach in Language Technology.

The formalization of lexical and grammatical knowledge necessary for building robust NLP systems has to be done (almost from scratch) for each new language, application, and domain, and therefore constitutes a serious knowledge acquisition bottleneck. Standard data mining techniques (rule and decision tree induction, memory-based learning) have been successfully used in this area to alleviate this problem, but these techniques exhibit an important drawback: the input and learned representations are limited to a propositional language (attribute-value representations). ILP techniques may help in the acquisition of more compact and linguistically more relevant generalizations because of their ability to learn more expressive languages. We can contribute to the ILP consortium some of our databases representing core classification problems in areas ranging from phonology (phoneme recognition and transliteration) to semantics (PP-attachment), and propose to study the applicability of ILP techniques to learning these problems in comparison with alternative inductive learning techniques.

The Part-of-Speech Tagging Problem

Problem description

Given a text, provide for each word in the text its contextually disambiguated part of speech (morphosyntactic category), i.e. transform a string of words into a string of tags. For example,
John hit Pete -> Noun Verb Noun
The target category inventory (tag set) may range from extremely simple (order 10) to extremely complex (order 1000).

Tagging is a problem because of the massive ambiguity in natural language text (the ambiguity also depends, of course, on the tag set used). In the example above, hit can be both a noun and a verb; context determines that in this case it is a verb.

The correct category of a word depends on 1. its lexical probability p(cat|word), and 2. its contextual probability p(cat|context).
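The combination of these two probabilities can be sketched as follows. The tiny probability tables below are invented for illustration; a real tagger would estimate them from a tagged corpus.

```python
# p(cat|word): lexical probabilities (illustrative, hand-made figures)
lexical = {
    "John": {"Noun": 1.0},
    "hit":  {"Noun": 0.3, "Verb": 0.7},
    "Pete": {"Noun": 1.0},
}

# p(cat|context), here approximated by p(cat|previous cat) (bigram context)
contextual = {
    ("Start", "Noun"): 0.8, ("Start", "Verb"): 0.2,
    ("Noun", "Verb"): 0.6,  ("Noun", "Noun"): 0.4,
    ("Verb", "Noun"): 0.9,  ("Verb", "Verb"): 0.1,
}

def tag(words):
    """Greedy left-to-right tagging: for each word pick the category
    maximizing p(cat|word) * p(cat|previous cat)."""
    tags, prev = [], "Start"
    for w in words:
        cats = lexical[w]
        best = max(cats, key=lambda c: cats[c] * contextual.get((prev, c), 0.0))
        tags.append(best)
        prev = best
    return tags

print(tag(["John", "hit", "Pete"]))  # ['Noun', 'Verb', 'Noun']
```

For hit, the lexical probability alone prefers Verb here, and the context (a preceding noun) reinforces that choice; a real system would search over all tag sequences (e.g. with dynamic programming) rather than decide greedily.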

A good tagger is instrumental in a large number of language engineering applications (ranging from text-to-speech over parsing to information retrieval). However, the target tag set and training corpus differ from one application to the next, and making a tagger by hand is expensive and difficult. Therefore, robust, accurate taggers which can be automatically learned from example corpora are a commercially interesting product.

State of the Art

There are rule-based systems (hand-made or using rule induction), statistical systems (using (hidden) Markov modelling and dynamic programming), and case-based approaches. Although a thorough, reliable comparison of these approaches has not yet been achieved, it seems to be the case that all approaches converge to a 96-97% accuracy on new text of the same type as the training material. This may seem pretty good, but when looking at accuracy on whole sentences, it means that only about 33% of sentences are tagged completely correctly.
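The drop from word-level to sentence-level accuracy follows from compounding errors. A back-of-the-envelope check, assuming tagging errors are independent across words (a simplification) and an average sentence length of about 27 words (an assumed figure):

```python
# Sentence-level accuracy under independent per-word errors:
# a sentence is fully correct only if every word is correct.
word_accuracy = 0.96
avg_sentence_length = 27  # assumed average length in words
sentence_accuracy = word_accuracy ** avg_sentence_length
print(f"{sentence_accuracy:.2f}")  # roughly 0.33
```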

Opportunities for ILP

One residual problem that has been isolated is the problem of long-distance dependencies. All current approaches (except perhaps the hand-crafted rule-based one) are limited to a local context of 2 to 4 words around the focus word to be disambiguated. However, sometimes words far from the focus word may trigger a particular disambiguation. ILP may help in solving this problem.

Data Description

Several tagged corpora are available for English and Dutch. Typically, a corpus of correctly (with some noise) tagged text might look like this (Wall Street Journal Corpus).
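A minimal reader for such material can be sketched as below, assuming the slash-delimited word/TAG format commonly used for the tagged WSJ (Penn Treebank) material; the sentence in the example is illustrative, not an actual corpus excerpt.

```python
def read_tagged(line):
    """Split a line of 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition: split on the last slash, since words may contain slashes
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

print(read_tagged("John/NNP hit/VBD Pete/NNP ./."))
```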

A tagger would be trained on this knowledge (and additional domain knowledge), take sequences of untagged text as input, and produce the corresponding tag strings.


The Parsing Problem

Problem Description

One step further than tagging is parsing: assigning a syntactic structure (parse tree), reflecting the relations between parts of the sentence, to a sentence. A parse tree allows a semantic component to compute a meaning representation (logical form) for the sentence.

Data Description

Corpora annotated with syntactic structures (treebanks) are available which can be used to derive training material. Although there has been a lot of research on the development of grammars for language fragments (these might be used as background knowledge), not many useful ones are publicly available. The WSJ corpus contains tree structures of the following type:
( (S
   (VP increased
       (NP its quarterly)
       (PP to
           (NP 10 cents))
       (PP from
           (NP seven cents
            (NP a share))))))
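A minimal reader for this bracketed notation can be sketched as follows, assuming well-formed parentheses. It turns the string into nested Python lists, where a tree is [label, child, child, ...] and leaves are plain strings:

```python
import re

def parse_tree(text):
    """Parse a bracketed treebank string into nested lists."""
    # Tokenize into parentheses and non-whitespace symbols
    tokens = re.findall(r"\(|\)|[^()\s]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            leaf = tokens[pos]
            pos += 1
            return leaf
        pos += 1  # skip '('
        children = []
        while tokens[pos] != ")":
            children.append(node())
        pos += 1  # skip ')'
        return children

    return node()

print(parse_tree("(S (VP increased (NP its quarterly)))"))
```

On the WSJ fragment above, the outer unlabelled parentheses simply yield a one-element list wrapping the S node.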
