| Application domain: | Natural Language Processing |
| Source: | UIA |
| Dataset size: | 6.8 Mbytes |
| Data format: | Text file |
| Systems Used: | Progol, MACCENT, WARMR |
| Pointers: | Luc Dehaspe (ldh@cs.kuleuven.ac.be) |
Raw data from the Wall Street Journal corpus (Penn Treebank Project, Release 2) was pre-processed by UIA/KUB, which inserted a blank line after each sentence. UIA/KUB also induced a context-free grammar, with 17,302 production rules, from the pre-parsed parts of the Wall Street Journal corpus.
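A minimal sketch of how such a blank-line-separated corpus might be read into per-sentence lists of (word, tag) pairs; the two-column "word tag" layout assumed here is illustrative and may differ from the exact UIA/KUB file format.

```python
def read_sentences(path):
    """Read a corpus in which sentences are separated by blank lines.

    Each non-blank line is assumed to hold one token and its tag
    (whitespace-separated); the exact column layout of the pre-processed
    files may differ.
    """
    sentences, current = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:                # blank line ends the sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, tag = line.split()[:2]
            current.append((word, tag))
    if current:                         # corpus may not end with a blank line
        sentences.append(current)
    return sentences
```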
WARMR (Dehaspe and De Raedt 1997) was used to exhaustively search the WSJ corpus for first-order association rules that describe frequent properties of sentences, such as:
in 81% of the sentences in which a determiner occurs followed by an adjective, there is also a sequence determiner, adjective, noun

and frequent properties of words, such as:
74% of the words that have a verb somewhere to the left within distance 4 and a preposition or subordinating conjunction to the right within distance 4 have a noun somewhere to the left within distance 4.
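As an illustration of the kind of counting behind such rules, the following sketch computes the confidence of the first rule above on a list of tagged sentences. The Penn Treebank tag names (DT, JJ, NN) and the sentence structure are assumptions, and WARMR itself discovers such patterns by relational search rather than with hand-written loops.

```python
def has_pattern(tags, pattern):
    """True if the tag sequence `pattern` occurs contiguously in `tags`."""
    n = len(pattern)
    return any(tags[i:i + n] == pattern for i in range(len(tags) - n + 1))

def rule_confidence(sentences):
    """Confidence of the rule 'DT JJ => DT JJ NN', over sentences given as
    lists of (word, tag) pairs; Penn Treebank tag names are assumed."""
    body = [s for s in sentences
            if has_pattern([t for _, t in s], ["DT", "JJ"])]
    head = [s for s in body
            if has_pattern([t for _, t in s], ["DT", "JJ", "NN"])]
    return len(head) / len(body) if body else 0.0
```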
In an initial small experiment with MACCENT (Dehaspe 1997), a probability distribution based on the context of the word was induced for words that are ambiguous between exactly the three classes noun, adjective, and verb. This initial experiment demonstrates that the (poor) output of MACCENT can successfully be combined with other stochastic information sources such as the lexicon.
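A minimal sketch of the kind of context-conditioned distribution involved, using a multinomial logistic regression (a maximum-entropy model) as a stand-in for MACCENT; the feature templates and the toy training data are purely illustrative assumptions.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def context_features(words, tags, i):
    """Simple contextual features for position i; the particular feature
    templates are illustrative, not MACCENT's actual ones."""
    return {
        "word": words[i].lower(),
        "suffix3": words[i][-3:].lower(),
        "prev_tag": tags[i - 1] if i > 0 else "BOS",
        "next_word": words[i + 1].lower() if i + 1 < len(words) else "EOS",
    }

# Toy training contexts for words ambiguous between noun, adjective and verb.
train_contexts = [
    {"word": "share", "suffix3": "are", "prev_tag": "DT", "next_word": "price"},
    {"word": "share", "suffix3": "are", "prev_tag": "TO", "next_word": "the"},
    {"word": "light", "suffix3": "ght", "prev_tag": "DT", "next_word": "bulb"},
    {"word": "light", "suffix3": "ght", "prev_tag": "VBZ", "next_word": "enough"},
]
train_classes = ["noun", "verb", "adjective", "adjective"]

vec = DictVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vec.fit_transform(train_contexts), train_classes)

# Probability distribution over {noun, adjective, verb} for a new context,
# which could then be combined with other sources such as the lexicon.
new_context = {"word": "share", "suffix3": "are", "prev_tag": "PRP", "next_word": "it"}
print(dict(zip(model.classes_, model.predict_proba(vec.transform([new_context]))[0])))
```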
In an ongoing, more systematic experiment, WARMR is used as a feature generator for MACCENT. In a first stage, WARMR generates frequent features, which are then added collectively to MACCENT. When restricted to a baseline set of non-relational features, MACCENT in this mode emulates the state-of-the-art system of Ratnaparkhi (1996), with an excellent performance of around 96.8% accuracy on test data with unknown words. The aim of the experiment is to study the effect of additional relational features.
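For context, a sketch of the sort of non-relational, window-based feature templates used in Ratnaparkhi-style maximum-entropy tagging, which the baseline mode above is stated to emulate; the exact templates and the rare-word threshold used here are assumptions.

```python
def baseline_features(words, i, prev_tag, prev2_tag, counts=None, rare_threshold=5):
    """Non-relational feature templates in the style of Ratnaparkhi (1996):
    surrounding words, previous tags, and, for rare or unknown words,
    prefix/suffix and shape cues. Details are illustrative."""
    word = words[i]
    feats = {
        "w": word,
        "w-1": words[i - 1] if i > 0 else "BOS",
        "w-2": words[i - 2] if i > 1 else "BOS",
        "w+1": words[i + 1] if i + 1 < len(words) else "EOS",
        "w+2": words[i + 2] if i + 2 < len(words) else "EOS",
        "t-1": prev_tag,
        "t-2,t-1": prev2_tag + "|" + prev_tag,
    }
    rare = counts is None or counts.get(word, 0) < rare_threshold
    if rare:  # unknown/rare words get morphological and shape features
        for k in range(1, 5):
            feats["prefix" + str(k)] = word[:k]
            feats["suffix" + str(k)] = word[-k:]
        feats["has_digit"] = any(c.isdigit() for c in word)
        feats["has_hyphen"] = "-" in word
        feats["has_upper"] = any(c.isupper() for c in word)
    return feats
```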
YORK used a small subset of the original grammar comprising only 40 production rules. Using this small grammar and a chart parser provided by KUL, YORK produced background knowledge in the form of charts for each sentence in the training data. These charts were represented as ground atoms: 'VP'(1,7,11) states that there is a VP between edges 7 and 11 in sentence 1. Non-ground clauses were also added to the background knowledge to allow simple morphological analysis of words. Examples were represented as 'CC'(8,2,3,nor), meaning that the word between edges 2 and 3 in sentence 8 is ``nor'' and is tagged as a ``CC''. Using such a knowledge representation, it is possible to combine lexical, morphological and syntactic information to produce tagging rules.

However, making use of the large amount of data requires further work on (i) sampling and (ii) learning from datasets (and possibly background knowledge) residing on disk. Work on the latter is ongoing at YORK. YORK intends to construct features from the tagging clauses induced by Progol and add these as features in KUL's MACCENT system.
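A minimal sketch, assuming a Python-style encoding of the same information, of how the chart facts and word examples could be combined into a tagging check. The actual YORK/KUL representation is first-order logic, with atoms such as 'VP'(1,7,11) and 'CC'(8,2,3,nor), and the rule shown here is purely illustrative, not one induced by Progol.

```python
from collections import namedtuple

# Ground facts mirroring the logical representation described above:
# Chart('VP', 1, 7, 11)      ~  'VP'(1,7,11): a VP between edges 7 and 11 of sentence 1.
# Word('CC', 8, 2, 3, 'nor') ~  'CC'(8,2,3,nor): the word between edges 2 and 3
#                               of sentence 8 is "nor", tagged CC.
Chart = namedtuple("Chart", "label sentence start end")
Word = namedtuple("Word", "tag sentence start end form")

charts = [Chart("VP", 1, 7, 11), Chart("NP", 1, 5, 7)]
words = [Word("CC", 8, 2, 3, "nor"), Word("NN", 1, 6, 7, "market")]

def tag_as_noun(word, charts):
    """Toy tagging rule combining lexical, morphological and syntactic
    information: a word ending in '-et' that lies inside some NP chart
    of its sentence is tagged as a noun. Purely illustrative."""
    inside_np = any(c.label == "NP" and c.sentence == word.sentence
                    and c.start <= word.start and word.end <= c.end
                    for c in charts)
    return word.form.endswith("et") and inside_np

print([w.form for w in words if tag_as_noun(w, charts)])  # -> ['market']
```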