Learning Natural Language Parsers (York+CTU)

Application domain: Learning natural language parsers

Source: T5000 treebank + lexicon Lex5000 (artificially generated data)

Dataset size: 5046 parsed sentences + a lexicon with 84 word-forms, and 468 combinations of word-form, part of speech, and semantic tag.

Data format: Prolog

Systems used: LAPIS

References: (Kazakov 1999)

Pointers: kazakov@cs.york.ac.uk, step@labe.felk.cvut.cz

There is a history of research focussed on learning shift-reduce parsers from syntactically annotated corpora by means of logic-based machine learning techniques. The presence of lexical semantic tags in the treebank has proved useful for learning semantic constraints, which limit the amount of nondeterminism in the parsers. The level of generality of the semantic tags used is of direct importance to that task. We combine the ILP system LAPIS with the lexical resource WordNet to learn parsers with semantic constraints. The generality of these constraints is automatically selected by LAPIS from a number of options provided by the corpus annotator. The performance of the learned parsers is evaluated on an original corpus (T5000), which is also described in the article (Kazakov 1999).

Lex5000

The lexicon Lex5000 is based on the lexical resource WordNet, which maps word-forms to lexical entries and to semantic concepts called "synsets". Lex5000 consists of the following files and predicates:

The files noun.lex and verb.lex contain the predicates nlx/2 and vlx/2, respectively, which map noun and verb word-forms into lexical entries.
The files noun_path_i.pl and verb_path_i.pl contain the predicates npath/2 and vpath/2, respectively. Each clause maps a lexical entry to one of the synsets to which it belongs, along with the list of that synset's hypernyms, ordered from the most general to the most specific. There are therefore at least as many clauses per lexical entry as there are synsets that include that entry. There may also be more than one "path" of hypernyms per synset, since some synsets in WordNet have more than one hypernym. For the sake of brevity, the original synset indices used in WordNet have been replaced by shorter, yet unique, identifiers.
The file i_map.pl contains a single predicate, imap/2, which maps the identifiers used in the predicates npath/2 and vpath/2 to the corresponding synset indices in WordNet 1.6, e.g. imap(i10, '100011937').
The file ambig_lex.pl contains all the words for which no semantic tag has been provided, along with a single top-level predicate dict(WordForm, PoS, TopDownPath) serving as an interface to the whole lexicon.
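The relationship between these predicates can be sketched in Python. This is only an illustration of the data shapes described above, not the actual Lex5000 content: apart from the pair i10 / '100011937' quoted in the text, every word-form, lexical entry, identifier and PoS label below is invented, and dict_lookup is a hypothetical stand-in for dict/3.

```python
# Toy data mirroring the shape of the Lex5000 predicates.
# Only the IMAP entry i10 -> '100011937' comes from the text; everything
# else is invented for illustration.
NLX = {"flight": ["flight"]}   # like nlx/2: noun word-form -> lexical entries
VLX = {"book": ["book"]}       # like vlx/2: verb word-form -> lexical entries

# Like npath/2 and vpath/2: lexical entry -> hypernym paths, each ordered
# from the most general synset to the most specific one.
NPATH = {"flight": [["i10", "i20", "i31"], ["i10", "i45"]]}
VPATH = {"book": [["i70", "i82"]]}

IMAP = {"i10": "100011937"}    # like imap/2: short identifier -> WordNet index


def dict_lookup(word_form):
    """Return (word_form, pos, top_down_path) triples, in the spirit of dict/3."""
    results = []
    for pos, lexicon, paths in (("n", NLX, NPATH), ("v", VLX, VPATH)):
        for entry in lexicon.get(word_form, []):
            for path in paths.get(entry, []):
                results.append((word_form, pos, path))
    return results


for triple in dict_lookup("flight"):
    print(triple)
```

Each returned triple corresponds to one semantic tag, so a word-form with several hypernym paths yields several triples.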

The ratio between word-forms, lexical entries and semantic tags (paths of hypernyms) is 73:74:265 for the nouns, and 26:27:183 for the verbs. There are 84 different word-forms in the whole lexicon, and 468 combinations of a word-form, PoS and semantic tag, which corresponds to an average of 5.57 tags per word-form.
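The average quoted above can be checked with a few lines of Python. The per-PoS figures in the second half are an assumption: they read the terms of the two ratios as absolute counts, which the text does not state explicitly.

```python
# Figures quoted in the text.
word_forms = 84
combinations = 468
average_tags = combinations / word_forms
print(f"{average_tags:.2f}")  # prints 5.57

# Assumption: the ratio terms are treated as absolute counts.
noun_forms, noun_paths = 73, 265
verb_forms, verb_paths = 26, 183
print(f"{noun_paths / noun_forms:.2f}")  # prints 3.63
print(f"{verb_paths / verb_forms:.2f}")  # prints 7.04
```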

T5000

T5000 is also an artificial resource, generated from 12 sentence templates. The lexicon used is the one present in Lex5000. The treebank contains 5046 sentences and their parse trees, in which nouns and verbs are semantically tagged. Each semantic tag is of the type used in Lex5000. Although the words in Lex5000 can have more than one semantic tag assigned, the treebank is generated with the help of an additional lexicon (disamb_lex_i.pl), from which only the correct semantic tag is taken for each lexical slot in the templates. The 12 templates are shown below, with the variable slots in bold face:

I want to book the first flight to London on Monday
I want to book the first flight to the capital of Bulgaria on Monday
I want to find a flat with a friend
I want to find a flat with a vista
I want to go to London by boat
I want to go to the museum at the end of the street
I want to go to the museum at the end of the week
I want to go to the museum by the river
I want to go to the museum on Monday
I want to go to the museum on the hill
I want to rent a limousine with a chauffeur
I want to rent a limousine with a friend
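Generating sentences from a template with variable slots can be sketched as follows. This is a minimal illustration, not the procedure actually used to build T5000: which words of a template are slots, the slot names, and the filler words are all assumptions made for the example.

```python
from itertools import product

# Hypothetical template: the choice of slots is an assumption.
TEMPLATE = "I want to go to {place} by {vehicle}"

# Invented fillers; in T5000 they would come from the disambiguated lexicon.
FILLERS = {
    "place": ["London", "Paris"],
    "vehicle": ["boat", "train"],
}


def expand(template, fillers):
    """Yield one sentence per combination of slot fillers."""
    names = list(fillers)
    for combo in product(*(fillers[name] for name in names)):
        yield template.format(**dict(zip(names, combo)))


sentences = list(expand(TEMPLATE, FILLERS))
for sentence in sentences:
    print(sentence)
```

With two fillers per slot, this single template yields four sentences; a treebank generator would additionally attach the parse tree and the one correct semantic tag to each filled slot.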

Experiments with LAPIS

The ability of LAPIS to induce parsers from treebanks was tested on the T5000 treebank. The combination of LAPIS and WordNet proves effective for learning specialised parsers that contain a very limited amount of nondeterminism. The setting used leads to the construction of parsers based on grammars with partially lexicalised and/or unfolded rules, with the additional help of lexical semantic constraints. The parsers learned with LAPIS, which include semantic constraints, have higher precision than a CFG parser while maintaining the same level of accuracy.
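The role played by the generality of a semantic constraint can be illustrated with a small Python sketch. It assumes, purely for illustration, that a constraint is a test that a given synset identifier occurs on a word's top-down hypernym path; the paths and identifiers are invented, and the code is not a description of how LAPIS represents or selects constraints.

```python
# Invented top-down hypernym paths (most general synset first).
PATHS = {
    "friend": ["i1", "i5", "i12"],
    "chauffeur": ["i1", "i5", "i19"],
    "vista": ["i2", "i8"],
}


def satisfies(word, required_synset):
    """True if the required synset lies on the word's hypernym path."""
    return required_synset in PATHS.get(word, [])


def admitted(required_synset):
    """Words that pass a constraint at the given level of generality."""
    return sorted(w for w in PATHS if satisfies(w, required_synset))


print(admitted("i5"))   # general constraint
print(admitted("i12"))  # specific constraint
```

A constraint placed high on the path (here i5) admits both friend and chauffeur, while one placed lower (i12) admits only friend. A parser action guarded by such a test therefore fires for more or fewer words depending on the level chosen, which is why the choice of level affects the amount of nondeterminism.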

Bibliography

D. Kazakov. (1999) Combining LAPIS and WordNet for the learning of LR parsers with optimal semantic constraints. In Saso Dzeroski and Peter Flach, editors, The Ninth International Workshop ILP-99, Bled, Slovenia. Springer-Verlag.

D. Kazakov. (1996) Natural Language Processing Applications of Machine Learning. PhD thesis. CTU Prague.

D. Kazakov. (1996) An Inductive Approach to Natural Language Parser Design. In Proceedings of NeMLaP 2: 209-217. ISBN 975-7679-16-X.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. (1993) Introduction to WordNet: An on-line lexical database. Technical report, University of Princeton.
