Application domain: | Learning natural language parsers |
Source: | T5000 treebank + lexicon Lex5000 (artificially generated data) |
Dataset size: | 5046 parsed sentences + a lexicon with 84 word-forms, and 468 combinations of word-form, part of speech, and semantic tag. |
Data format: | Prolog |
Systems used: | LAPIS |
References: | (Kazakov 1999) |
Pointers: | kazakov@cs.york.ac.uk, step@labe.felk.cvut.cz |
There is a history of research focussed on learning shift-reduce parsers from syntactically annotated corpora by means of logic-based machine learning techniques. The presence of lexical semantic tags in the treebank has proved useful for learning semantic constraints, which limit the amount of nondeterminism in the parsers. The level of generality of the semantic tags used is of direct importance to that task. We combine the ILP system LAPIS with the lexical resource WordNet to learn parsers with semantic constraints. The generality of these constraints is selected automatically by LAPIS from a number of options provided by the corpus annotator. The performance of the learned parsers is evaluated on an original corpus (T5000), which is also described in the article (Kazakov 1999).
The lexicon Lex5000 is based on the lexical resource WordNet, which maps word-forms to lexical entries, and semantic concepts called "synsets". Lex5000 consists of the following files and predicates:
The ratio between word-forms, lexical entries and semantic tags (paths of hypernyms) is 73:74:265 for the nouns, and 26:27:183 for the verbs. There are 84 different word-forms in the whole lexicon, and 468 combinations of word-form, PoS and semantic tag, which corresponds to an average of 5.57 tags per word-form.
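The average quoted above can be checked with a few lines of Python. This is only a worked calculation from the figures stated in this description; the variable names are illustrative and do not correspond to any Lex5000 file or predicate.

```python
# Figures taken from the lexicon description above.
WORD_FORMS = 84      # distinct word-forms in the whole lexicon
COMBINATIONS = 468   # (word-form, PoS, semantic tag) combinations

# Average number of tagged readings per word-form.
avg_tags_per_word_form = COMBINATIONS / WORD_FORMS

print(f"{avg_tags_per_word_form:.2f}")  # prints 5.57
```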
T5000 is also an artificial resource, generated from 12 sentence templates. The lexicon used is Lex5000. The treebank contains 5046 sentences and their parse trees, in which nouns and verbs are semantically tagged. Each semantic tag is of the type used in Lex5000. Although the words in Lex5000 can have more than one semantic tag assigned, the treebank is generated with the help of an additional lexicon (disamb_lex_i.pl), from which only the correct semantic tag is taken for each lexical slot in the templates. The 12 templates are shown below, with the variable slots in bold face:
The ability of LAPIS to induce parsers from treebanks was tested
on the T5000 treebank.
The combination of LAPIS and WordNet proves effective for
learning specialised parsers that contain a very limited amount
of nondeterminism. The setting used leads to the construction of
parsers based on grammars with partially lexicalised and/or unfolded
rules, with the additional help of lexical semantic constraints.
The parsers learned with LAPIS have semantic constraints
and achieve higher precision than a CFG parser, while maintaining
the same level of accuracy.
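To illustrate how the generality of a semantic constraint affects the number of words it admits, here is a minimal Python sketch. It assumes that a semantic tag is a hypernym path ordered from general to specific, and that a constraint is a prefix of such a path. The paths, the words and the function names (`satisfies`, `admitted`) are invented for illustration. They are not taken from Lex5000, and this is not how LAPIS represents or selects constraints internally.

```python
from typing import Dict, List, Tuple

# Hypothetical hypernym paths (general -> specific); NOT data from Lex5000.
TAGS: Dict[str, Tuple[str, ...]] = {
    "dog": ("entity", "organism", "animal", "dog"),
    "cat": ("entity", "organism", "animal", "cat"),
    "oak": ("entity", "organism", "plant", "oak"),
    "stone": ("entity", "object", "stone"),
}


def satisfies(tag: Tuple[str, ...], constraint: Tuple[str, ...]) -> bool:
    """Return True if the constraint is a prefix of the hypernym path."""
    return tag[:len(constraint)] == constraint


def admitted(constraint: Tuple[str, ...]) -> List[str]:
    """List the words whose tag satisfies the constraint, in sorted order."""
    return sorted(word for word, tag in TAGS.items() if satisfies(tag, constraint))


# A very general constraint admits every word, so it removes no nondeterminism.
general = admitted(("entity",))
# A more specific constraint admits fewer words and prunes more parser choices.
specific = admitted(("entity", "organism", "animal"))

print(general)
print(specific)
```

In this toy setting, a shorter prefix gives a more general constraint and a longer prefix a more specific one. The choice between them is the kind of trade-off that, according to the description above, LAPIS makes automatically from the options provided by the corpus annotator.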