Learning Slovene nominal paradigms (LAI)

Application domain: Learning Slovene nominal paradigms
Source: LAI
Dataset size: 37 datasets, >11 MB
Data format: Prolog
Systems Used: FOIDL
Pointers: Tomaz.Erjavec@ijs.si
References: (Dzeroski and Erjavec 1997)

The data

The task addressed here was to learn the inflectional morphology of Slovene nouns. Both analysis, i.e., generating the lemma (base form) from the oblique form, and synthesis, i.e., generating the oblique form from the lemma, were considered. Each morphosyntactic description (MSD) was treated as a separate learning task. In total, 37 MSDs were considered, taking into account that Slovene nouns have three possible genders, inflect for three numbers and six cases, and have 16 syncretic MSDs. The data came from a lexicon of word-forms collected in association with a corpus of Slovene text. The number of distinct examples per MSD dataset averages 2808 for the 13 masculine, 2650 for the 13 feminine and 1286 for the 11 neuter MSDs. Words were represented orthographically, i.e., as lists of letters. The predicate split(A,B,C), which splits words (lists of letters) into prefixes and suffixes, was used as background knowledge.
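To make the role of the background predicate concrete, the following is a minimal Python sketch of what split(A,B,C) enumerates. It is not the Prolog definition shipped with the data. It assumes that both parts must be non-empty, which the description above does not state, and the example word is chosen only for illustration.

```python
from typing import Iterator, List, Tuple


def split(word: List[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield every (prefix, suffix) pair such that prefix + suffix == word.

    Assumption (not stated in the dataset description): both the prefix
    and the suffix must contain at least one letter.
    """
    for i in range(1, len(word)):
        yield word[:i], word[i:]


# Words are lists of letters, mirroring the representation described above.
word = list("mize")
pairs = list(split(word))

for prefix, suffix in pairs:
    print("".join(prefix), "+", "".join(suffix))
# prints:
# m + ize
# mi + ze
# miz + e
```

A learner with access to such a predicate can express suffix-replacement rules: for instance, relating "mize" to a lemma that shares the prefix "miz" but ends in a different suffix.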

Experiments with FOIDL

For each MSD, 200 examples from the lexicon were taken as training examples. Decision lists were learned for both analysis and synthesis and then tested on the remaining lexicon entries of the respective MSD. Relatively good overall performance (average accuracy of 91.5%) was achieved, even though very limited background knowledge was used.
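How a learned decision list is applied and scored can be sketched as follows. This is an illustrative Python sketch, not FOIDL output: the `Rule` class, the helper names and the two rules are hypothetical, and the rules are hand-written for this example rather than taken from the experiments. The key property shown is that rules are tried in order and the first match wins, so specific exceptions must precede general rules.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Rule:
    old_suffix: str           # suffix the input word must end with
    new_suffix: str           # suffix that replaces it
    whole_word: bool = False  # True: the rule covers exactly one word

    def apply(self, word: str) -> Optional[str]:
        """Return the transformed word, or None if the rule does not match."""
        if self.whole_word:
            matches = word == self.old_suffix
        else:
            matches = word.endswith(self.old_suffix)
        if not matches:
            return None
        return word[: len(word) - len(self.old_suffix)] + self.new_suffix


def apply_decision_list(rules: List[Rule], word: str) -> Optional[str]:
    """Return the output of the first matching rule, or None if none match."""
    for rule in rules:
        result = rule.apply(word)
        if result is not None:
            return result
    return None


def accuracy(rules: List[Rule], test_pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (input, expected) pairs the decision list gets right."""
    correct = sum(apply_decision_list(rules, src) == tgt for src, tgt in test_pairs)
    return correct / len(test_pairs)


# Hypothetical list for synthesising a masculine genitive singular form
# from the lemma; hand-written for illustration, not learned by FOIDL.
GENITIVE_SG_MASC = [
    Rule("pes", "psa", whole_word=True),  # exception: pes -> psa
    Rule("", "a"),                        # generalization: append -a
]

print(apply_decision_list(GENITIVE_SG_MASC, "pes"))    # psa
print(apply_decision_list(GENITIVE_SG_MASC, "korak"))  # koraka
```

Reversing the rule order would let the general rule fire first and produce "pesa" for "pes", which is why exceptions are placed ahead of generalizations in a decision list.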

For synthesis, the average accuracy over all 37 MSDs is 91.4%, with the feminine gender leading at an average of 97.8%, followed by the neuter at 96.9% and the masculine at 80.5%. The average number of rules is 16.4, of which 9.1 are exceptions and 7.3 are generalizations.

For analysis, the average accuracy over all the MSDs is 91.5%, which is almost identical to the synthesis accuracy (91.4%). Here the neuter scores highest, at 95.9%, while the feminine scores 94.8% and the masculine 84.5%. The average number of rules is 19.5 (compared with 16.4 for synthesis), with an average of 10.5 exceptions and 9.1 generalizations.
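As a consistency check, the overall averages can be recomputed from the per-gender figures by weighting each gender by its number of MSDs (13 masculine, 13 feminine, 11 neuter). The sketch below uses only numbers quoted in this description. Because the per-gender averages are themselves rounded, the result can differ from the reported overall figure in the last digit.

```python
# Number of MSDs per gender, as given in the data description.
MSD_COUNTS = {"masculine": 13, "feminine": 13, "neuter": 11}

# Reported per-gender average accuracies (percent).
SYNTHESIS = {"masculine": 80.5, "feminine": 97.8, "neuter": 96.9}
ANALYSIS = {"masculine": 84.5, "feminine": 94.8, "neuter": 95.9}


def weighted_mean(per_gender: dict, counts: dict) -> float:
    """Average over all MSDs, given per-gender averages and MSD counts."""
    total = sum(counts.values())
    return sum(per_gender[g] * counts[g] for g in counts) / total


synthesis_avg = weighted_mean(SYNTHESIS, MSD_COUNTS)
analysis_avg = weighted_mean(ANALYSIS, MSD_COUNTS)

print(f"synthesis: {synthesis_avg:.2f}")  # synthesis: 91.45
print(f"analysis:  {analysis_avg:.2f}")   # analysis:  91.51
```

Both values agree with the reported 91.4% and 91.5% to within the rounding of the inputs.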

Reference

  1. S. Dzeroski and T. Erjavec: Induction of Slovene Nominal Paradigms. In Proc. 7th International Workshop on Inductive Logic Programming, pages 141-148. Springer, Berlin, 1997.

