Unsupervised Learning of Word Segmentation Rules (York)
Application domain: |
Natural Language Processing |
Source: |
Dataset of French verbs
(original dataset based on (Bescherelle 1, 1980)) |
Dataset size: |
221 paradigms, 9587 different word-forms |
Data format: |
Prolog |
Systems used: |
GA&ILP hybrid learner of word segmentation rules |
Pointers: |
kazakov,suresh@cs.york.ac.uk |
The Dataset
The dataset used in the experiments described (Kazakov and Manandahar)
consists of two parts. The first part contains the whole paradigms of 221
regular (first and second conjugation) French verbs, i.e. for
each of these verbs its all different synthetic forms are enumerated.
These regular verbs can be grouped in the classes according to the
changes their stem undergoes. These classes are shown in
Table 1, where each class is represented by the
infinitive of one of the verbs belonging to the class. The table also
shows the number of verbs in each of the conjugational classes. The
`irregularity' of the verbs in the second part of the dataset also
shows in the higher number of different word-forms per paradigm, as
compared to the regular case (see Table 2). The
words in this first part of the dataset have been annotated with both
their invariable- and variable-stem segmentations.
The second part of the dataset comprises the paradigms of the
irregular (III conjugation) verbs aller (to go),
être (to be), and examples of the following conjugational
classes boire (2 verbs), cuire (5 verbs), écrire (4 verbs), mettre (3 verbs), rendre (4 verbs),
recevoir (3 verbs), courir (3 verbs), and tenir (5
verbs). Many of the verbs within the same class are derived from each
other by prefixation. There are 31 irregular verb paradigms in the
dataset.
Table 1:
Regular verb conjugation classes in the French Verb dataset
Class |
Conj. |
Short stem alternations |
No of verbs |
aimer |
I |
invariable |
128 |
placer |
I |
plac-/plaç- |
13 |
manger |
I |
mang-/mange- |
12 |
peser |
I |
pes-/pèse- |
6 |
céder |
I |
céd-/cèd- |
6 |
jeter |
I |
jet-/jett- |
3 |
appeler |
I |
appel-/appell- |
4 |
finir |
II |
invariable |
49 |
Total |
221 |
|
Table 2:
French Verb dataset statistics
Dataset |
Word-forms |
Different word-forms |
Diff. w-fs per paradigm |
Regular verbs |
11322 |
8364 |
37.8 |
Irregular verbs |
1572 |
1223 |
39.5 |
Reg+Irreg. |
12894 |
9587 |
38.0 |
The Application
The hybrid GA&ILP approach combines unsupervised and supervised
learning techniques for generation of word segmentation rules from an
unannotated list of words. A bias for word segmentation is
reformulated as a fitness function of a simple genetic algorithm,
which is used for the search of a word list segmentation that
corresponds to the best bias value (Kazakov 1997). In the second
phase, the list of segmented words obtained from the genetic
algorithm is used as an input for two decision list learning
algorithms, namely FOIDL (Mooney and Califf 1995)
and CLOG (Manandhar et al. 1998).
The result is a logic program in a
decision list representation that can be used for segmentation of
isolated words, both from the training corpus and unseen ones. This
program also contains a number of word constituents, which are
characteristic for the given language and can be used in other
learning tasks related to word morphology.
The so obtained decision list consists of a number of exceptions, each
defining the segmentation of a sigle word, and a set of rules. It can
be observed that, in general, the exceptions in the decision list
correspond to words which have not been segmented by the GA in an
optimal way w.r.t. the segmentation bias. We show that a better
segmentation can be obtained if exceptions are removed from the
decision list and only rules are used for the segmentation of the
input list of words.
The hybrid approach to word segmentation is an efficient combination
of GA and ILP. The former provides a first approximation of the
concept learned and reduce dramatically the search space. The latter,
on its turn, refines the target concept reformulating it as an ordered
list of rules applicable on isolated words, i.e. these rules do not
require other words to provide the segmentation context needed by the
GA. The results of the tests show that a set of rules for word
segmentation can be learned from a limited amount of
unanotated words, as opposed to other approaches where very large
and/or tagged corpora are used.
Bibliography
-
Bescherelle 1.
Hatier-Paris, 1980.
-
Dimitar Kazakov.
Unsupervised learning of naïve morphology with genetic
algorithms.
In W. Daelemans, A. van den Bosch, and A. Weijters, editors,
Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural
Language Processing Tasks, pages 105-112, Prague, April 1997.
-
Dimitar Kazakov and Suresh Manandhar.
Unsupervised learning of word segmentation rules with genetic
algorithms and inductive logic programming.
Submitted to Machine Learning Journal.
-
Suresh Manandhar, Saso Dzeroski, and Tomaz Erjavec.
Learning Multilingual Morphology with CLOG.
In The Eighth International Conference on Inductive Logic
Programming (ILP'98), Madison, Wisconsin, USA, 1998.
-
Raymond J. Mooney and Mary Elaine Califf.
Induction of first-order decision lists: Results on learning the
past tense of English verbs.
Journal of Artificial Intelligence Research, June 1995.