Unsupervised Learning of Word Segmentation Rules (York)

Application domain: Natural Language Processing
Source: Dataset of French verbs (original dataset based on (Bescherelle 1, 1980))
Dataset size: 221 paradigms, 9587 different word-forms
Data format: Prolog
Systems used: GA&ILP hybrid learner of word segmentation rules
Pointers: kazakov,suresh@cs.york.ac.uk

The Dataset

The dataset used in the experiments described (Kazakov and Manandahar) consists of two parts. The first part contains the whole paradigms of 221 regular (first and second conjugation) French verbs, i.e. for each of these verbs its all different synthetic forms are enumerated. These regular verbs can be grouped in the classes according to the changes their stem undergoes. These classes are shown in Table 1, where each class is represented by the infinitive of one of the verbs belonging to the class. The table also shows the number of verbs in each of the conjugational classes. The `irregularity' of the verbs in the second part of the dataset also shows in the higher number of different word-forms per paradigm, as compared to the regular case (see Table 2). The words in this first part of the dataset have been annotated with both their invariable- and variable-stem segmentations.

The second part of the dataset comprises the paradigms of the irregular (III conjugation) verbs aller (to go), être (to be), and examples of the following conjugational classes boire (2 verbs), cuire (5 verbs), écrire (4 verbs), mettre (3 verbs), rendre (4 verbs), recevoir (3 verbs), courir (3 verbs), and tenir (5 verbs). Many of the verbs within the same class are derived from each other by prefixation. There are 31 irregular verb paradigms in the dataset.

Table 1: Regular verb conjugation classes in the French Verb dataset
Class Conj. Short stem alternations No of verbs
aimer I invariable 128
placer I plac-/plaç- 13
manger I mang-/mange- 12
peser I pes-/pèse- 6
céder I céd-/cèd- 6
jeter I jet-/jett- 3
appeler I appel-/appell- 4
finir II invariable 49
Total 221

Table 2: French Verb dataset statistics
Dataset Word-forms Different word-forms Diff. w-fs per paradigm
Regular verbs 11322 8364 37.8
Irregular verbs 1572 1223 39.5
Reg+Irreg. 12894 9587 38.0

The Application

The hybrid GA&ILP approach combines unsupervised and supervised learning techniques for generation of word segmentation rules from an unannotated list of words. A bias for word segmentation is reformulated as a fitness function of a simple genetic algorithm, which is used for the search of a word list segmentation that corresponds to the best bias value (Kazakov 1997). In the second phase, the list of segmented words obtained from the genetic algorithm is used as an input for two decision list learning algorithms, namely FOIDL (Mooney and Califf 1995) and CLOG (Manandhar et al. 1998). The result is a logic program in a decision list representation that can be used for segmentation of isolated words, both from the training corpus and unseen ones. This program also contains a number of word constituents, which are characteristic for the given language and can be used in other learning tasks related to word morphology.

The so obtained decision list consists of a number of exceptions, each defining the segmentation of a sigle word, and a set of rules. It can be observed that, in general, the exceptions in the decision list correspond to words which have not been segmented by the GA in an optimal way w.r.t. the segmentation bias. We show that a better segmentation can be obtained if exceptions are removed from the decision list and only rules are used for the segmentation of the input list of words.

The hybrid approach to word segmentation is an efficient combination of GA and ILP. The former provides a first approximation of the concept learned and reduce dramatically the search space. The latter, on its turn, refines the target concept reformulating it as an ordered list of rules applicable on isolated words, i.e. these rules do not require other words to provide the segmentation context needed by the GA. The results of the tests show that a set of rules for word segmentation can be learned from a limited amount of unanotated words, as opposed to other approaches where very large and/or tagged corpora are used.


  1. Bescherelle 1. Hatier-Paris, 1980.
  2. Dimitar Kazakov. Unsupervised learning of naïve morphology with genetic algorithms. In W. Daelemans, A. van den Bosch, and A. Weijters, editors, Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language Processing Tasks, pages 105-112, Prague, April 1997.
  3. Dimitar Kazakov and Suresh Manandhar. Unsupervised learning of word segmentation rules with genetic algorithms and inductive logic programming. Submitted to Machine Learning Journal.
  4. Suresh Manandhar, Saso Dzeroski, and Tomaz Erjavec. Learning Multilingual Morphology with CLOG. In The Eighth International Conference on Inductive Logic Programming (ILP'98), Madison, Wisconsin, USA, 1998.
  5. Raymond J. Mooney and Mary Elaine Califf. Induction of first-order decision lists: Results on learning the past tense of English verbs. Journal of Artificial Intelligence Research, June 1995.

back to index