Application of ILP methods to semantic knowledge acquisition (LRI)

Application domain: Learning ontologies and case frames from specification texts
Source: Cooking Recipes in French
Dataset size: 3.4 MB, 93 838 sentences
Data format: Syntactic frames
Systems Used: ASIUM
Pointers: http://www.lri.fr/Francais/Recherche/ia/

Application domain

The task is to acquire case frames and an ontology from the syntactic analysis of specification texts. Many applications require case frames and ontologies. Dassault Aviation, the company we are collaborating with, is mainly interested in checking the semantic consistency of texts and in terminology acquisition for standardization and indexing purposes. In the long term, the collaboration should lead to the acquisition of semantic models of texts; acquiring case frames and ontologies is the first step in this direction.

Source

The recipes were collected from the web. The examples from which the case frames and the ontology are learned are the syntactic analyses of the sentences. The syntactic analysis is performed by SYLEX (Ingenia). Noise comes from typing errors and analysis errors.

Dataset size

The dataset contains 93 838 sentences (3.4 MB). An error file listing spelling mistakes and a file listing words to be ignored during learning are also provided.

Data format

There is one example per sentence.
Verb(ex1,pocher), subject(ex1,-), object(ex1,oeuf), ac(ex1,dans,eau).

There is no explicit distinction between sentences extracted from different texts. Examples are identified by a unique number (ex1).
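
As an illustration of the format above, the following minimal Python sketch reads one example line into a dictionary of roles. It is only a reading aid, not part of the dataset tooling: the function name parse_example is hypothetical, and treating a lone "-" as an unfilled slot is an assumption based on the single example shown.

```python
import re

# One predicate such as verb(ex1,pocher) or ac(ex1,dans,eau).
# Optional whitespace before the parenthesis is tolerated.
PREDICATE = re.compile(r"(\w+)\s*\(([^()]*)\)")


def parse_example(line):
    """Return (example_id, frame) for one line of syntactic-frame predicates.

    frame maps each lower-cased predicate name to a list of argument tuples,
    so that a role occurring several times in a sentence is not overwritten.
    """
    example_id = None
    frame = {}
    for name, args in PREDICATE.findall(line):
        parts = [part.strip() for part in args.split(",")]
        example_id = parts[0]            # first argument is the example identifier
        values = tuple(parts[1:])
        if values == ("-",):             # assumption: "-" marks an unfilled slot
            values = None
        frame.setdefault(name.lower(), []).append(values)
    return example_id, frame


example_id, frame = parse_example(
    "Verb(ex1,pocher), subject(ex1,-), object (ex1,oeuf), ac(ex1,dans,eau)."
)
print(example_id, frame)
```

Keeping every argument tuple in a list is a design choice for the sketch: a sentence may plausibly contain several complements of the same kind.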

Systems Used

ASIUM is a system based on clustering and cooperative methods.

Experiments with ASIUM (LRI)

The experiment was performed on the whole dataset, i.e., 93 838 sentences. The syntactic analysis with SYLEX takes 2 hours on a Sun SPARCstation. The similarity threshold for clusters was set to 0.4. ASIUM generated an ontology of 101 concepts, hierarchically organized as follows:

Level in the ontology   Number of concepts
1                       41
2                       25
3                       21
4                       14
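
To illustrate the role a similarity threshold plays in this kind of clustering, here is a minimal Python sketch. It is not ASIUM's algorithm: the Jaccard overlap used here is a stand-in for ASIUM's own similarity measure, the greedy merge produces flat clusters rather than a levelled hierarchy, and the word sets are invented for the example. Only the threshold value 0.4 comes from the experiment described above.

```python
def jaccard(a, b):
    """Overlap of two word sets: |a & b| / |a | b| (stand-in similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(word_sets, threshold=0.4):
    """Greedily merge the most similar pair until no pair reaches the threshold."""
    clusters = [frozenset(s) for s in word_sets]
    while len(clusters) > 1:
        best_score, best_pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = jaccard(clusters[i], clusters[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_pair is None or best_score < threshold:
            break                        # nothing left that is similar enough
        i, j = best_pair
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical word sets, e.g. head words seen in the same syntactic position.
word_sets = [
    {"oeuf", "poisson"},
    {"oeuf", "poisson", "poire"},
    {"four", "casserole"},
]
merged_clusters = cluster(word_sets, threshold=0.4)
strict_clusters = cluster(word_sets, threshold=0.7)
print(merged_clusters)
print(strict_clusters)
```

In this toy data the first two sets overlap by 2/3, so they merge at a threshold of 0.4 but stay separate at 0.7; a lower threshold gives fewer, broader clusters.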

In parallel, ASIUM learned a base of 600 case frames whose semantic features are filled by ontology concepts.
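
One possible way to picture a case frame whose slots are filled by ontology concepts is sketched below in Python. The representation is an assumption made for illustration, not ASIUM's internal format: the CaseFrame class, the matches function, the concept labels POACHABLE and LIQUID, and the word lists are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CaseFrame:
    """A verb with slots; each slot (role or preposition) names an ontology concept."""
    verb: str
    slots: dict = field(default_factory=dict)


def matches(case_frame, ontology, verb, fillers):
    """Check whether every filler word belongs to the concept its slot expects."""
    if verb != case_frame.verb:
        return False
    for slot, word in fillers.items():
        concept = case_frame.slots.get(slot)
        if word not in ontology.get(concept, set()):
            return False
    return True


# Hypothetical ontology fragment: concept label -> words it covers.
ontology = {
    "POACHABLE": {"oeuf", "poisson"},
    "LIQUID": {"eau", "lait"},
}
poach = CaseFrame("pocher", {"object": "POACHABLE", "dans": "LIQUID"})

accepted = matches(poach, ontology, "pocher", {"object": "oeuf", "dans": "eau"})
rejected = matches(poach, ontology, "pocher", {"object": "four", "dans": "eau"})
print(accepted, rejected)
```

A check of this kind is one way the learned case frames could be tested against sentences from new texts.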

The concepts learned have been validated and labeled by us. This first validation shows that all 101 concepts were relevant to the domain. Further validation should test the case frames on new texts.

