Application of ILP methods to semantic knowledge acquisition (LRI)
| Application domain:
||Learning ontologies and case frames from
||Cooking Recipes in French
| Dataset size:
||3.4 MB, 93 838 sentences
| Data format:
| Systems Used:
The task is to acquire case frames and an ontology from the
syntactic analysis of specification texts. Many applications require
case frames and ontologies. Dassault Aviation, the company we are
collaborating with, is mainly interested in checking the semantic
consistency of texts and in terminology acquisition for
standardization and indexing purposes. In the long term, the
collaboration should lead to the acquisition of semantic models of
texts; acquiring case frames and ontologies is the first step in this
direction.
The recipes were collected from the web. The examples from which case
frames and the ontology are learned are the syntactic analyses of the
sentences. The syntactic analysis is performed by SYLEX (Ingenia).
Noise comes from typing errors and analysis errors.
The dataset contains 93 838 sentences (3.4 MB). An error file listing
spelling mistakes and a file listing words to be ignored during
learning are also provided.
There is one example per sentence.
Verb(ex1,pocher), subject(ex1,-), object (ex1,oeuf),
There is no explicit distinction between sentences extracted from
different texts. Examples are identified by a unique number (ex1).
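As an illustration, an example line in this format could be read as follows. This is a minimal sketch, not part of the distributed data or tools: the function name `parse_example`, the regular expression, and the reading of `-` as an unfilled slot are all assumptions made for the illustration.

```python
import re

# Hypothetical pattern for one "relation(example_id,word)" term. It tolerates
# a space before the parenthesis, as in "object (ex1,oeuf)".
TERM = re.compile(r"(\w+)\s*\(\s*(\w+)\s*,\s*([^)\s]+)\s*\)")

def parse_example(line):
    """Return (example_id, {relation: word}) for one example line.

    A "-" filler is stored as None (an assumption about the format).
    Relation names are lowercased so "Verb" and "verb" are treated alike.
    """
    example_id = None
    slots = {}
    for relation, ex_id, word in TERM.findall(line):
        if example_id is None:
            example_id = ex_id
        elif ex_id != example_id:
            raise ValueError(f"mixed example ids: {example_id} and {ex_id}")
        slots[relation.lower()] = None if word == "-" else word
    return example_id, slots

example_id, slots = parse_example(
    "Verb(ex1,pocher), subject(ex1,-), object (ex1,oeuf),"
)
print(example_id, slots)
```

Because each term carries its example identifier, terms can be grouped per sentence even though the file does not mark which text a sentence came from.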
ASIUM is a system based on clustering and cooperative methods.
Experiments with ASIUM (LRI)
The experiment was run on the whole dataset, i.e. 93 838 sentences.
The syntactic analysis with SYLEX takes 2 hours on a Sun
SPARCstation. The similarity threshold for clusters was set to
0.4. ASIUM generated an ontology of 101 concepts, hierarchically
organized as follows:
|Level in the ontology
||Number of concepts
In parallel, ASIUM learned a base of 600 case frames in which the
semantic features are filled by ontology concepts.
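To make the role of a similarity threshold concrete, here is a minimal sketch of threshold-based agglomerative clustering. It is not ASIUM's algorithm: Jaccard similarity is used as a stand-in measure, the merge rule (merge while the best pair reaches the threshold) is an assumption, and the three verb slots and their words are toy data invented for the illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets (a stand-in measure, not ASIUM's)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(word_sets, threshold=0.4):
    """Greedy agglomerative clustering of (verb, role) slots.

    word_sets maps a slot to the set of head words observed in it.
    The most similar pair of clusters is merged repeatedly until no
    pair reaches the threshold. Returns a list of (slots, words) pairs.
    """
    clusters = [({slot}, set(words)) for slot, words in word_sets.items()]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), jaccard(clusters[i][1], clusters[j][1]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        merged = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Toy data, invented for illustration only (not taken from the dataset).
observed = {
    ("pocher", "object"): {"oeuf", "poisson"},
    ("cuire", "object"): {"oeuf", "poisson", "viande"},
    ("couper", "object"): {"oignon", "carotte"},
}
clusters = cluster(observed, threshold=0.4)
for slots, words in clusters:
    print(sorted(slots), sorted(words))
```

In this sketch each merged word set plays the role of a candidate concept, and the slots attached to it indicate which verb arguments that concept could fill in a case frame.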
The concepts learned have been validated and labeled by us. This first
validation shows that all 101 concepts were relevant to the
domain. Further validation should test the case frames on new texts.