In ordinary 13C-NMR spectra
only the resonance frequency is observed. The absolute value of the resonance
frequency provides information about chemical structure and (important
for prediction and database maintenance purposes) is comparatively insensitive
to varying experimental conditions. Additional measurements can be used
to determine the number of hydrogens directly connected to a particular
carbon atom. This number is encoded as the so-called multiplicity of the signal: s
stands for a singlet, which means that no proton (i.e., hydrogen) is connected
to the carbon; d stands for a doublet, with one proton connected
to the carbon; t stands for a triplet, with two protons; and q stands
for a quartet, with three protons bound to the carbon atom.
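As a minimal sketch, the multiplicity codes described above can be read as a lookup table from code to hydrogen count. The names `HYDROGENS_BY_MULTIPLICITY`, `attached_hydrogens` and `total_hydrogens` are illustrative assumptions and are not part of the data set.

```python
# Multiplicity code -> number of hydrogens directly bonded to the carbon,
# following the s/d/t/q convention described in the text.
HYDROGENS_BY_MULTIPLICITY = {
    "s": 0,  # singlet: no proton on the carbon
    "d": 1,  # doublet: one proton
    "t": 2,  # triplet: two protons
    "q": 3,  # quartet: three protons
}


def attached_hydrogens(multiplicity: str) -> int:
    """Return the hydrogen count implied by a single multiplicity code."""
    return HYDROGENS_BY_MULTIPLICITY[multiplicity.lower()]


def total_hydrogens(multiplicities) -> int:
    """Sum the carbon-bound hydrogens over a sequence of multiplicity codes."""
    return sum(attached_hydrogens(m) for m in multiplicities)


if __name__ == "__main__":
    print(attached_hydrogens("q"))
    print(total_hydrogens(["s", "d", "t", "q"]))
```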
Application domain: | Diterpene structure elucidation |
Further specification: | ILP96 publication, available on-line |
Pointers: | ftp://ftp.gmd.de/ml-archive/GMD/papers/ML76a.ps.gz |
Data complexity: | about 40,000 facts |
Data format: | not available |
References: | (Dzeroski et al. 1996) |
Structure elucidation of diterpenes from 13C-NMR spectra
can be separated into three main stages:
More precisely, the task is to identify the skeleton (type)
of diterpenoid compounds, given their 13C-NMR spectra,
which include the multiplicities and the frequencies of the skeleton atoms.
This task is usually done manually by human experts with specialized background
knowledge on peak patterns and chemical structures. In the process, each
of the 20 skeletal atoms is assigned an atom number that corresponds to
its proper place in the skeleton and the diterpene is classified into one
of the possible skeleton types. Whenever the atom numbers are given, the
resulting problem is of a propositional nature. However, as the assignment
of atom numbers is a difficult process in itself, and possibly indistinguishable
from the classification process, the real problem to be solved is inherently
an ILP problem.
The collected data contain information on 1503 diterpenes
with known structure, stored in three relations atom, bond,
and nmr. The first relation specifies to which element an atom
in a given compound belongs. The second relation specifies which atoms
are bound and in what way in a given compound. The nmr relation
stores the measured 13C-NMR spectra.
For each of the 20 carbon atoms in the diterpene skeleton, it contains
the atom number, its multiplicity and frequency. Additional unary predicates
describe the classes to which each compound belongs (25 classes).
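The three relations can be pictured as typed records. The sketch below is only an illustration: the data format is listed as not available, so the argument order, the field names (`compound`, `atom_id`, `bond_type`, and so on), the helper `spectrum` and all values are assumptions invented for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Atom:
    """atom relation: which element an atom of a compound belongs to."""
    compound: str
    atom_id: int
    element: str


@dataclass(frozen=True)
class Bond:
    """bond relation: which two atoms are bound, and in what way."""
    compound: str
    atom_a: int
    atom_b: int
    bond_type: str


@dataclass(frozen=True)
class Nmr:
    """nmr relation: atom number, multiplicity and frequency of one peak."""
    compound: str
    atom_number: int
    multiplicity: str
    frequency: float


def spectrum(nmr_facts: Iterable[Nmr], compound: str) -> List[Nmr]:
    """Collect the peaks of one compound, ordered by atom number."""
    peaks = (f for f in nmr_facts if f.compound == compound)
    return sorted(peaks, key=lambda f: f.atom_number)


# Invented example facts; these are not taken from the real data set.
facts = [
    Nmr("c1", 2, "t", 19.1),
    Nmr("c1", 1, "t", 39.0),
    Nmr("c2", 1, "d", 55.3),
]

if __name__ == "__main__":
    for peak in spectrum(facts, "c1"):
        print(peak)
```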
This description forms part of a larger set of experiments that were conducted at GMD in collaboration with LAI. The experiments are based on the data described above. For these experiments, both propositional versions and ILP-versions of the problem were used. The propositional variant contains the manually assigned atom numbers for each atom of a compound, thus containing an important part of the solution already. The ILP variant of the problem did not contain this information. Experiments were conducted using C4.5, FOIL and RIBL.
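The difference between the two problem variants can be sketched as two encodings of the same spectrum. This is a hedged illustration, not the encoding used in the experiments: the function names, the attribute names `mult_N` and `freq_N`, and the peak values are all assumptions.

```python
from typing import Dict, List, Tuple

# One peak as (atom_number, multiplicity, frequency); values are invented.
Peak = Tuple[int, str, float]


def propositional_row(peaks: List[Peak]) -> Dict[str, object]:
    """Fixed-position attributes: relies on the manually assigned atom numbers."""
    row: Dict[str, object] = {}
    for atom_number, multiplicity, frequency in peaks:
        row[f"mult_{atom_number}"] = multiplicity
        row[f"freq_{atom_number}"] = frequency
    return row


def relational_facts(peaks: List[Peak]) -> Tuple[Tuple[str, float], ...]:
    """Atom numbers dropped: an order-independent collection of peaks."""
    return tuple(sorted((m, f) for _, m, f in peaks))


peaks = [(1, "t", 39.0), (2, "t", 19.1), (3, "d", 55.3)]

if __name__ == "__main__":
    print(propositional_row(peaks))
    print(relational_facts(peaks))
```

In the second encoding a learner sees only which peaks occur, not where they sit in the skeleton, which is why the atom-number assignment becomes part of the problem to be solved.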
In the experiments with RIBL, the same formulations as in the experiments with FOIL were used. Because it stores all training instances, RIBL achieved 100% accuracy when the entire data set was used for both training and testing. To estimate performance on unseen cases, 10-fold cross-validation was performed. The results (with comparative figures for the other programs) were as follows.
Given only the red relation, i.e., only relational information without hand-coded propositional information, RIBL achieved very high accuracy, already improving on FOIL and even on C4.5 with propositional information available. When provided with this information in addition to red, the performance of RIBL increased even further. This accuracy figure is in the range of expert performance on this classification task. Further experiments will be directed at extracting explicit rules from these results to improve understandability and explainability.
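The gap between accuracy on the training data and cross-validated accuracy can be illustrated with a small sketch. It uses a plain 1-nearest-neighbour classifier on numeric vectors as a stand-in for an instance-based learner; it is not RIBL, which works on relational data, and the toy data below are randomly generated and unrelated to the diterpene set. All function names are assumptions.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[Tuple[float, ...], str]  # (feature vector, class label)


def nearest_label(train: Sequence[Example], x: Sequence[float]) -> str:
    """Label of the stored instance closest to x (squared Euclidean distance)."""
    closest = min(train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))
    return closest[1]


def accuracy(train: Sequence[Example], test: Sequence[Example]) -> float:
    """Fraction of test examples whose nearest stored instance has their label."""
    hits = sum(nearest_label(train, x) == y for x, y in test)
    return hits / len(test)


def cross_validate(data: Sequence[Example], k: int = 10, seed: int = 0) -> float:
    """Mean accuracy over k folds; each fold is held out exactly once."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    folds: List[List[Example]] = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(accuracy(train, held_out))
    return sum(scores) / k


# Toy data: two overlapping Gaussian clusters, 20 points each (invented).
rng = random.Random(1)
data: List[Example] = [
    ((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label)
    for label, centre in (("a", 0.0), ("b", 3.0))
    for _ in range(20)
]

if __name__ == "__main__":
    print("training-set accuracy:", accuracy(data, data))
    print("10-fold CV accuracy:  ", cross_validate(data))
```

When the test set equals the training set, every query finds itself at distance zero, so the first figure says nothing about unseen cases; only the held-out folds do.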