Application domain: | Diterpene structure elucidation |
Source: | GMD |
Dataset size: | around 40,000 facts |
Data format: | not available publicly |
Systems Used: | FOIL, RIBL, ICL, TILDE |
Pointers: | steffen@chemie.fu-berlin.de |
References: | (Dzeroski et al. 1997, 1998) |
The dataset used in the ILP2 project concerns diterpenes, one of a few fundamental classes of natural products, with about 5000 known members. The skeleton of every diterpene contains 20 carbon atoms. Sometimes additional groups are linked to the diterpene skeleton by an oxygen atom, which can raise the carbon atom count above 20 per diterpene. About 200 different diterpene skeletons are known so far, but some of them are represented by only one member compound. Most diterpenes belong to one of 20 common skeleton types.
Structure elucidation of diterpenes from 13C-NMR-Spectra can be separated into three main stages:
More precisely, the task is to identify the skeleton (type) of diterpenoid compounds, given their 13C-NMR-Spectra that include the multiplicities and the frequencies of the skeleton atoms. This task is usually done manually by human experts with specialized background knowledge on peak patterns and chemical structures. In the process, each of the 20 skeletal atoms is assigned an atom number that corresponds to its proper place in the skeleton and the diterpene is classified into one of the possible skeleton types. Whenever the atom numbers are given, the resulting problem is of a propositional nature. However, as the assignment of atom numbers is a difficult process in itself, and possibly indistinguishable from the classification process, the real problem to be solved is inherently an ILP problem.
The collected data contain information on 1503 diterpenes with known structure, stored in three relations: atom, bond, and nmr. The atom relation specifies the element to which each atom in a given compound belongs. The bond relation specifies which atoms in a given compound are bonded, and in what way. The nmr relation stores the measured 13C-NMR-Spectra: for each of the 20 carbon atoms in the diterpene skeleton, it contains the atom number, its multiplicity and its frequency. Additional unary predicates describe the classes to which each compound belongs (25 classes).
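As an illustration of the three relations described above, here is a minimal Python sketch. The dataset itself is not publicly available, so the field names, the compound identifier `c1`, the bond labels, and all values are hypothetical and made up for the example; only the overall shape (atom, bond, and nmr facts per compound) follows the description.

```python
from typing import NamedTuple

# Hypothetical record layouts; the real argument order and names are not given in the text.
class Atom(NamedTuple):
    compound: str      # compound identifier
    atom_id: int       # atom identifier within the compound
    element: str       # element the atom belongs to

class Bond(NamedTuple):
    compound: str
    atom1: int
    atom2: int
    kind: str          # "in what way" the atoms are bonded (label is an assumption)

class Nmr(NamedTuple):
    compound: str
    atom_number: int   # position of the carbon atom in the skeleton
    multiplicity: str
    frequency: float

# Made-up facts for one made-up compound, for illustration only.
atoms = [Atom("c1", 1, "c"), Atom("c1", 2, "c"), Atom("c1", 21, "o")]
bonds = [Bond("c1", 1, 2, "single"), Bond("c1", 2, 21, "single")]
nmr = [Nmr("c1", 2, "d", 71.5), Nmr("c1", 1, "t", 39.0)]

def spectrum(compound, nmr_facts):
    """Return (atom_number, multiplicity, frequency) triples for one compound,
    sorted by atom number."""
    return sorted(
        (f.atom_number, f.multiplicity, f.frequency)
        for f in nmr_facts
        if f.compound == compound
    )

print(spectrum("c1", nmr))
```

Keeping one fact per row, keyed by compound, mirrors the relational setting: a compound is described by a variable number of related facts rather than by a single fixed-length feature vector.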
In addition to FOIL and RIBL, TILDE and ICL were applied to the problem of classifying diterpene NMR spectra. Without using atom numbers, they achieved accuracies of 90.4% and 86.0%, respectively. These are comparable to the best result achieved so far (91.2% by RIBL), with the additional advantage of producing explicit symbolic classification rules. A trade-off between accuracy and comprehensibility / complexity is possible when using TILDE: larger (less pruned) trees are more accurate but less comprehensible, while smaller (more pruned) trees are less accurate but easier to understand.
The accuracies of the different ILP approaches on the three different formulations of the problem are given below. red refers to the relational representation, prop to a propositional version obtained by counting atoms of the four multiplicities, and red+prop to the version where both kinds of information are available.
Problem/System | FOIL | RIBL | TILDE | ICL |
red | 46.5% | 86.5% | 81.6% | 65.3% |
prop | 70.1% | 79.0% | 78.5% | 79.1% |
red+prop | 78.3% | 91.2% | 90.4% | 86.0% |
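The prop formulation is described above as counting the atoms of each of the four multiplicities. A minimal Python sketch of that counting step follows. It assumes the multiplicities are coded as `s`, `d`, `t` and `q` (singlet, doublet, triplet, quartet), which is the usual 13C-NMR convention but is not spelled out in the text; the function name and the example peaks are hypothetical.

```python
from collections import Counter

# Assumed multiplicity codes: singlet, doublet, triplet, quartet.
MULTIPLICITIES = ("s", "d", "t", "q")

def count_multiplicities(peaks):
    """Turn a list of (multiplicity, frequency) peaks into a fixed-length
    feature dict holding one count per multiplicity."""
    counts = Counter(mult for mult, _freq in peaks)
    return {mult: counts.get(mult, 0) for mult in MULTIPLICITIES}

# Made-up peaks for illustration; a real spectrum would have 20 skeleton peaks.
example_peaks = [
    ("s", 38.1), ("d", 55.2), ("t", 18.9),
    ("t", 41.7), ("q", 21.5), ("q", 33.3),
]

features = count_multiplicities(example_peaks)
print(features)
```

The counts discard the frequencies, so this representation has a fixed length for every compound and needs no atom numbers; the frequency information remains available only in the relational representation.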