Diterpene structure elucidation (GMD + LAI + KUL)

Application domain: Diterpene structure elucidation

Source: GMD

Dataset size: around 40,000 facts

Data format: not available publicly

Systems Used: FOIL, RIBL, ICL, TILDE

Pointers: steffen@chemie.fu-berlin.de

References: (Dzeroski et al. 1997, 1998)

The data (GMD)

The dataset used in the ILP2 project concerns diterpenes, one of a few fundamental classes of natural products, with about 5000 known members. The skeleton of every diterpene contains 20 carbon atoms. Sometimes additional groups are linked to the diterpene skeleton by an oxygen atom, which can raise the carbon atom count above 20 per diterpene. About 200 different diterpene skeletons are known so far, but some of them are represented by only one member compound. Most diterpenes belong to one of 20 common skeleton types.

Structure elucidation of diterpenes from ¹³C NMR spectra can be separated into three main stages:

1. identification of residues (esters and/or glycosides),
2. identification of the diterpene skeleton, and
3. arrangement of the residues on the skeleton.

The collected data concern the second stage, the identification of the skeleton. A skeleton is a unique connection of carbon atoms, each with a specific atom number and, normalized to a pure skeleton molecule without residues, a certain multiplicity (s, d, t or q, i.e. singlet, doublet, triplet or quartet).

More precisely, the task is to identify the skeleton (type) of diterpenoid compounds, given their ¹³C NMR spectra, which include the multiplicities and the frequencies of the skeleton atoms. This task is usually done manually by human experts with specialized background knowledge on peak patterns and chemical structures. In the process, each of the 20 skeletal atoms is assigned an atom number that corresponds to its proper place in the skeleton, and the diterpene is classified into one of the possible skeleton types. When the atom numbers are given, the resulting problem is propositional in nature. However, as the assignment of atom numbers is a difficult process in itself, and possibly inseparable from the classification process, the real problem to be solved is inherently an ILP problem.

The collected data contain information on 1503 diterpenes with known structure, stored in three relations: atom, bond, and nmr. The atom relation specifies the element to which each atom in a given compound belongs. The bond relation specifies which atoms in a given compound are bonded and in what way. The nmr relation stores the measured ¹³C NMR spectra: for each of the 20 carbon atoms in the diterpene skeleton, it contains the atom number, its multiplicity and its frequency. Additional unary predicates describe the classes to which each compound belongs (25 classes).
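As a rough illustration of this relational layout, the three relations can be pictured as collections of tuples keyed by a compound identifier. The sketch below is in Python; the field names, the compound identifier `c1`, the bond-type encoding, the ppm unit and all sample values are hypothetical, since the actual data format is not publicly available.

```python
from typing import NamedTuple

# Hypothetical schemas: the real argument order and names are not documented here.
class Atom(NamedTuple):
    compound: str
    atom_id: str
    element: str        # e.g. "c" for carbon, "o" for oxygen

class Bond(NamedTuple):
    compound: str
    atom1: str
    atom2: str
    bond_type: int      # assumed encoding: 1 = single, 2 = double

class Nmr(NamedTuple):
    compound: str
    atom_number: int    # position in the skeleton (1..20)
    multiplicity: str   # one of "s", "d", "t", "q"
    frequency: float    # assumed here to be a chemical shift in ppm

# Made-up facts for one compound; a real compound would have 20 nmr facts.
atoms = [Atom("c1", "a1", "c"), Atom("c1", "a2", "c"), Atom("c1", "a3", "o")]
bonds = [Bond("c1", "a1", "a2", 1), Bond("c1", "a2", "a3", 2)]
nmr = [Nmr("c1", 1, "t", 39.0), Nmr("c1", 2, "t", 19.5), Nmr("c1", 3, "d", 42.2)]

def peaks_of(compound: str, facts: list[Nmr]) -> list[tuple[str, float]]:
    """Return (multiplicity, frequency) pairs for one compound, dropping atom numbers."""
    return [(f.multiplicity, f.frequency) for f in facts if f.compound == compound]
```

Dropping the atom number, as `peaks_of` does, mimics the harder setting in which a learner sees only the peaks of a spectrum and not their positions in the skeleton.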

The experiments (GMD+LAI+KUL)

In addition to FOIL and RIBL, TILDE and ICL were applied to the problem of classifying diterpene NMR spectra. Without using atom numbers, they achieved accuracies of 90.4% and 86.0%, respectively. These are comparable to the best result achieved so far (91.2% by RIBL), with the additional advantage of producing explicit symbolic classification rules. A trade-off between accuracy and comprehensibility / complexity is possible when using TILDE: larger (less pruned) trees are more accurate but less comprehensible, while smaller (more pruned) trees are less accurate but easier to understand.

The accuracies of the different ILP approaches on the three formulations of the problem are given below. Here red refers to the relational representation, prop to a propositional version obtained by counting atoms of each of the four multiplicities, and red+prop to the version where both kinds of information are available.
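The prop formulation can be sketched as follows: each compound's spectrum is reduced to four counts, one per multiplicity. This is a minimal Python illustration with made-up peak values; the function name `count_multiplicities` and the input layout are assumptions, not the representation used in the original experiments.

```python
from collections import Counter

MULTIPLICITIES = ("s", "d", "t", "q")

def count_multiplicities(peaks):
    """Map a list of (multiplicity, frequency) peaks to four counts in s, d, t, q order."""
    counts = Counter(m for m, _ in peaks)
    # Counter returns 0 for multiplicities that never occur.
    return tuple(counts[m] for m in MULTIPLICITIES)

# Made-up spectrum with 20 skeleton peaks (frequencies are illustrative only).
example_peaks = (
    [("s", 33.0)] * 4 + [("d", 55.0)] * 3 + [("t", 20.0)] * 9 + [("q", 15.0)] * 4
)

features = count_multiplicities(example_peaks)
print(features)  # (4, 3, 9, 4)
```

The resulting fixed-length vector discards the frequencies and the connectivity entirely, which is why it can be fed to a propositional learner.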

Problem/System   FOIL    RIBL    TILDE   ICL
red              46.5%   86.5%   81.6%   65.3%
prop             70.1%   79.0%   78.5%   79.1%
red+prop         78.3%   91.2%   90.4%   86.0%
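To make the comparison easier to read, the short Python sketch below picks the most accurate system for each formulation. The numbers are copied from the table above; the dictionary layout and the helper name `best_system` are illustrative choices only.

```python
# Accuracies (%) copied from the table above.
accuracy = {
    "red":      {"FOIL": 46.5, "RIBL": 86.5, "TILDE": 81.6, "ICL": 65.3},
    "prop":     {"FOIL": 70.1, "RIBL": 79.0, "TILDE": 78.5, "ICL": 79.1},
    "red+prop": {"FOIL": 78.3, "RIBL": 91.2, "TILDE": 90.4, "ICL": 86.0},
}

def best_system(formulation):
    """Return (system, accuracy) for the most accurate system on one formulation."""
    row = accuracy[formulation]
    name = max(row, key=row.get)
    return name, row[name]

for formulation in accuracy:
    print(formulation, best_system(formulation))
```

RIBL comes out on top for red and red+prop, while ICL is marginally ahead on prop, and every system does best when both kinds of information are available.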

References

S. Dzeroski, S. Schulze-Kremer, K. Heidtke, K. Siems, and D. Wettschereck. Diterpene structure elucidation from ¹³C NMR spectra with machine learning. In N. Lavrac, E. Keravnou and B. Zupan, editors, Intelligent Data Analysis in Medicine and Pharmacology, pages 207-225, Kluwer Academic Publishers, Dordrecht, 1997.

S. Dzeroski, S. Schulze-Kremer, K. Heidtke, K. Siems, D. Wettschereck, and H. Blockeel. Diterpene structure elucidation from ¹³C NMR spectra with inductive logic programming. Applied Artificial Intelligence, 1998.
