Diterpene structure elucidation (GMD + LAI)

Structure elucidation of compounds isolated from plants, fungi, bacteria or other organisms is a common problem in natural product chemistry. Several spectroscopic methods provide information about chemical structures, chiefly nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry. Interpreting these spectra normally requires specialists with detailed spectroscopic knowledge and extensive experience in natural product chemistry.

In ordinary 13C-NMR spectra only the resonance frequency is observed. The absolute value of the resonance frequency provides information about chemical structure and (important for prediction and database maintenance purposes) is comparatively insensitive to varying experimental conditions. Additional measurements can be used to determine the number of hydrogens directly connected to a particular carbon atom. This number is the so-called multiplicity of the signal: s stands for a singlet, which means that no proton (i.e., hydrogen) is connected to the carbon; d stands for a doublet, with one proton connected to the carbon; t stands for a triplet, with two protons; and q stands for a quartet, with three protons bound to the carbon atom.
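The multiplicity codes described above amount to a small lookup table. The following is a minimal Python sketch of that mapping; the function names are hypothetical and are not part of the dataset or of any program mentioned here.

```python
# Multiplicity code -> number of hydrogens directly bonded to the carbon,
# following the s/d/t/q convention described in the text.
HYDROGENS_BY_MULTIPLICITY = {
    "s": 0,  # singlet: no proton on the carbon
    "d": 1,  # doublet: one proton
    "t": 2,  # triplet: two protons
    "q": 3,  # quartet: three protons
}


def attached_hydrogens(multiplicity):
    """Return the number of hydrogens implied by one multiplicity code."""
    return HYDROGENS_BY_MULTIPLICITY[multiplicity.lower()]


def total_hydrogens(multiplicities):
    """Sum the carbon-bound hydrogens over all signals of a spectrum."""
    return sum(attached_hydrogens(m) for m in multiplicities)
```

For example, a (made-up) list of four signals with codes s, d, t and q accounts for 0 + 1 + 2 + 3 = 6 carbon-bound hydrogens.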

Application domain: Diterpene structure elucidation

Further specification: ILP96 publication, available on-line

Pointers: ftp://ftp.gmd.de/ml-archive/GMD/papers/ML76a.ps.gz

Data complexity: about 40,000 facts

Data format: not available

References: (Dzeroski et al. 1996)

The data (GMD)

The dataset used in the ILP project is concerned with diterpenes, one of a few fundamental classes of natural products, with about 5000 known members. The skeleton of every diterpene contains 20 carbon atoms. Sometimes additional groups are linked to the diterpene skeleton by an oxygen atom, which can raise the carbon atom count above 20 per diterpene. About 200 different diterpene skeletons are known so far, but some of them are represented by only one member compound. Most diterpenes belong to one of 20 common skeleton types.

Structure elucidation of diterpenes from 13C-NMR spectra can be separated into three main stages:

1. identification of residues (esters and/or glycosides),
2. identification of the diterpene skeleton, and
3. arrangement of the residues on the skeleton.

The data that were collected are concerned with the second stage, the identification of the skeleton. A skeleton is a unique connection of carbon atoms each with a specific atom number and, normalized to a pure skeleton molecule without residues, a certain multiplicity (s, d, t or q).

More precisely, the task is to identify the skeleton (type) of diterpenoid compounds, given their 13C-NMR spectra, which include the multiplicities and the frequencies of the skeleton atoms. This task is usually done manually by human experts with specialized background knowledge of peak patterns and chemical structures. In the process, each of the 20 skeletal atoms is assigned an atom number that corresponds to its proper place in the skeleton, and the diterpene is classified into one of the possible skeleton types. Whenever the atom numbers are given, the resulting problem is of a propositional nature. However, as the assignment of atom numbers is a difficult process in itself, and possibly inseparable from the classification process, the real problem to be solved is inherently an ILP problem.
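The difference between the two formulations can be illustrated with a small Python sketch. It is an illustration under assumptions, not the encoding used in the experiments: the peak values are invented, only three of the 20 atoms are shown, and the function names are hypothetical. With atom numbers, each compound becomes a fixed-length row with one slot per atom number; without them, all that remains is an unordered collection of (multiplicity, frequency) facts per compound.

```python
# Invented peaks for one compound: (atom_number, multiplicity, frequency).
# Only 3 of the 20 skeleton atoms are shown to keep the example short.
peaks = [(1, "t", 39.0), (2, "t", 19.1), (3, "d", 78.5)]


def propositional_row(peaks, atom_numbers):
    """Fixed-length row: one (multiplicity, frequency) slot per atom number."""
    by_number = {n: (m, f) for n, m, f in peaks}
    row = []
    for n in atom_numbers:
        m, f = by_number[n]
        row += [m, f]
    return row


def relational_facts(compound_id, peaks):
    """Facts without atom numbers; sorting only makes the output stable."""
    return sorted((compound_id, m, f) for _, m, f in peaks)


print(propositional_row(peaks, [1, 2, 3]))
print(relational_facts("c1", peaks))
```

In the first form a learner can refer to "the frequency of atom 3" as an ordinary attribute. In the second form no such attribute exists, so the learner has to relate peaks to each other, which is what makes the problem relational.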

The collected data contain information on 1503 diterpenes with known structure, stored in three relations: atom, bond, and nmr. The atom relation specifies to which element an atom in a given compound belongs. The bond relation specifies which atoms in a given compound are bound, and in what way. The nmr relation stores the measured 13C-NMR spectra: for each of the 20 carbon atoms in the diterpene skeleton, it contains the atom number, its multiplicity and its frequency. Additional unary predicates describe the classes to which each compound belongs (25 classes).
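Since the data format is not available, the exact argument lists of the three relations are unknown. The Python sketch below shows one plausible way to hold such facts; all field names, the argument order, and the sample values are assumptions made for illustration only.

```python
from typing import NamedTuple


class Atom(NamedTuple):      # assumed fields: which element an atom is
    compound: str
    atom_id: int
    element: str


class Bond(NamedTuple):      # assumed fields: which atoms are bound, and how
    compound: str
    atom1: int
    atom2: int
    bond_type: str


class Nmr(NamedTuple):       # assumed fields: one measured skeleton signal
    compound: str
    atom_number: int
    multiplicity: str
    frequency: float


def neighbours(bonds, compound, atom_id):
    """Atoms bonded to atom_id in the given compound (bonds are undirected)."""
    found = set()
    for b in bonds:
        if b.compound != compound:
            continue
        if b.atom1 == atom_id:
            found.add(b.atom2)
        elif b.atom2 == atom_id:
            found.add(b.atom1)
    return sorted(found)


def spectrum(nmr_facts, compound):
    """Signals of one compound as (atom_number, multiplicity, frequency)."""
    return sorted((f.atom_number, f.multiplicity, f.frequency)
                  for f in nmr_facts if f.compound == compound)


# Invented sample facts for a hypothetical compound "c1".
bonds = [Bond("c1", 1, 2, "single"), Bond("c1", 2, 3, "single")]
nmr_facts = [Nmr("c1", 2, "t", 19.1), Nmr("c1", 1, "t", 39.0)]
print(neighbours(bonds, "c1", 2))
print(spectrum(nmr_facts, "c1"))
```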

The experiments (GMD+LAI)

This description forms part of a larger set of experiments that were conducted at GMD in collaboration with LAI. The experiments are based on the data described above. For these experiments, both propositional versions and ILP-versions of the problem were used. The propositional variant contains the manually assigned atom numbers for each atom of a compound, thus containing an important part of the solution already. The ILP variant of the problem did not contain this information. Experiments were conducted using C4.5, FOIL and RIBL.

In the experiments with RIBL, the same formulations as in the experiments with FOIL were used. Because it stores all training instances, RIBL achieved 100% accuracy when the entire data set was used for both training and testing. To estimate performance on unseen cases, 10-fold cross-validations were performed. The results (with comparative figures for the other programs) were as follows.

Given only the red relation, i.e., only relational information without hand-coded propositional information, RIBL achieved a very high accuracy that improved on the accuracy of FOIL and even on that of C4.5 with the propositional information available. When provided with this information in addition to red, the performance of RIBL increased even further. This accuracy figure is within the range at which human experts perform this classification task. Further experiments will be directed at extracting explicit rule information from these results to improve understandability and explainability.
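Why a learner that stores all training instances reaches 100% accuracy on its own training data, and why cross-validation is needed to estimate accuracy on unseen cases, can be sketched in Python as follows. This is not RIBL: the sketch uses a plain 1-nearest-neighbour rule with squared Euclidean distance on numeric feature tuples as a stand-in for RIBL's relational similarity measure, and the toy data are invented.

```python
import random


def nearest_neighbour_predict(train, query):
    """Return the label of the stored instance closest to the query."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda ex: dist(ex[0], query))[1]


def accuracy(train, test):
    """Fraction of test examples whose predicted label is correct."""
    hits = sum(nearest_neighbour_predict(train, x) == y for x, y in test)
    return hits / len(test)


def k_fold_accuracy(data, k=10, seed=0):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(accuracy(train, test))
    return sum(scores) / k


# Invented toy data: 40 distinct one-dimensional points with alternating labels.
data = [((float(i),), i % 2) for i in range(40)]
print("training-set accuracy:", accuracy(data, data))
print("10-fold cross-validated accuracy:", k_fold_accuracy(data, k=10))
```

When training and test sets coincide, every query finds itself at distance zero, so the training-set accuracy is 1.0 regardless of how well the method generalizes. The cross-validated figure removes each test example from the stored instances before it is classified.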

References

S. Dzeroski, S. Schulze-Kremer, K. Heidtke, K. Siems, and D. Wettschereck. Applying ILP to diterpene structure elucidation from 13C NMR spectra. Proc. 6th International Workshop on Inductive Logic Programming (Stockholm, August 1996), 59-75. DSV, Stockholm University, Report No. 96-019.
