Application domain: Predicting biodegradability of chemical compounds
We used a database of 328 structurally diverse and widely used (commercial) chemicals described in a handbook of degradation rates. Complete data on the structure of the chemicals (in SMILES notation) were available, as well as data on the overall, biotic and abiotic degradation rates in four environmental compartments (soil, air, surface water and ground water). We built models for biotic degradation in surface water, predicting the logarithm of the half-life of aqueous biodegradation. Half-lives were measured for some compounds and estimated by experts for others; in the latter case, an upper and a lower bound were given and we took the arithmetic mean of the two bounds as the half-life.
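The construction of the prediction target described above can be illustrated with a minimal sketch. The function name `log_half_life`, the use of hours as the unit, and the base-10 logarithm are assumptions made for illustration; the text does not specify the unit or the base of the logarithm.

```python
import math


def log_half_life(measured=None, lower=None, upper=None):
    """Return the logarithm of a compound's half-life.

    Hypothetical helper: the unit (hours) and the base-10 logarithm are
    assumptions, not taken from the source.
    """
    if measured is not None:
        # A measured half-life is used as given.
        half_life = measured
    elif lower is not None and upper is not None:
        # Expert estimate: arithmetic mean of the lower and upper bounds.
        half_life = (lower + upper) / 2.0
    else:
        raise ValueError("need a measured value or both bounds")
    if half_life <= 0:
        raise ValueError("half-life must be positive")
    return math.log10(half_life)


# Measured value versus an expert-estimated range.
target_measured = log_half_life(measured=100.0)
target_estimated = log_half_life(lower=24.0, upper=168.0)
```

Note that the mean is taken over the bounds themselves and the logarithm is applied afterwards, which follows the order in which the text describes the two steps.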
We used several propositional and inductive logic programming (ILP) methods for decision tree, regression tree and rule induction. In addition to a few global features, such as molecular weight, the main information used for learning was the data on the structure of compounds, i.e., the atoms within a molecule and the connections/bonds between them. Domain knowledge about a variety of functional groups and substructures was also used. ILP systems use the structural data directly, while propositional systems use features derived from it, which represent a compound's structure only approximately. Several of the derived models perform better than a state-of-the-art biodegradability prediction system based on linear regression.
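The difference between the two representations can be sketched as follows. The relational description (atoms and bonds) is the kind of data an ILP system could use directly, while the derived feature vector is what a propositional learner would see. The example molecule (ethanol with explicit hydrogens), the identifiers, the chosen features and the crude hydroxyl check are all hypothetical illustrations, not the features or background knowledge used in the study.

```python
from collections import Counter

# Relational description of ethanol (CH3-CH2-OH), hydrogens explicit.
# Atom identifiers and the bond-type label are illustrative assumptions.
ATOMS = {
    "c1": "C", "c2": "C", "o1": "O",
    "h1": "H", "h2": "H", "h3": "H", "h4": "H", "h5": "H", "h6": "H",
}
BONDS = [
    ("c1", "c2", "single"), ("c2", "o1", "single"), ("o1", "h6", "single"),
    ("c1", "h1", "single"), ("c1", "h2", "single"), ("c1", "h3", "single"),
    ("c2", "h4", "single"), ("c2", "h5", "single"),
]

# Standard atomic masses (rounded) for the elements used here.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}


def neighbours(atom_id, bonds):
    """Return the identifiers of all atoms bonded to atom_id."""
    result = []
    for a, b, _bond_type in bonds:
        if a == atom_id:
            result.append(b)
        elif b == atom_id:
            result.append(a)
    return result


def has_hydroxyl(atoms, bonds):
    """Crude functional-group check: an oxygen bonded to both H and C."""
    for atom_id, element in atoms.items():
        if element == "O":
            bonded = {atoms[n] for n in neighbours(atom_id, bonds)}
            if "H" in bonded and "C" in bonded:
                return True
    return False


def propositional_features(atoms, bonds):
    """Flatten the relational description into a fixed-length feature dict."""
    element_counts = Counter(atoms.values())
    bond_counts = Counter(bond_type for _a, _b, bond_type in bonds)
    return {
        "mol_weight": sum(ATOMIC_MASS[e] for e in atoms.values()),
        "n_C": element_counts["C"],
        "n_O": element_counts["O"],
        "n_single_bonds": bond_counts["single"],
        "has_hydroxyl": has_hydroxyl(atoms, bonds),
    }


features = propositional_features(ATOMS, BONDS)
```

The feature dictionary records counts and the presence of a group but not which atom is connected to which, which is the sense in which such features capture the structure only approximately.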