Application domain: | Prediction of rodent carcinogenicity bioassays |
Further specification: | Data set describing 330 chemicals with atom/bond facts |
Pointers: | http://www.gmd.de/ml-archive/ILP/public/data/chemical_cancer |
Data complexity: | 544 KB (ASCII) |
Data format: | Progol |
An understanding of the molecular mechanisms of chemical carcinogenesis is central to the prevention of many environmentally induced cancers. One approach is to form Structure-Activity Relationships (SARs) that empirically relate molecular structure to the ability to cause cancer. This work has been greatly advanced by the long-term carcinogenicity tests of compounds in rodents by the National Toxicology Program conducted by the National Institute of Environmental Health Sciences. These tests have resulted in a database of more than 300 chemicals that have been shown to be carcinogenic or otherwise. These compounds present a problem for many conventional SAR techniques because (1) they are structurally extremely diverse; and (2) many different molecular mechanisms are thought to be involved. Most conventional SAR methods are designed to deal with compounds having a common molecular template and similar molecular mechanisms of action. It is of interest to see if ILP programs, which do not make such assumptions, can obtain rules that predict accurately the causes for carcinogenic activity amongst these molecules.
The data are important as they represent a wider class of molecules than those considered previously by any ILP algorithm. Further, the volume of data is also greater than that attempted earlier (for example, in the mutagenicity experiments with Progol).
The compilation of 330 chemicals was taken from the collective database of the National Cancer Institute and National Toxicology Program (Huff and Haseman 1991). The compounds used were all the organic compounds that had completed NTP reports at the time of this work. Inorganic compounds were excluded because there were considered to be too few of them to allow meaningful generalisations. Of the 330 compounds, 182 (55%) are classified carcinogenic, and the remaining 148 non-carcinogenic. Carcinogenicity is determined by analysis of long-term rodent bioassays. Compounds classified by the NTP as equivocal are considered non-carcinogenic; this allows direct comparison with other predictive methods. No analysis was made of differences in incidence between rat and mouse cancer, or of the role of sex or particular organ sites.
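The class counts above fix the accuracy that a trivial "always predict the majority class" rule would reach, which is a useful reference point for any reported accuracy on this data set. The following is a minimal sketch of that arithmetic and of the labelling convention for equivocal compounds. The function names and the label strings are hypothetical and are not taken from the data files.

```python
# Counts stated in the text: 330 compounds, of which 182 are carcinogenic.
N_TOTAL = 330
N_CARCINOGENIC = 182


def to_binary_label(ntp_call: str) -> bool:
    """Map an NTP call to the binary class used in this data set.

    The label strings are assumed for illustration. Anything other than
    "carcinogenic" (including "equivocal") counts as non-carcinogenic,
    following the convention described in the text.
    """
    return ntp_call.strip().lower() == "carcinogenic"


def majority_baseline(n_positive: int, n_total: int) -> float:
    """Accuracy obtained by always predicting the more frequent class."""
    n_negative = n_total - n_positive
    return max(n_positive, n_negative) / n_total


n_non_carcinogenic = N_TOTAL - N_CARCINOGENIC
baseline = majority_baseline(N_CARCINOGENIC, N_TOTAL)

print(f"non-carcinogenic compounds: {n_non_carcinogenic}")
print(f"majority-class accuracy: {baseline:.1%}")
```

Under these counts, a classifier needs to exceed roughly 55% accuracy to improve on always answering "carcinogenic".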
Using the same background knowledge of generic chemical structures as that supplied for earlier experiments on predicting mutagenicity, SAR rules obtained with Progol were compared against other widely used SAR methods. Accuracies of 64% (train/test split) and 63% (cross-validation) were achieved by Progol. No SAR models (of either human or machine origin) were significantly more accurate than Progol. Progol's SAR was also the most accurate method that did not use data from biological tests on rodents (these data were not available). A set of structural alerts for carcinogenicity was generated automatically and the chemical rationale for them investigated. Unlike other SAR methods, the Progol alerts are statistically independent of those available from an existing carcinogenesis test based on Salmonella mutagenicity.
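To get a rough sense of the sampling uncertainty behind an accuracy figure of this size, one can compute a binomial standard error. The sketch below is an illustration only and does not reproduce the significance test used in the original comparison. It assumes that the cross-validation estimate covers all 330 compounds and that predictions can be treated as independent trials; neither assumption is stated in the text.

```python
import math

# Figures from the text; N_EVALUATED is an assumption (all 330 compounds
# scored once across the cross-validation folds).
CV_ACCURACY = 0.63
N_EVALUATED = 330


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


def normal_interval(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval around p (z = 1.96 gives about 95%)."""
    half_width = z * binomial_standard_error(p, n)
    return p - half_width, p + half_width


se = binomial_standard_error(CV_ACCURACY, N_EVALUATED)
low, high = normal_interval(CV_ACCURACY, N_EVALUATED)

print(f"standard error: {se:.4f}")
print(f"approximate 95% interval: {low:.3f} to {high:.3f}")
```

An interval several percentage points wide on either side helps explain why differences of a few points between SAR methods on this data set may not be statistically significant.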