Prediction of rodent carcinogenicity bioassays (OxUni)

Application domain: Prediction of rodent carcinogenicity bioassays
Further specification: Data set describing 330 chemicals with atom/bond facts
Data complexity: 544 KB (ASCII)
Data format: Progol

The data

The problem is to predict the carcinogenicity of a set of 330 diverse organic compounds. This dataset is based on years of work by the US National Toxicology Program (NTP) at the National Institute of Environmental Health Sciences. Anyone who obtains good predictive results on the data is urged to publish them and to take part in the next round of NTP predictions. The data were obtained by testing the chemicals on rodents; each trial takes several years and hundreds of animals. The data were therefore very expensive to obtain, both in terms of money and pain to the animals. It is very important to be able to predict the carcinogenicity of chemicals because there are millions in the environment for which no tests have been done or are likely to be done.

An understanding of the molecular mechanisms of chemical carcinogenesis is central to the prevention of many environmentally induced cancers. One approach is to form Structure-Activity Relationships (SARs) that empirically relate molecular structure with the ability to cause cancer. This work has been greatly advanced by the long-term carcinogenicity tests of compounds in rodents conducted by the National Toxicology Program at the National Institute of Environmental Health Sciences. These tests have resulted in a database of more than 300 chemicals that have been shown to be carcinogenic or otherwise. These compounds present a problem for many conventional SAR techniques because (1) they are structurally extremely diverse; and (2) many different molecular mechanisms are thought to be involved. Most conventional SAR methods are designed to deal with compounds having a common molecular template and similar molecular mechanisms of action. It is of interest to see if ILP programs, which do not make such assumptions, can obtain rules that accurately predict the causes of carcinogenic activity amongst these molecules.

The data are important as they represent a wider class of molecules than those considered previously by any ILP algorithm. Further, the volume of data is also greater than that attempted earlier (for example, in the mutagenicity experiments with Progol).

The compilation of 330 chemicals was taken from the collective database of the National Cancer Institute and National Toxicology Program (Huff and Haseman 1991). The compounds used were all the organic compounds that had completed NTP reports at the time of this work. Inorganic compounds were not included as it was considered that there are too few of them to allow meaningful generalisations. Of the 330 compounds, 182 (55 %) are classified carcinogenic, and the remaining 148 non-carcinogenic. Carcinogenicity is determined by analysis of long-term rodent bioassays. Compounds classified by the NTP as equivocal are considered non-carcinogenic; this allows direct comparison with other predictive methods. No analysis was made of differences in incidence between rat and mouse cancer, or the role of sex, or particular organ sites.

The experiments

The training set consisted of 291 compounds, 161 (55 %) carcinogens and 130 non-carcinogens; the remaining 39 of the 330 compounds formed the test set. In addition to this train/test split, a five-fold cross-validation split of the 330 compounds was tested to obtain a more reliable estimate of the accuracy of Progol. The compounds were randomly split into 5 sets, and Progol was successively trained on 4 of the sets and tested on the remaining one.
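
The cross-validation procedure described above can be sketched in a few lines of Python. This is only an illustration of a random five-way partition, not the split actually used in the experiments: the compound identifiers, the function name `five_fold_splits`, and the random seed are all assumptions made for the example.

```python
import random


def five_fold_splits(compound_ids, k=5, seed=0):
    """Randomly partition the ids into k folds and yield (train, test) pairs.

    Each fold serves as the test set exactly once; the other k - 1 folds
    together form the training set.
    """
    ids = list(compound_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    rng.shuffle(ids)
    folds = [ids[i::k] for i in range(k)]  # k near-equal slices of the shuffle
    for i in range(k):
        test = folds[i]
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, test


# Hypothetical identifiers standing in for the 330 compounds.
compounds = [f"c{n}" for n in range(1, 331)]
splits = list(five_fold_splits(compounds))

for train, test in splits:
    print(len(train), len(test))
```

With 330 compounds and 5 folds, every round trains on 264 compounds and tests on 66, and each compound is used for testing exactly once.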

Using the same background knowledge of generic chemical structures as that supplied for earlier experiments on predicting mutagenicity, SAR rules obtained with Progol were compared against other widely used SAR methods. Accuracies of 64% (train/test split) and 63% (cross-validation) were achieved by Progol. No SAR models (of either human or machine origin) were significantly more accurate than Progol. Progol's SAR was also the most accurate method that did not use data from biological tests on rodents (these data were not available). A set of structural alerts for carcinogenicity was generated automatically and the chemical rationale for them investigated. Unlike other SAR methods, the Progol alerts are statistically independent of those available from an existing carcinogenesis test based on Salmonella mutagenicity.
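
As a point of reference for the accuracies quoted above, the class counts given in this description can be turned into a majority-class baseline, that is, the accuracy of always predicting the more frequent class. The sketch below only does this arithmetic on the counts stated in the text; the helper name `majority_baseline` is an assumption of the example, and the baseline is not part of the published comparison.

```python
# (carcinogens, total compounds) as stated in this description.
counts = {
    "all compounds": (182, 330),
    "training set": (161, 291),
}


def majority_baseline(positives, total):
    """Accuracy obtained by always predicting the more frequent class."""
    return max(positives, total - positives) / total


baselines = {name: majority_baseline(p, t) for name, (p, t) in counts.items()}

for name, value in baselines.items():
    print(f"{name}: {value:.1%}")
```

Both baselines come out at roughly 55 %, which matches the carcinogen proportions reported above.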


  1. Huff, J. and Haseman, J. Long-term chemical carcinogenesis experiments for identifying potential human cancer hazards. Environmental Health Perspectives, 96: 23-31, 1991.
  2. Ross D. King and Ashwin Srinivasan. Prediction of Rodent Carcinogenicity Bioassays from Molecular Structure using Inductive Logic Programming. Environmental Health Perspectives, 104(5): 1031-1040, 1996.
