Application domain: | Predicting 3D Protein Structure |
Further specification: | Data sets |
Pointers: | Contact Igor Mozetic Igor.Mozetic@ijs.si |
Data complexity: | |
Data format: | Prolog |
One of the greatest challenges in molecular biology is predicting the 3D structure of a protein from the sequence of amino acids in its chain (the primary structure). The 3D structure is important for determining protein function and can be used to modify that function, for example when developing drugs. So far, the prevailing method for protein structure determination is X-ray diffraction. However, the method is very expensive and time-consuming, and is sometimes inapplicable because of problems with protein crystallization. At the moment, a high-resolution 3D structure is known for about 500 proteins, but more than 50,000 proteins are known. A major problem for the pharmaceutical and medical sciences is that the number of known proteins grows much faster than the number of proteins with a known 3D structure.
Protein structures can be represented by the following hierarchical scheme:
The goal of the joint work with the end user, the National Institute of Chemistry, Ljubljana (Dr. Milan Hodoscek), is to apply ILP to the difficult problem of protein structure prediction. Starting from a database of about 500 proteins with known structures, the aim is to find general rules valid for protein structures by using ILP methods. These rules will then be used to determine 3D protein structures, either from experimental data or with theoretical methods. In this way we intend to automate, at least partially, a very time-consuming process and consequently reduce the time and research costs of protein structure prediction.
The Brookhaven Protein Data Bank (PDB) contains the 3D coordinates of all atoms of around 5000 proteins, but the majority of them are only small variations of 500 considerably different proteins (PDB, Protein Data Bank, Brookhaven National Lab, USA; http://www.pdb.bnl.gov). Our approach is based on a systematic analysis of the PDB. The database analysis and the prediction of secondary structure consist of several steps. The approach is iterative, however, since the results of later steps typically require some earlier steps to be repeated. Therefore, we will try to automate the individual steps as much as possible.
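To make the notion of "3D coordinates of all atoms" concrete, here is a minimal Python sketch that reads the fixed-column ATOM records of a PDB-format file. It is an illustration only, not the project's own tooling: the names Atom, parse_atom_line and parse_atoms are hypothetical, and the sketch assumes the standard PDB column layout (serial number in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and x, y, z in 31-54).

```python
from typing import Iterable, List, NamedTuple, Optional


class Atom(NamedTuple):
    """One ATOM record (hypothetical container, not part of the PDB itself)."""
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Optional[Atom]:
    """Parse one fixed-column ATOM record; return None for any other line."""
    # Only ATOM records are handled; lines too short to hold z are skipped.
    if not line.startswith("ATOM  ") or len(line) < 54:
        return None
    return Atom(
        serial=int(line[6:11]),           # columns 7-11
        name=line[12:16].strip(),         # columns 13-16
        residue=line[17:20].strip(),      # columns 18-20
        chain=line[21],                   # column 22
        residue_number=int(line[22:26]),  # columns 23-26
        x=float(line[30:38]),             # columns 31-38
        y=float(line[38:46]),             # columns 39-46
        z=float(line[46:54]),             # columns 47-54
    )


def parse_atoms(lines: Iterable[str]) -> List[Atom]:
    """Collect every ATOM record found in an iterable of lines."""
    atoms = []
    for line in lines:
        atom = parse_atom_line(line)
        if atom is not None:
            atoms.append(atom)
    return atoms
```

Passing an open file object to parse_atoms would return one Atom per ATOM record. Heteroatoms (HETATM), alternate locations and multi-model files are deliberately ignored in this sketch.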
The current work is concerned with selecting a subset of the PDB that represents considerably different proteins; this subset will be used for data analysis with a variety of methods, including ILP. The transformation of this data set into symbolic form, i.e. into Prolog (the so-called Symbolic Protein Data Base, SPDB), is currently under way.
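As a rough illustration of what a transformation into symbolic form could look like, the following Python sketch writes a residue sequence as Prolog facts. The actual SPDB schema is not described here, so the predicate residue/4 and the helper names prolog_atom and residue_facts are assumptions made purely for this example.

```python
from typing import Iterable, Iterator, Tuple


def prolog_atom(text: str) -> str:
    """Quote text as a Prolog atom, escaping backslashes and single quotes."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def residue_facts(protein_id: str, chain: str,
                  residues: Iterable[Tuple[int, str]]) -> Iterator[str]:
    """Yield one hypothetical residue/4 fact per (position, amino_acid) pair.

    Assumed fact shape: residue(ProteinId, Chain, Position, AminoAcid).
    """
    for position, amino_acid in residues:
        yield "residue(%s, %s, %d, %s)." % (
            prolog_atom(protein_id),
            prolog_atom(chain),
            position,
            prolog_atom(amino_acid.lower()),
        )


if __name__ == "__main__":
    # Illustrative input only: a made-up identifier and two residues.
    for fact in residue_facts("demo", "A", [(1, "THR"), (2, "CYS")]):
        print(fact)
```

Every name is quoted so that identifiers beginning with a digit or an upper-case letter are still read as atoms rather than as numbers or variables. A real SPDB would presumably also encode structural information, which this sketch omits.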