Predicting 3D Protein Structure (LAI)

Application domain:	Predicting 3D Protein Structure
Further specification:	Data sets
Pointers:	Contact Igor Mozetic Igor.Mozetic@ijs.si
Data complexity:
Data format:	Prolog

One of the greatest challenges in molecular biology is predicting the 3D structure of a protein from the sequence of amino acids in its chain (the primary structure). The 3D structure is important for determining protein function and can be used to modify that function, for example in drug design. So far, the prevailing method for protein structure determination is X-ray diffraction. However, the method is very expensive and time-consuming, and is sometimes inapplicable because of problems with protein crystallization. At the moment, a high-resolution 3D structure is known for about 500 proteins, but there are more than 50,000 known proteins. A major problem for the pharmaceutical and medical sciences is that the number of known proteins grows much faster than the number of proteins with a known 3D structure.

Protein structures can be represented by the following hierarchical scheme:

primary - the sequence of amino acids that form a protein.
secondary - each amino acid is either in an alpha-helix, a beta-sheet, a turn, a loop, etc.
tertiary - determines the positions of the elements of the secondary structure, i.e., the absolute Cartesian coordinates of each protein atom.
quaternary - for proteins that contain several amino acid chains, this structure represents their spatial relations.
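
As a rough illustration of the four levels above, the sketch below models them as nested Python records. The class and field names (Atom, Residue, Chain, Protein) and the secondary-structure labels are assumptions made for this example; they are not the format of the data set.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Atom:
    name: str  # atom label such as "CA" (illustrative)
    x: float   # tertiary level: absolute Cartesian coordinates
    y: float
    z: float


@dataclass
class Residue:
    amino_acid: str  # primary level: one amino acid of the sequence
    secondary: str   # secondary level: e.g. "helix", "sheet", "turn", "loop"
    atoms: List[Atom] = field(default_factory=list)


@dataclass
class Chain:
    chain_id: str
    residues: List[Residue] = field(default_factory=list)

    def primary(self) -> List[str]:
        """Return the amino acid sequence of this chain."""
        return [r.amino_acid for r in self.residues]

    def secondary(self) -> List[str]:
        """Return the secondary-structure label of each amino acid."""
        return [r.secondary for r in self.residues]


@dataclass
class Protein:
    name: str
    chains: List[Chain] = field(default_factory=list)

    def has_several_chains(self) -> bool:
        """Spatial relations between chains only exist with more than one chain."""
        return len(self.chains) > 1


# Made-up toy protein with a single two-residue chain.
demo = Protein(
    "demo",
    [Chain("A", [
        Residue("ala", "helix", [Atom("CA", 0.0, 0.0, 0.0)]),
        Residue("gly", "turn"),
    ])],
)
print(demo.chains[0].primary(), demo.chains[0].secondary())
```

Keeping the coordinates on the atoms and the labels on the residues mirrors the scheme: each higher level adds information to the one below it without replacing it.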

There is a large number of possible conformations when one tries to determine the 3D protein structure from its primary structure, but only one of them is actually stable. For each amino acid in a protein backbone, there are two atomic bonds with a very low energy barrier for rotation around them. The final protein structure is therefore determined by interactions between the side chains of consecutive amino acids. Some sequences of the primary structure fold into elements of the secondary structure, and these in turn fold into elements of the tertiary and/or quaternary structure. Eventually, the position of each protein atom is determined by three coordinates in space. In reality, folding from the initial random coil to the final structure takes on the order of a second. This means that the folding process cannot be simulated with molecular dynamics methods, since they would require several million years of computer time on the fastest supercomputers available today.
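
To give a feel for the scale of the numbers involved, the following back-of-the-envelope sketch counts conformations and simulation steps. It assumes that each of the two rotatable backbone bonds per amino acid can take only a small number of discrete states, and that a molecular dynamics simulation advances in steps of one femtosecond. The chain length of 100, the 3 states per bond, and the 1 fs step are illustrative assumptions, not figures from the text.

```python
FEMTOSECONDS_PER_SECOND = 10 ** 15


def count_conformations(num_residues: int, states_per_bond: int) -> int:
    """Backbone conformations with two rotatable bonds per amino acid.

    Each bond is assumed to take one of `states_per_bond` discrete states.
    """
    return states_per_bond ** (2 * num_residues)


def md_steps(folding_time_fs: int, timestep_fs: int) -> int:
    """Number of simulation steps needed to cover the folding time."""
    return folding_time_fs // timestep_fs


# Illustrative values only: 100 amino acids, 3 states per bond, 1 fs step.
conformations = count_conformations(num_residues=100, states_per_bond=3)
steps = md_steps(FEMTOSECONDS_PER_SECOND, timestep_fs=1)
print(f"{conformations:.3e} conformations, {steps:.0e} simulation steps")
```

Even under these coarse assumptions, exhaustive enumeration is out of reach, and a folding time on the order of a second corresponds to a very large number of simulation steps.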

The goal of the joint work with the end user, the National Institute of Chemistry, Ljubljana (Dr. Milan Hodoscek), is to apply ILP to the difficult problem of protein structure prediction. From a database of about 500 proteins with known structures, the goal is to find general rules valid for protein structures by using ILP methods. These rules will then be used to determine 3D protein structures either from experimental data or with theoretical methods. In this way we intend to automate, at least partially, a very time-consuming process and consequently reduce the time and research costs needed for protein structure prediction.

The Brookhaven Protein Data Bank (PDB) contains the 3D coordinates of all atoms of around 5000 proteins, but the majority of them are only small variations of 500 considerably different proteins (PDB, Protein Data Bank, Brookhaven National Lab, USA; http://www.pdb.bnl.gov). Our approach is based on a systematic analysis of the PDB. The database analysis and the prediction of the secondary structure consist of several steps. The approach is iterative, however, since the results of later steps typically require some earlier steps to be repeated. Therefore, we will try to automate individual steps as much as possible.
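
Because most PDB entries are small variations of a much smaller set of distinct proteins, one way to picture reducing the database to representatives is a greedy filter over amino acid sequences. The sketch below is only an assumption about how such a filter could look: the text does not say how "considerably different" is measured, so the similarity function (the ratio from Python's difflib), the threshold, and the toy sequences are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Crude sequence similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def select_representatives(
    proteins: List[Tuple[str, str]], threshold: float
) -> List[str]:
    """Greedily keep a protein only if it is below `threshold` similarity
    to every protein kept so far. Input is a list of (name, sequence)."""
    kept: List[Tuple[str, str]] = []
    for name, seq in proteins:
        if all(similarity(seq, kept_seq) < threshold for _, kept_seq in kept):
            kept.append((name, seq))
    return [name for name, _ in kept]


# Made-up toy sequences: p2 is a one-letter variation of p1.
toy = [("p1", "ACDEFGHIK"), ("p2", "ACDEFGHIR"), ("p3", "WWWWYYYYY")]
print(select_representatives(toy, threshold=0.8))
```

The result of a greedy filter depends on the order of the input, which is one reason such a selection step may need to be repeated when later analysis steps change the criteria.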

The current work is concerned with selecting a subset of the PDB that represents considerably different proteins, which will be used for data analysis with a variety of methods, including ILP. The transformation of this data set into symbolic form, i.e., Prolog (the so-called Symbolic Protein Data Base, SPDB), is currently under way.
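
To make the idea of a symbolic, Prolog-based form more concrete, the sketch below turns one atom record into a Prolog fact. The predicate name atom/7, its argument order, and the sample values are hypothetical, since the actual SPDB schema is not described here.

```python
import re

# Unquoted Prolog atoms start with a lowercase letter.
_PLAIN_ATOM = re.compile(r"[a-z][A-Za-z0-9_]*")


def prolog_atom(text: str) -> str:
    """Return `text` as a Prolog atom, adding single quotes when required."""
    if _PLAIN_ATOM.fullmatch(text):
        return text
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def atom_fact(protein: str, residue_index: int, amino_acid: str,
              atom_name: str, x: float, y: float, z: float) -> str:
    """Format one protein atom as a fact of the hypothetical predicate atom/7."""
    return "atom({}, {}, {}, {}, {:.3f}, {:.3f}, {:.3f}).".format(
        prolog_atom(protein), residue_index, prolog_atom(amino_acid),
        prolog_atom(atom_name), x, y, z,
    )


# Made-up record, not taken from the PDB.
print(atom_fact("demo1", 1, "ala", "ca", 1.0, 2.5, -3.25))
```

Identifiers that start with a digit or an uppercase letter are quoted, because an unquoted Prolog term starting with an uppercase letter would be read as a variable.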

Reference

I. Mozetic and M. Hodoscek. Symbolic Protein Data Base. Technical Report, Jozef Stefan Institute, 1997.
