Analysis of Data from Protein Interaction Assays: Report on the data analysis workshop, 16-17 May, 2000, Ljubljana, Slovenia

Kathy Astrahantseff, Saso Dzeroski and Ashwin Srinivasan

The data analysis problem can be defined as follows. The data are results obtained using the yeast two-hybrid system for identifying protein-protein interactions (Fields and Song, 1989). The protein of interest (the bait) is expressed as a hybrid protein, fused to one half of an indicator of interaction, in modified yeast cells together with a library of proteins (the prey) expressed as hybrid proteins fused to the other half of the indicator. Interaction of the bait protein with a prey protein brings the two halves of the indicator system together and activates reporter genes. The DNA encoding the prey can be isolated from the yeast cells that have activated reporter genes, and the protein sequence can be identified from the DNA sequence. This system is highly sensitive, but reports many spurious or artifactual bait-prey interactions as real. These are referred to as "false positives" in bio-assay terminology. Each protein-protein interaction must be confirmed using an independent assay to eliminate these false positives, a process that is expensive and often requires many months. The aim of the data analysis workshop was to find characteristics of false positives that would enable automated screening of protein assay results and thus reduce the need for additional confirmation studies.

The data included:

The amino acid sequences of 52 prey proteins,
The bait protein used for each reported prey,
The identification of the prey protein using the homology search algorithm BLAST (Altschul et al., 1997) and
The result of the confirmation assay when one had been carried out.

The interaction of the bait protein with 7 of the 52 prey proteins had been confirmed (these proteins are termed correct). Nine were considered candidates that potentially interact with the bait (termed potential). The remaining 36 were considered unlikely to interact with the bait on the basis of four criteria (these were termed incorrect). The criteria draw on documented information about the two-hybrid system, published information about the bait and prey proteins, and general information about protein-protein interaction:

DNA translating to very short amino acid sequences (fewer than 20 amino acids long),
Proteins resulting from an out of frame translation of a known gene,
Proteins resulting from translation of DNA in the upstream (5') or downstream (3') untranslated regions of known genes,
Mitochondrial or ribosomal proteins.
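
The four criteria above can be read as a simple rule-based filter. The following is a minimal sketch in Python, not the procedure used at the workshop: the `Prey` record and its boolean fields are hypothetical names, and it assumes the out-of-frame, UTR and mitochondrial/ribosomal annotations have already been derived from the BLAST identification.

```python
from dataclasses import dataclass
from typing import List

MIN_LENGTH = 20  # sequences shorter than this are considered unlikely to interact


@dataclass
class Prey:
    """Hypothetical record for one reported prey protein."""
    sequence: str
    out_of_frame: bool = False                # out-of-frame translation of a known gene
    in_utr: bool = False                      # translated from a 5' or 3' untranslated region
    mitochondrial_or_ribosomal: bool = False


def unlikely_reasons(prey: Prey) -> List[str]:
    """Return the criteria (if any) that mark a prey as an unlikely interactor."""
    reasons = []
    if len(prey.sequence) < MIN_LENGTH:
        reasons.append("shorter than 20 amino acids")
    if prey.out_of_frame:
        reasons.append("out-of-frame translation")
    if prey.in_utr:
        reasons.append("translated from untranslated region")
    if prey.mitochondrial_or_ribosomal:
        reasons.append("mitochondrial or ribosomal protein")
    return reasons
```

A prey with an empty list of reasons is not flagged by these criteria; in the data set, such a protein would be either correct or potential.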

After an initial discussion, it was decided to consider three data analysis tasks:

Classification of proteins as correct (correctly reported as interacting) and incorrect (incorrectly reported as interacting) using decision trees.
Clustering of proteins, which should reveal similarities between the candidates that potentially interact on the one hand and the correct or incorrect proteins on the other.
Detecting motifs that appear in the incorrect proteins and could help filter false positives (incorrects).

The third task was identified by the domain expert as the most desirable approach to the problem. A propositional representation was considered for the first two tasks and a relational representation for the third.

The propositional representation specified, for each of a set of properties, the proportion of amino acids in the sequence that have that property (hydrophobic, very hydrophobic, hydrophilic, positive, negative, neutral, large, small, tiny, polar, aliphatic, aromatic, hydrogen bond donor, hydrogen bond acceptor). The bait was also used as an attribute. Initially, the length of the sequence and the proportions of individual amino acids in the sequence were also considered, but these resulted in rules that were difficult for the domain expert to interpret.
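
As an illustration of this representation, the sketch below computes a few such proportions for a sequence given in one-letter amino acid code. It is a hedged example only: the report does not list which residues belong to each property, so the groupings in `PROPERTY_SETS` are assumptions, and only four of the fourteen properties are shown.

```python
from typing import Dict, Set

# Assumed residue groupings, for illustration only; the report does not
# say which amino acids were assigned to each property.
PROPERTY_SETS: Dict[str, Set[str]] = {
    "negative": set("DE"),
    "positive": set("KR"),
    "aromatic": set("FWY"),
    "tiny": set("AGS"),
}


def property_proportions(sequence: str) -> Dict[str, float]:
    """Map each property to the fraction of residues in the sequence that have it."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return {
        f"proportion_{name}": sum(residue in members for residue in seq) / len(seq)
        for name, members in PROPERTY_SETS.items()
    }
```

Each protein then becomes one row of attribute values (plus the bait), which is the form a decision tree learner or a clustering method expects.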

See5 was applied to this representation to induce a tree that discriminates between proteins that correctly and incorrectly trigger enzyme signals (there were thus 7 correct and 36 incorrect examples). The unlabeled examples (potentials) were then classified using the induced tree given in Table 1.

**Table 1:** A decision tree for distinguishing between confirmed interactions (correct) and likely "false positives" (incorrect).

proportion_large > 0.468... ...egative > 0.03623188: correct (7.0)

The tree gives two further conditions that may predict the proteins that inappropriately activate the reporter gene: proportion_large > 0.4680851 OR proportion_negative <= 0.03623188. Applied to the unlabeled potentials, the tree predicted only the proteins M14.3 and M7.2 as correctly activating the reporter gene.
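
The two conditions can be written as a small classification function. This is a sketch that assumes the induced tree contains no tests other than the two thresholds quoted above; the function and constant names are illustrative.

```python
LARGE_THRESHOLD = 0.4680851      # threshold on proportion_large, from the tree
NEGATIVE_THRESHOLD = 0.03623188  # threshold on proportion_negative, from the tree


def classify(proportion_large: float, proportion_negative: float) -> str:
    """Label a protein 'incorrect' if either condition holds, else 'correct'."""
    if proportion_large > LARGE_THRESHOLD or proportion_negative <= NEGATIVE_THRESHOLD:
        return "incorrect"
    return "correct"
```

For example, `classify(0.30, 0.10)` returns "correct", since neither condition is met.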

The clustering puts all correct proteins except M1.1 in one cluster. The protein M13.3 is considered more similar to the correct proteins than M1.1 is. This identifies M13.3 as the most likely candidate for correctly activating the reporter gene, with M100 and M4 the next most likely.

We first attempted to detect motifs common to the incorrects using a standard motif-finding program (MEME, available from the San Diego Supercomputer Center). This program can only find motifs consisting of contiguous sequences of amino acids (and not sequences defined by properties of amino acids or motifs involving gaps). MEME did not find any useful motifs.

Initial experiments were then performed using inductive logic programming (ILP) to detect more general motifs (involving properties of amino acids and/or gaps) common to the incorrects. Six motifs were found that were sufficient to discriminate the incorrects from the corrects. The first one found was:


[Rule 1] [Pos cover = 9 Neg cover = 0]
motif(A) :-
  	run(very_hydrophobic,A,B), run(neutral,B,C).
[pos-neg] [9]

The rule states that the protein should start with a contiguous run of very hydrophobic amino acids, immediately followed by a run of neutral amino acids. This result is very preliminary, but looks promising to the domain expert.
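
To make the rule concrete, the sketch below checks a sequence for this motif with a regular expression. It rests on several assumptions that the report does not confirm: the residue sets for "very hydrophobic" and "neutral" are illustrative choices, a run is taken to be one or more residues, and the motif is anchored at the start of the sequence.

```python
import re

# Assumed residue sets, for illustration only.
VERY_HYDROPHOBIC = "FILMV"
NEUTRAL = "ACFGILMNPQSTVWY"  # every residue except D, E, H, K and R

# A run of very hydrophobic residues at the start of the sequence,
# immediately followed by a run of neutral residues.
MOTIF = re.compile(f"^[{VERY_HYDROPHOBIC}]+[{NEUTRAL}]+")


def has_motif(sequence: str) -> bool:
    """Return True if the sequence starts with the motif described by Rule 1."""
    return MOTIF.match(sequence.upper()) is not None
```

Under these assumptions the very hydrophobic residues are themselves neutral, so the two runs can overlap in what they accept; a stricter reading of `run` (for example, maximal runs only) would need a different pattern.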

As further work, the predictions of the induced tree and the clustering should be verified once the classes of the unlabeled proteins become available from confirmatory assays. They should also be evaluated on additional proteins from two-hybrid screening data. Motifs involving secondary structures and properties of pairs of residues (doublets) should be investigated as well.

References:

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W and Lipman DJ. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs" (1997), Nucleic Acids Res. 25:3389-3402.
Fields S and Song O. "A novel genetic system to detect protein-protein interactions" (1989), Nature 340:245-246.

**Figure:** Summary of the results from four separate two-hybrid screening events to identify proteins binding to portions of or the entire bait protein. The Y-axis gives the number of interacting proteins. Abbreviations used: mit. = mitochondrial, rib. = ribosomal, oof = out of frame, UTR = untranslated regions, aa = amino acids, C-term = carboxy terminus and N-term = amino terminus.
