The data analysis problem can be defined as follows. The data are results obtained using the yeast two-hybrid system for identifying protein-protein interactions (Fields and Song, 1989). The protein of interest (bait) is expressed as a hybrid protein (fused to a protein used as an indicator of interaction) in modified yeast cells together with a library of proteins expressed as hybrid proteins (fused to the other half of the indicator of interaction). The interaction of the bait protein with a prey protein brings the two halves of the indicator system close together and causes the activation of reporter genes. The DNA encoding the prey can be isolated from the yeast cells that have activated reporter genes, and the protein sequence can be identified based on the DNA sequence. This system is highly sensitive, but reports many spurious or artifactual bait-prey interactions as being real. These are referred to as "false positives" in bio-assay terminology. It is necessary to confirm each protein-protein interaction using an independent assay to eliminate these false positives, a process that is expensive and often requires many months. The aim of the data analysis workshop was to find characteristics of false positives that would enable automated screening of protein assay results for false positives and thus reduce the need for additional confirmation studies.
The data included
The interaction of the bait protein with 7 of the 52 prey proteins had been confirmed (these proteins are termed correct). Nine were considered as candidates that potentially interact with the bait (termed potential). The remaining 36 were considered unlikely to interact with the bait based on 4 criteria (these were termed incorrect). These criteria were designed to include documented information about the 2-hybrid system, published information about the bait and prey proteins and general information about protein-protein interaction:
After an initial discussion, it was decided to consider three data analysis tasks:
The third task was identified by the domain expert as the most desirable approach to the problem. A propositional representation was considered for the first two tasks and a relational representation for the third.
A propositional representation was considered which specified the proportions of amino acids in the sequence with some property (hydrophobic, very hydrophobic, hydrophilic, positive, negative, neutral, large, small, tiny, polar, aliphatic, aromatic, hydrogen bond donor, hydrogen bond acceptor). The bait was also used as an attribute. Initially, the length of the sequence was considered, as well as the proportions of individual amino acids in the sequence, but these resulted in rules that were difficult to interpret by the domain expert.
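The propositional attributes described above can be sketched as follows. This is a minimal illustration, not the code used in the workshop: the function name `property_proportions` and the property groups are assumptions (only four of the fourteen properties are shown, with one common textbook-style assignment of residues), and only the `proportion_` attribute naming follows the attribute names reported in the text.

```python
from typing import Dict, Set

# Hypothetical property groups using one-letter amino acid codes.
# These memberships are illustrative assumptions, not the groups used in the study.
PROPERTY_GROUPS: Dict[str, Set[str]] = {
    "positive": set("KRH"),
    "negative": set("DE"),
    "aromatic": set("FWYH"),
    "tiny": set("AGS"),
}


def property_proportions(
    sequence: str, groups: Dict[str, Set[str]] = PROPERTY_GROUPS
) -> Dict[str, float]:
    """Return, for each property, the fraction of residues that have it."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must contain at least one residue")
    return {
        f"proportion_{name}": sum(residue in members for residue in seq) / len(seq)
        for name, members in groups.items()
    }


if __name__ == "__main__":
    # Toy sequence, not one of the 52 prey proteins.
    print(property_proportions("DEKA"))
```

Each protein then becomes one fixed-length row of proportions (plus the bait attribute), which is the form a tree learner such as See5 expects.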
See5 was applied to this representation to induce a tree that discriminates between proteins that correctly and incorrectly trigger enzyme signals (there were thus 7 correct and 36 incorrect examples). The unlabeled examples (potentials) were then classified using the induced tree given in Table 1.
The tree gives two more conditions that can potentially predict the proteins that inappropriately activate the reporter gene. The conditions are the following: proportion_large > 0.4680851 OR proportion_negative <= 0.03623188. These predicted only the proteins M14.3 and M7.2 as correctly activating the reporter gene.
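Applied as a screening test, the two conditions amount to a simple disjunction. The sketch below assumes that a protein matching either condition is flagged as incorrect and that the remaining proteins are treated as candidates for a correct interaction; the function name `predicted_incorrect` is hypothetical, and the thresholds are the ones quoted above.

```python
# Thresholds as reported for the induced tree.
LARGE_THRESHOLD = 0.4680851
NEGATIVE_THRESHOLD = 0.03623188


def predicted_incorrect(proportion_large: float, proportion_negative: float) -> bool:
    """Flag a protein as a likely false positive if either condition holds."""
    return (
        proportion_large > LARGE_THRESHOLD
        or proportion_negative <= NEGATIVE_THRESHOLD
    )


if __name__ == "__main__":
    # Toy attribute values, not measurements from the data set.
    print(predicted_incorrect(0.50, 0.10))  # first condition holds
    print(predicted_incorrect(0.30, 0.10))  # neither condition holds
```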
The clustering puts all corrects in one cluster, except M1.1. The protein M13.3 is considered to be more similar to the corrects than M1.1. This identifies M13.3 as the most likely candidate for correctly activating the reporter gene, with M100 and M4 being the next most likely ones.
We first attempted to detect motifs common to the incorrects using a standard motif-finding program (MEME, available from the San Diego Supercomputer Center). This program can only find motifs consisting of contiguous sequences of amino acids (and not sequences defined by properties of amino acids or motifs involving gaps). MEME did not find any useful motifs.
Initial experiments were performed using inductive logic programming (ILP) to detect more general motifs (involving properties of amino acids and/or gaps) common to the incorrects. Six motifs were found that were sufficient to discriminate the incorrects from the corrects. The first one found was:
[Rule 1] [Pos cover = 9 Neg cover = 0] motif(A) :- run(very_hydrophobic,A,B), run(neutral,B,C). [pos-neg] [9]
It states that the protein should start with a contiguous sequence of very hydrophobic amino acids, immediately followed by a sequence of neutral amino acids. This result is very preliminary, but it looks promising to the domain expert.
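To make the reading of this rule concrete, the following sketch checks whether a sequence starts with such a motif. It rests on several assumptions that the rule itself does not settle: the residue sets `VERY_HYDROPHOBIC` and `NEUTRAL` are illustrative, each `run` is taken to mean one or more residues, and the function name `starts_with_motif` is hypothetical. It is not the ILP system's own matching procedure.

```python
# Illustrative residue sets (one-letter codes); assumptions, not the study's definitions.
VERY_HYDROPHOBIC = set("AFILMVW")
NEUTRAL = set("ACFGILMNPQSTVWY")  # residues assumed to carry no charge


def starts_with_motif(sequence: str, first=VERY_HYDROPHOBIC, second=NEUTRAL) -> bool:
    """True if the sequence begins with a run from `first` followed by a run from `second`.

    Each run is assumed to contain at least one residue.
    """
    seq = sequence.upper()
    for split in range(1, len(seq)):
        if seq[split - 1] not in first:
            return False  # the leading run ended before a valid split point
        if seq[split] in second:
            return True  # prefix is all `first`, next residue starts the second run
    return False


if __name__ == "__main__":
    # Toy sequences, not proteins from the data set.
    print(starts_with_motif("IVG"))
    print(starts_with_motif("KIV"))
```

Because the two assumed sets overlap, the check tries every split point within the leading run rather than only the end of the longest run.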
As further work, the predictions by the induced tree and clustering should be verified once the classes of the unlabeled proteins become available as results of confirmatory assays. They should also be evaluated on additional proteins resulting from 2-hybrid screening data. Motifs involving secondary structures should also be investigated. Properties of pairs of residues (doublets) should also be tried out.