Introduction:

With the advancement of the human genome project, great amounts of information concerning the human DNA sequence are gathered. The value of this information is very limited if we don't know (understand) the biological meaning (significance) of the protein it expresses. The 3 dimensional structure of protein gives us a better idea (understanding) of its function. The DNA sequence itself is not enough to produce functional information because it describes only the linear sequence (arrangement) of the proteins.

Experimental discovery of the 3 dimensional structure of a protein is a very sophisticated and expensive procedure and therefore is not performed on all known proteins. Many scientists invest time (conduct researches) to find techniques for prediction of the 3D structure of a protein. Such a method will produce a prediction of a 3D structure for a protein based on a given linear sequence of amino acids (or DNA bases). Such techniques will accelerate the understanding of human genetic information and will make treatment of genetic diseases and disabilities possible.

Another important aspect of greater economical significance is the possibility of engineering medicines. When there is a need for the development of a medicine for a certain phenomena, by defining the 3D (space) structure of a desired artificial protein for the treatment of the (cause of the) problem, it will be possible to determine the sequence of amino acids needed to produce such a drug using these techniques.

Description of the problem:
In order to give quantitative evaluation for the significance (relevance) of a given prediction there is a need to establish (determine) a good meter for the similarity between the known structure of a protein and its predicted structure. This measurement is important in determining the effectiveness of a certain prediction method and to give clues about further directions for development. We wish to find maximal 3 dimensional fitting (adjustment) between segments of the predicted and the actual proteins, using rotation and translocation of the coordinates of the prediction data.

Comparison between a prediction and the actual 3D structure of a protein (also referred to as 'model') is possible only after the complicated expensive techniques for reveling the 3d structure as mentioned above are done in the labs. Before a new proteins' 3D structure is found, different methods for prediction are used. After the actual 3D structure is reveled it is possible to compare the prediction and actual structures in order to verify the validity of the method.
Till now the conventional algorithm used for comparing these structures takes all amino acids of the protein in consideration:
· A transformation (rotation and translation) is done on the predicted structure in order to get a superposition of the structures so that the RMS distance between the acids of the prediction and the model is minimal. · The RMS itself distance is used as the meter for the quality of the prediction. RMS is defined so: let M be the group (of size n) of all amino acid coordinates in the prediction, P be the correlate group of amino acid coordinates in the model, T be the placement transposition giving the minimal RMS distance, then:
RMS = SÖ(Pi - T(Mi))2 / n
The main fault of this algorithm is its inability to differentiate between high correlation segments of the prediction with the model, to low correlation ones. As an example, take a certain prediction that is almost precise to the actual protein on 50 amino acids, but is totally erroneous with the rest. The flawed segment will have such an impact on the transformation, that the superposition will so little correlation if any. Important feedback is lost having only this information.

We are in search of a different algorithm in order to find a transformation that gives better correlation between segments of the prediction and the model. We want to design an accessible, simple, and user-friendly tool that will allow research personnel the use our algorithm.

Methods:
We are looking for a maximal sized sub group of amino acids, so that the RMS distance between those of the model and those of the prediction after transformation will be within a given threshold. That is, if we add to the above definitions e to be a given distance threshold, than we are looking for a group A as follows:
e > | Pi - T(Ai) | " i Î A, A Í M, max | A |

This maximal subgroup gives a better evaluation of the resemblance (and accuracy) of between the prediction and the model because it disregards erroneous parts of the prediction.

The RMS distance of this subgroup is not useful as a meter using our algorithm because the group might be fragmented or small, making it biologically less significant. Therefore a meaningful meter must consider gaps within the selected subgroup. 'Gl-score' is a meter for evaluating the predictions' quality that incorporates both amino acid distances and number of erroneous segments, so it serves as a better indication than RMS in our case. If we add to the above definitions di to be the distance between the ith amino acid in the model and the prediction after the transformation, and G to be the number of gaps (non connected segments of the protein in the subgroup) than:
Gl-score = 1/(1+(di/e)2) - G

This problem of finding a maximal subgroup under restrictions (within limitations or margins) is a difficult problem (NP complete). We are basing on characteristics of actual proteins to implement a heurism enabling us to find such a subgroup in polynomial running time.

Our heurism is based on the assumption that similar proteins would have at least one segment of at least 5 consecutive amino acids of structural resemblance. It is good enough to start searching for such consecutive fives withstanding the threshold restriction (running in linear time). For each of these fives we perform a greedy search to expand the group over the rest of the acids (running in linear time). So we have found a solution that runs in quadratic (!) time.

Conduct of work:
As mentioned the algorithm is significant in the field of protein structure prediction. We are expecting research groups to make use of the algorithm. We have implemented a user-friendly, easy-to-use, visual interface for that matter. We have built a web site that gives services of finding a maximal similitude between a given prediction and a known protein.

The user submits a structural protein prediction file using conventional standards. The user can submit the real proteins' structural file as found in the lab, or give it's name so the site will look for the structure file on net based databases.

We perform both the conventional algorithm and our heuristic one. We present the user both meters and transformations as rotation matrices and displacement rates. In addition, we present the user with 3 dimensional representations of the superposition of the prediction on the model, using both algorithms. This last output gives subjective visual impression to achieve a more comprehensive output.

During the research we have noticed that the set threshold has immense effect on determining the subgroup and the quality of the similitude. When ? is large (about 8 angstrom), group A contained a large number of the amino acids, and our algorithm acted with little difference to the existing one, yielding high RMS. When ? is small (about 2 angstrom), the chosen subgroup is too small to have any biological significance, despite the low RMS. Inaccuracy of measurement in lab procedures for determining protein 3D structure is about 2 angstrom in itself, making such a low threshold insignificant for any practice. We were in search for such a threshold that would yield the most meaningful results, consisting of a subgroup large enough, but of reasonable RMS. We have experimentally found that ? should be set between 3.5 and 4 angstrom.