Validation Data for Myristoylation Prediction

The following is a brief synopsis of the results presented in reference [1]. Please download the complete paper for details; much has been omitted from this page in the interest of brevity, and some complex issues have been oversimplified.

Performance assessment methods

Predictions made by our plant-specific hidden Markov model (HMM) algorithm were compared with those of four alternative prediction methods, using a pre-classified test set of plant-specific amino acid sequences. Test sequences were chosen based on a combination of biological and biochemical data from four independent sources: direct evidence for protein myristoylation in vivo, activity of peptide sequences as substrates for plant N-myristoyltransferase enzymes in vitro, subcellular membrane localization, and N-terminal sequence conservation.

Score frequency distribution histograms were used to compare the ability of the different algorithms to distinguish unambiguously between positive and negative test scores. The data show that our plant-specific HMM provides much clearer separation between positive and negative scores than two frequently used profile prediction algorithms based on a non-plant training set. An alternative neural network prediction method gave even wider score separation than the plant-specific HMM, but a substantial number of sequences that should have been positive received highly negative scores, indicating a problem with classification accuracy.

Accuracy and coverage were calculated for each algorithm from the numbers of true positives, false positives, true negatives, and false negatives detected in the plant-specific test set. Performance of several of the alternative algorithms improved when cutoff values were optimized for the set of examples being tested, but accuracy and coverage were still substantially better with our plant-specific HMM.
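
As an illustration of how accuracy and coverage can be derived from the four outcome counts at a given score cutoff, here is a minimal Python sketch. The formulas are assumptions, since this page does not state the exact definitions used in reference [1]: accuracy is taken as the fraction of all test sequences classified correctly, and coverage as the fraction of true positive sequences that are detected. The names `Counts`, `confusion_counts`, `accuracy`, and `coverage` are hypothetical and are not part of the prediction server.

```python
from typing import NamedTuple, Sequence


class Counts(NamedTuple):
    """Outcome counts for one algorithm at one cutoff (hypothetical helper)."""
    tp: int  # positives scored at or above the cutoff
    fp: int  # negatives scored at or above the cutoff
    tn: int  # negatives scored below the cutoff
    fn: int  # positives scored below the cutoff


def confusion_counts(scores: Sequence[float],
                     labels: Sequence[bool],
                     cutoff: float) -> Counts:
    """Classify each score as positive when score >= cutoff, then tally outcomes."""
    tp = fp = tn = fn = 0
    for score, is_positive in zip(scores, labels):
        predicted_positive = score >= cutoff
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    return Counts(tp, fp, tn, fn)


def accuracy(c: Counts) -> float:
    """ASSUMED definition: fraction of all sequences classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def coverage(c: Counts) -> float:
    """ASSUMED definition: fraction of true positives that were detected."""
    return c.tp / (c.tp + c.fn)


# Made-up scores and labels, for illustration only (not data from the paper).
example_scores = [5.0, 2.0, -1.0, 0.5, -3.0]
example_labels = [True, True, True, False, False]
example_counts = confusion_counts(example_scores, example_labels, cutoff=0.0)
```

Re-running `confusion_counts` over a range of cutoffs and keeping the value that gives the best accuracy is one simple way to mimic the cutoff optimization mentioned above, although the procedure actually used in the paper may differ.
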
The superior accuracy of the plant-specific HMM was maintained even when the test set was pruned to remove amino acid similarity among the test sequences.

Receiver Operating Characteristic (ROC) analysis was used to estimate the likelihood of a negative sequence receiving a higher score than a positive one. Based on the test sequences used in this analysis, that probability is about 3% for the plant-specific HMM, but greater than 10% for the alternative methods tested.

Cross-validation and jackknife (leave-one-out) testing were performed using non-overlapping training and test sets that were progressively pruned to eliminate amino acid sequence redundancy. Results indicate 98.5% jackknife detection of the sequences used to train the final model, and cross-validation accuracy of better than 96%, even when test and training sequences were pruned to share no more than 40% amino acid similarity.
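
The ROC-derived probability quoted above corresponds to one minus the area under the ROC curve. It can be estimated directly by comparing every negative score with every positive score. The following Python sketch shows that pairwise estimate; it is an illustration under stated assumptions, not the procedure from reference [1]. The function name `prob_negative_outscores_positive` is hypothetical, and counting a tie as half a win is a common convention that is assumed here.

```python
from typing import Sequence


def prob_negative_outscores_positive(pos_scores: Sequence[float],
                                     neg_scores: Sequence[float]) -> float:
    """Estimate P(score of a random negative > score of a random positive).

    Every (negative, positive) pair is compared; ties count as half a win
    (an assumed convention). The result equals 1 - AUC for the same scores.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for neg in neg_scores:
        for pos in pos_scores:
            if neg > pos:
                wins += 1.0
            elif neg == pos:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up scores, for illustration only (not data from the paper).
example_positives = [4.0, 2.5, 1.0]
example_negatives = [1.5, -2.0]
example_probability = prob_negative_outscores_positive(example_positives,
                                                       example_negatives)
```

In this toy example only one of the six negative/positive pairs has the negative score higher, so the estimate is 1/6. The pairwise loop is quadratic in the number of sequences, which is acceptable for test sets of modest size; a rank-based calculation would be preferable for very large sets.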