Abstract: Motivation: Development of tools for identification of new thioredoxin-fold proteins as well as
other proteins belonging to superfamilies with low primary sequence conservation.
Results: We present several algorithms for identifying thioredoxin (Trx)-fold proteins containing
a conserved CxxC motif (two cysteines separated by two residues). The low conservation of primary
sequence in this protein superfamily makes conventional methods difficult to use. Therefore, we use
structural properties to build our classifiers. These structural properties include secondary structure
patterns as well as various properties of the residues in the protein sequences. We use this information
to model Trx-fold proteins via hidden Markov models, decision trees, and algorithms in the multiple-instance
learning model. In 9-fold and 12-fold jack-knife tests, some of our models performed quite
well, with high true positive and true negative rates. In addition, by combining a small number of our
classifiers, we can identify 100% of the Trx-fold proteins in these jack-knife tests with moderate false
positive rates. We also identified several candidate Trx-fold proteins in the C. jejuni, M. jannaschii, E.
coli and S. cerevisiae genomes. Since our techniques are very general, they should be applicable to
other superfamilies with low primary sequence conservation.
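
As an illustration of the CxxC motif mentioned above (two cysteines separated by any two residues), the following minimal Python sketch scans a one-letter amino-acid sequence for candidate motif positions. It is not one of the classifiers described in this work; the function name find_cxxc and the toy sequence are assumptions introduced only for this example, and a motif hit alone does not indicate a Trx fold.

```python
import re

# Lookahead pattern so that overlapping motifs (e.g. "CAACAAC") are all reported.
# C = cysteine; [A-Z]{2} = any two residues in one-letter code.
CXXC = re.compile(r"(?=(C[A-Z]{2}C))")

def find_cxxc(sequence):
    """Return a list of (0-based position, motif) for every CxxC occurrence.

    Hypothetical helper for illustration; not part of the described method.
    """
    seq = sequence.upper()
    return [(m.start(), m.group(1)) for m in CXXC.finditer(seq)]

# Toy sequence (made up for this example) containing one CxxC motif.
example = "MKTAWCGPCKLV"
print(find_cxxc(example))  # [(5, 'CGPC')]
```

Such a scan could at most serve as a cheap pre-filter that selects sequences for the structure-based classifiers; the lookahead is used so that overlapping motif occurrences are not silently skipped.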