Abstract: Motivation: To facilitate accurate estimation of statistical significance of sequence similarity in profile–profile
searches, queries should ideally correspond to protein domains. For multidomain proteins, using domains as
queries depends on delineation of domain borders, which may be unknown. Thus, entire proteins are commonly used as
queries, which complicates establishing homology for similarities close to cutoff levels of statistical significance.
Results: In this article, we describe an iterative approach, called LAMPA, LArge Multidomain Protein Annotator, that
resolves the above conundrum by gradual expansion of hit coverage of multidomain proteins through re-evaluating
statistical significance of hit similarity using ever smaller queries defined at each iteration. LAMPA employs
TMHMM and HHsearch for recognition of transmembrane regions and homology, respectively. We used the Pfam database
to annotate 2985 multidomain proteins (polyproteins) composed of >1000 amino acid residues, which dominate
proteomes of RNA viruses. Under strict cutoffs, LAMPA outperformed HHsearch-mediated runs using intact
polyproteins as queries by three measures: number of and coverage by identified homologous regions, and number
of hit Pfam profiles. Compared to HHsearch, LAMPA identified 507 extra homologous regions in 14.4% of polyproteins.
This Pfam-based annotation of RNA virus polyproteins by LAMPA was also superior to RefSeq expert annotation
by two measures, region number and annotated length, for 69.3% of RNA virus polyprotein entries. We rationalized
the obtained results using computational experiments that characterized the dependence of HHsearch hit statistical
significance, for a local alignment similarity score, on the lengths and diversities of query-target pairs.
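
The iterative idea described in the Results can be illustrated with a minimal Python sketch. It is not LAMPA's implementation: the function names (`uncovered_regions`, `iterative_annotation`, `toy_search`), the `min_len` and `max_iter` parameters, and the toy significance rule are all assumptions made for illustration. No real HHsearch or TMHMM call is made, and the transmembrane-recognition step is omitted. The sketch only shows how regions left uncovered by hits can become smaller queries in the next iteration, so that a similarity missed with a long query may pass the cutoff with a shorter one.

```python
from typing import Callable, List, Tuple

Region = Tuple[int, int]  # 1-based, inclusive (start, end) on the protein


def uncovered_regions(query: Region, hits: List[Region], min_len: int) -> List[Region]:
    """Return parts of `query` not covered by `hits` that are at least `min_len` long."""
    start, end = query
    gaps: List[Region] = []
    cursor = start
    for h_start, h_end in sorted(hits):
        if h_start > cursor:
            gaps.append((cursor, h_start - 1))
        cursor = max(cursor, h_end + 1)
    if cursor <= end:
        gaps.append((cursor, end))
    return [(s, e) for s, e in gaps if e - s + 1 >= min_len]


def iterative_annotation(
    protein_len: int,
    search: Callable[[Region], List[Region]],
    min_len: int = 50,   # hypothetical lower bound on query size
    max_iter: int = 10,  # hypothetical safety limit on iterations
) -> List[Region]:
    """Expand hit coverage by re-searching with ever smaller queries."""
    queries: List[Region] = [(1, protein_len)]
    annotated: List[Region] = []
    for _ in range(max_iter):
        next_queries: List[Region] = []
        for query in queries:
            hits = search(query)  # significant hits inside this query
            if not hits:
                continue  # nothing found, so no smaller query is defined
            annotated.extend(hits)
            next_queries.extend(uncovered_regions(query, hits, min_len))
        if not next_queries:
            break
        queries = next_queries
    return sorted(annotated)


def toy_search(query: Region) -> List[Region]:
    """Stand-in for a profile search (assumption, not HHsearch):
    one strong hit is always significant, one weak hit is significant
    only when the query is at most 1000 residues long."""
    q_start, q_end = query
    q_len = q_end - q_start + 1
    hits: List[Region] = []
    if q_start <= 100 and 200 <= q_end:
        hits.append((100, 200))
    if q_start <= 600 and 700 <= q_end and q_len <= 1000:
        hits.append((600, 700))
    return hits


print(iterative_annotation(1200, toy_search))
```

In this toy setting, the weak hit is missed when the whole 1200-residue protein is the query and is recovered in the second iteration, when the region downstream of the first hit is searched on its own.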