Abstract: Five main types of histone proteins (H3, H4, H2A, H2B and H1) play key roles in genome compaction and in the regulation of genome function. Specific subfamilies of histones, known as histone variants, are present in many organisms and serve specific functional purposes. The amino acid sequences of histone variants may differ considerably or by just a few residues; variants may also be characterized by specific sequence motifs or additional domains. This variability makes it challenging to develop algorithms that can detect and classify different histone subfamilies across a wide spectrum of species. The HistoneDB 2.0 database was previously developed to explore the variability of histone sequences. For the update of this database we developed an improved algorithm for the automatic mining and classification of histones. Automatic classification is complicated by the large number of functionally distinct, species-specific histone variants that are similar at the sequence level. In the first stage of the new classification algorithm we use hidden Markov models (HMMs) trained on curated alignments of the histone globular regions for each histone type. The HMMs are used to extract sequences from the NCBI nr protein database and assign them to histone types. The sequences grouped by histone type are then subdivided into histone variant subfamilies through BLAST searches against curated sets of histone variants. In the next stage the classification is refined by examining variant-specific motifs and variant-specific domains within the sequences. The updated histone database is expected to contain additional histone sequences and to provide a more sensitive algorithm for detecting the variant-specific features that are valuable for classification. This research was supported by Russian Science Foundation grant #18-74-10006.
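
The three-stage scheme described in the abstract (HMM-based typing, BLAST-based variant assignment, motif-based refinement) can be illustrated with a minimal sketch. This is not the HistoneDB implementation: the function and variable names, the score cutoff, and the single H2A.X C-terminal motif pattern are all assumptions made for illustration, and the HMM and BLAST scores are assumed to have been computed beforehand by external tools rather than inside this code.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical cutoff: HMM scores below it are treated as "not a histone".
HMM_SCORE_CUTOFF = 20.0

# Hypothetical motif table, keyed by histone type and then by variant.
# The H2A.X pattern is an illustrative C-terminal motif, not a curated rule.
VARIANT_MOTIFS: Dict[str, Dict[str, "re.Pattern[str]"]] = {
    "H2A": {"H2A.X": re.compile(r"SQ[ED][YFIL]$")},
}


@dataclass
class Classification:
    histone_type: Optional[str]
    variant: Optional[str]


def best_key(scores: Dict[str, float]) -> Optional[str]:
    """Return the key with the highest score, or None for an empty dict."""
    return max(scores, key=lambda k: scores[k]) if scores else None


def classify(
    sequence: str,
    hmm_scores: Dict[str, float],
    blast_scores: Dict[str, Dict[str, float]],
) -> Classification:
    """Assign a histone type and variant to one protein sequence.

    hmm_scores:   precomputed score per histone-type HMM (assumed input).
    blast_scores: precomputed best BLAST score per variant, grouped by
                  histone type (assumed input).
    """
    # Stage 1: pick the histone type whose HMM scores best, if above cutoff.
    histone_type = best_key(hmm_scores)
    if histone_type is None or hmm_scores[histone_type] < HMM_SCORE_CUTOFF:
        return Classification(None, None)

    # Stage 2: within that type, pick the variant with the best BLAST score.
    variant = best_key(blast_scores.get(histone_type, {}))

    # Stage 3: refine. Here a matching variant-specific motif overrides
    # the BLAST assignment (one possible refinement rule among many).
    for name, pattern in VARIANT_MOTIFS.get(histone_type, {}).items():
        if pattern.search(sequence):
            variant = name
            break

    return Classification(histone_type, variant)
```

For example, a toy sequence ending in `SQEY` whose best BLAST hit is a canonical H2A would be reassigned to H2A.X by the motif stage, while a sequence whose best HMM score falls below the cutoff is left unclassified. A real pipeline would need curated motif and domain definitions for each variant and a rule for the case where BLAST and motif evidence disagree.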