Chemical Data Visualization and Analysis with Incremental GTM: Big Data Challenge (article)
Article published in a high-ranking journal
Citation information for this article was obtained from Web of Science and Scopus.
Article published in a journal listed in Web of Science and/or Scopus
Date of the last search for the article in external sources: 2 April 2015.
Authors:
Gaspar H.A.,
Baskin I.I.,
Marcou G.,
Horvath D.,
Varnek A.
Abstract: This paper is devoted to the analysis and visualization, in 2-dimensional space, of large datasets containing millions of compounds, using the incremental version of Generative Topographic Mapping (iGTM). The iGTM algorithm, implemented in the in-house ISIDA-GTM program, has been applied to a database of more than 2 million compounds that combines the datasets of 36 chemical suppliers and the NCI collection, encoded either by MOE descriptors or by MACCS keys. Taking advantage of the probabilistic nature of GTM, several approaches to data analysis are proposed. To evaluate chemical space coverage, the normalized Shannon entropy is used. Different views of the data (property landscapes) can be obtained by mapping various physical and chemical properties (molecular weight, aqueous solubility, LogP, etc.) onto the iGTM map. Superposing these views helps to identify the regions of chemical space populated by compounds with a desirable physicochemical profile, and to identify the suppliers providing them. Dataset similarity in the latent space has been assessed by applying different metrics (Euclidean distance, Tanimoto and Bhattacharyya coefficients) to data probability distributions based on cumulated responsibility vectors. As a complementary approach, data subsets may be compared by treating them as individual objects on a meta-GTM map built on cumulated responsibility vectors or on property landscapes produced with iGTM. We believe that the iGTM methodology described in this article represents a fast and reliable way to analyze and visualize large chemical databases.
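The coverage and similarity measures named in the abstract can be illustrated with a minimal Python sketch. It is not the ISIDA-GTM implementation. It assumes that each compound has a responsibility vector over K > 1 map nodes, that a dataset's cumulated responsibility vector is the sum of these rows rescaled to sum to 1, and that the entropy is normalized by log K. The abstract does not state the exact formulas, so the continuous form of the Tanimoto coefficient used here is also an assumption, and all function names are hypothetical.

```python
import numpy as np

def cumulated_distribution(responsibilities):
    """Sum per-compound responsibility rows, then rescale to a probability vector."""
    r = np.asarray(responsibilities, dtype=float)
    total = r.sum(axis=0)
    return total / total.sum()

def normalized_shannon_entropy(p):
    """Shannon entropy divided by log(K); assumes K > 1 nodes (normalization is an assumption)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # skip empty nodes: 0 * log(0) is taken as 0
    return float(-(nz * np.log(nz)).sum() / np.log(p.size))

def euclidean_distance(p, q):
    """Euclidean distance between two node distributions."""
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

def tanimoto(p, q):
    """Continuous Tanimoto coefficient: p.q / (p.p + q.q - p.q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    dot = p @ q
    return float(dot / (p @ p + q @ q - dot))

def bhattacharyya(p, q):
    """Bhattacharyya coefficient: sum over nodes of sqrt(p_k * q_k)."""
    return float(np.sqrt(np.asarray(p) * np.asarray(q)).sum())

# Toy data: two hypothetical datasets on a 4-node map, occupying disjoint nodes.
dataset_a = cumulated_distribution([[1, 0, 0, 0], [0, 1, 0, 0]])
dataset_b = cumulated_distribution([[0, 0, 1, 0], [0, 0, 0, 1]])
```

Under these assumptions, an entropy close to 1 indicates a dataset spread evenly over the map, while a value close to 0 indicates one concentrated in a few nodes. The two toy datasets share no nodes, so their Tanimoto and Bhattacharyya coefficients are both zero.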