Chemical Data Visualization and Analysis with Incremental GTM: Big Data Challenge

Gaspar, H.A.; Baskin, I.I.; Marcou, G.; Horvath, D.; Varnek, A.

Authors: Gaspar H.A., Baskin I.I., Marcou G., Horvath D., Varnek A.
Journal: Journal of Chemical Information and Modeling
Volume: 55
Issue: 1
Year of publication: 2015
Publisher: American Chemical Society
Publisher location: United States
First page: 84
Last page: 94
DOI: 10.1021/ci500575y
Abstract: This paper is devoted to the analysis and visualization in 2-dimensional space of large datasets of millions of compounds using the incremental version of Generative Topographic Mapping (iGTM). The iGTM algorithm implemented in the in-house ISIDA-GTM program has been applied to a database of more than 2 million compounds combining datasets of 36 chemical suppliers and the NCI collection, encoded either by MOE descriptors or by MACCS keys. Taking advantage of the probabilistic nature of GTM, several approaches to data analysis have been proposed. To evaluate the chemical space coverage, the normalized Shannon entropy has been used. Different views of the data (property landscapes) can be obtained by mapping various physical and chemical properties (molecular weight, aqueous solubility, LogP, etc.) onto the iGTM map. The superposition of these views helps to identify the regions in the chemical space populated by compounds with a desirable physicochemical profile and to identify the suppliers providing them. Dataset similarity in the latent space has been assessed by applying different metrics (Euclidean distance, Tanimoto and Bhattacharyya coefficients) to data probability distributions based on cumulated responsibility vectors. As a complementary approach, data subsets may be compared by considering them as individual objects on a meta-GTM map built on cumulated responsibility vectors or property landscapes produced with iGTM. We believe that the iGTM methodology described in this article represents a fast and reliable way to analyze and visualize large chemical databases.
Added to the system by: Baskin Igor Iosifovich
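
The abstract names several quantities computed from cumulated responsibility vectors without giving formulas. The following is a minimal sketch of how they might be computed, assuming the usual textbook definitions: the cumulated vector is the sum of per-compound responsibilities over the map nodes, normalized to a probability distribution; the normalized Shannon entropy divides the entropy by log K for K nodes; the Tanimoto coefficient is taken in its continuous form. All function names and the toy data are hypothetical, and the normalization used in the paper itself may differ.

```python
import math

# Illustrative helpers only; names and definitions are assumptions,
# not taken from the ISIDA-GTM program.

def cumulated_responsibilities(resp_rows):
    """Sum per-compound responsibility vectors node by node, then
    normalize so the result is a probability distribution over nodes."""
    n_nodes = len(resp_rows[0])
    totals = [sum(row[k] for row in resp_rows) for k in range(n_nodes)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

def normalized_entropy(p):
    """Shannon entropy of p divided by log(K); 0 means all mass on one
    node, 1 means uniform coverage of the map."""
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h / math.log(len(p))

def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def tanimoto_coefficient(p, q):
    """Continuous Tanimoto: p.q / (p.p + q.q - p.q)."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (sum(a * a for a in p) + sum(b * b for b in q) - dot)

def bhattacharyya_coefficient(p, q):
    """Sum over nodes of sqrt(p_k * q_k); equals 1 for identical distributions."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

# Toy data: two tiny "libraries" on a 4-node map (hypothetical values).
library_a = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
library_b = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]

p_a = cumulated_responsibilities(library_a)
p_b = cumulated_responsibilities(library_b)

print("entropy A:", normalized_entropy(p_a))
print("entropy B:", normalized_entropy(p_b))
print("Euclidean:", euclidean_distance(p_a, p_b))
print("Tanimoto:", tanimoto_coefficient(p_a, p_b))
print("Bhattacharyya:", bhattacharyya_coefficient(p_a, p_b))
```

In this toy case library A spreads evenly over the map and library B sits on a single node, so the entropy separates broad from narrow coverage, while the three pairwise measures quantify how much the two distributions overlap.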
