This keynote talk is devoted to the study of stochastic models described by a collection of features $X=(X_1,\ldots,X_n)$ and a response variable $Y$. Such models are widely used, e.g., in medicine and biology, where $Y$ can characterize the health state of a patient and $X$ comprises genetic and nongenetic factors. We are interested in models involving high-dimensional observations. A challenging problem is to identify the sub-collection of factors that are influential in a specified sense. There are many well-known feature selection procedures, and we mention only a few of them here: Boolean operation-based screening and testing (BOOST), the least absolute shrinkage and selection operator (LASSO), adaptive and group LASSO, the nonnegative (NN) garrote, penalized regression with the smoothly clipped absolute deviation (SCAD) penalty, least angle regression (LAR) and the generalized Dantzig selector (DS). Feature selection methods are traditionally classified into filters, wrappers and embedded methods; hybrid methods are also considered. An important subject is the development of algorithms for structured features (e.g., those with a group, tree or graph structure). One usually assumes that all features are known in advance. However, in certain models the candidate features are generated dynamically, so the number of features is unknown. Such features are called streaming features, and this gives rise to the streaming feature selection problem. Feature selection methods make it possible to improve prediction performance, to reduce computation time and to better understand the structure of the data. This research domain lies on the border between modern statistics and machine learning. Along with a survey, we concentrate on new results involving various concepts of information theory. Among the tools we mention are the transfer entropy and the Kullback-Leibler divergence. Appropriate theorems are provided to describe the asymptotic properties of the statistics under consideration.
Cross-validation and stability problems (including the choice among different stability measures) are also within the scope of the talk. Special attention is paid to applications in bioinformatics, namely, genome-wide association studies (GWAS). We also address simulation problems related to feature selection.
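
As a rough illustration of the filter-type, information-theoretic approach mentioned above, the following sketch ranks discrete features by a plug-in estimate of the mutual information $I(X_j;Y)$, i.e., the Kullback-Leibler divergence between the empirical joint distribution of $(X_j, Y)$ and the product of its marginals. The abstract does not specify any particular estimator, so the plug-in estimator, the function names `mutual_information` and `rank_features`, and the simulated genotype-like data (codes 0/1/2) are all assumptions made for illustration only.

```python
import numpy as np


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats for two discrete 1-D samples.

    Computed as the Kullback-Leibler divergence between the empirical
    joint distribution and the product of the empirical marginals.
    """
    _, xi = np.unique(np.asarray(x), return_inverse=True)
    _, yi = np.unique(np.asarray(y), return_inverse=True)
    xi, yi = xi.ravel(), yi.ravel()

    # Empirical joint distribution as a contingency table.
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()

    # Marginals and their outer product.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    product = px @ py

    # Sum only over cells with positive joint mass (0 * log 0 := 0).
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))


def rank_features(X, y):
    """Return feature indices sorted by decreasing estimated I(X_j;Y)."""
    scores = np.array(
        [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    )
    order = np.argsort(-scores)
    return order, scores


# Hypothetical simulated data: 5 features with values 0/1/2,
# where only feature 1 influences the binary response.
rng = np.random.default_rng(0)
n = 2000
X = rng.integers(0, 3, size=(n, 5))
signal = (X[:, 1] == 2).astype(int)
noise = rng.random(n) < 0.05          # flip 5% of the responses
y = np.where(noise, 1 - signal, signal)

order, scores = rank_features(X, y)
print("features ranked by estimated mutual information:", order)
```

A filter of this kind scores each feature separately, so it cannot detect factors that are influential only jointly; it is meant as a minimal baseline, not as the method developed in the talk.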