VARIABLE SELECTION: THEORY AND APPLICATIONS

Alexander Bulinski (keynote speaker)
Faculty of Mathematics and Mechanics, Lomonosov Moscow State University, Moscow 119991, Russia

The talk is devoted to the study of models described by means of a collection of factors X = (X1, ..., Xn) and a response variable Y. Such models are widely used, e.g., in medicine and biology, where Y can characterize the health state of a patient and X comprises genetic and nongenetic factors. There are many classical models of this type considered in the well-known books on regression analysis. Recall that traditional variable selection methods include Mallows's Cp criterion, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC); the researcher chooses the best model by applying these criteria. Nowadays one can find a number of nonclassical models introduced to discover links between X and Y. Clearly, it is worth taking into account the goals of the model construction: for example, one can try to find an optimal estimate of Y, or to estimate some probabilities related to the law of Y, etc. It is important to compare different models and also to validate the proposed models by means of data and appropriate tools. Statistical and machine learning methods will be discussed.

We are interested in models involving high-dimensional observations. The challenging problem is to identify the sub-collection of influential (in a sense) factors. Several modern variable selection procedures have been proposed in the last 20 years. Here we mention only a few of them: Boolean operation-based screening and testing (BOOST), the least absolute shrinkage and selection operator (LASSO), adaptive and group LASSO, the nonnegative (NN) garrote, penalized regression with the smoothly clipped absolute deviation (SCAD) penalty, and the generalized Dantzig selector (DS). In this regard we refer, e.g., to the recent books Do et al. (2013), Hastie et al. (2015), Jiang et al. (2013).
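To make the penalized-regression idea concrete, the following is a minimal sketch of LASSO fitted by coordinate descent on a hypothetical toy data set. All names, data, and parameter values here are illustrative assumptions for exposition; this is not the procedure or data from the talk.

```python
import random

def soft_threshold(rho, lam):
    # Soft-thresholding operator: the proximal map of the L1 penalty.
    if rho < -lam:
        return rho + lam
    if rho > lam:
        return rho - lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for min_b (1/2)||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the partial residual
            # (the fit with feature j's own contribution excluded).
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * beta[k]
                                            for k in range(p) if k != j))
                      for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam) / z
    return beta

# Hypothetical toy data: y depends only on the first two of five factors.
# With a sufficiently large penalty, LASSO drives the coefficients of the
# three irrelevant factors to zero, performing variable selection.
random.seed(0)
n, p = 100, 5
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + random.gauss(0, 0.1) for row in X]
beta = lasso_cd(X, y, lam=10.0)
print([round(b, 2) for b in beta])
```

The shrinkage that makes the irrelevant coefficients vanish also biases the selected coefficients toward zero (here by roughly lam/n); this bias is what refinements such as the adaptive LASSO, the NN garrote, and the SCAD penalty mentioned above are designed to reduce.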
Bayesian methods are within the scope of our consideration as well (see, e.g., Congdon (2014)). Note that many exhaustive, stochastic and heuristic methods for detecting epistasis (in genetics) are considered in Shang et al. (2011). Special attention in the talk is paid to the multifactor dimensionality reduction (MDR) method introduced by M. Ritchie et al. and developed in a number of publications (see, e.g., Bulinski (2015), Bulinski and Rakitko (2015a) and references therein). Along with new theorems concerning the asymptotic behavior of certain statistics, we also tackle simulation problems related to variable selection (see, e.g., Bulinski and Rakitko (2015b)).

The work is supported by the Russian Science Foundation (RSF), project 14-21-00162.

References

Bulinski, A. (2015). Some statistical methods in genetics. In V. Schmidt, editor, Stochastic Geometry, Spatial Statistics and Random Fields: Models and Algorithms, volume 2120 of Lecture Notes in Mathematics, pages 293–320.

Bulinski, A. and Rakitko, A. (2015a). MDR method for nonbinary response variable. Journal of Multivariate Analysis, 135(4), 25–42.

Bulinski, A. and Rakitko, A. (2015b). Simulation and analytical approach to the identification of significant factors. Communications in Statistics: Simulation and Computation. DOI 10.1080/03610918.2014.970700.

Congdon, P. (2014). Applied Bayesian Modelling. Wiley, Chichester.

Do, K.-A., Qin, Z., and Vannucci, M., editors (2013). Advances in Statistical Bioinformatics: Models and Integrative Inference for High-Throughput Data. Cambridge University Press, New York.

Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton.

Jiang, R., Zhang, X., and Zhang, M., editors (2013). Basics of Bioinformatics: Lecture Notes of the Graduate Summer School on Bioinformatics of China. Springer, Berlin.

Shang, J., Zhang, J., Sun, Y., Liu, D., Ye, D., and Yin, Y. (2011). Performance analysis of novel methods for detecting epistasis. BMC Bioinformatics, 12, 475.