Loss landscapes have been actively studied for parametric models such as deep neural networks, offering theoretical and practical insights. However, another popular class of machine learning algorithms, ensembles of decision trees (e.g., GBDT), lacks such analysis due to the complex nature of training and the absence of parameters optimized by gradient descent. To overcome this challenge, we consider an optimization problem in which the leaf weights of a fixed set of trees are optimized by gradient descent, which reveals several surprising phenomena about tree-based ensembles. First, we show that optimizing the leaf weights of decision trees by gradient descent from a random initialization attains the same or better test performance than the originally trained models while uncovering a new set of weights. Furthermore, we identify that the intrinsic dimension, i.e., the smallest number of parameters sufficient to reach a solution, is much smaller than the number of leaves, and optimizing in this reduced space often yields performance superior to that of the trained ensemble. By comparing intrinsic dimensions across ensembles, we find that models with decision trees of various depths preserve the high quality of the solution while offering a significant training speedup. Finally, contrary to the common belief that the first trees of gradient boosting are more powerful than the last ones, we argue that all trees are created equal for GBDT instances trained with gradient descent. This has a profound implication: different ensembles explore different families of decision trees, and once the right family has been chosen by the model, it is trivial to set the optimal weights for the trees. This latter result suggests new ways of designing ensembles and elucidates differences between state-of-the-art decision tree models.
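
To make the underlying optimization problem concrete, the following minimal sketch (not the authors' implementation) re-learns the leaf weights of a fixed ensemble by plain gradient descent on a squared-error loss. The leaf-assignment matrix, the targets, and all sizes below are hypothetical illustration data; in practice the leaf assignments would come from an already trained GBDT model.

import numpy as np

# Hypothetical problem sizes and data for illustration only.
rng = np.random.default_rng(0)
n_samples, n_trees, n_leaves = 1000, 50, 8

# leaf_index[i, t] = index of the leaf that sample i reaches in tree t
# (in practice obtained from a trained ensemble, not generated randomly).
leaf_index = rng.integers(0, n_leaves, size=(n_samples, n_trees))
y = rng.normal(size=n_samples)                        # regression targets

# Random starting point for the leaf weights, one weight per leaf per tree.
w = rng.normal(scale=0.1, size=(n_trees, n_leaves))

lr = 0.5
for step in range(200):
    # Ensemble prediction: sum over trees of the weight of the leaf each sample falls into.
    pred = w[np.arange(n_trees), leaf_index].sum(axis=1)
    resid = pred - y                                  # dL/dpred for the 0.5*MSE loss
    # Gradient w.r.t. each leaf weight: accumulate residuals of the samples in that leaf.
    grad = np.zeros_like(w)
    for t in range(n_trees):
        np.add.at(grad[t], leaf_index[:, t], resid)
    w -= lr * grad / n_samples                        # gradient-descent update

final_mse = np.mean((w[np.arange(n_trees), leaf_index].sum(axis=1) - y) ** 2)
print("final train MSE:", final_mse)

The intrinsic-dimension experiments described above can be viewed as the same procedure with w constrained to a low-dimensional subspace (optimizing d parameters that are linearly mapped onto all leaf weights); that variant is omitted here for brevity.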