Vanilla Gradient Descent for Oblique Decision Trees
Subrat Panda1
Blaise Genest2
Arvind Easwaran1
Ponnuthurai Suganthan3
1 NTU, Singapore
2 CNRS, IPAL, France and CNRS@CREATE, Singapore
3 Qatar University, Qatar
ECAI 2024

Abstract

Decision Trees (DTs) constitute one of the major highly non-linear AI models, valued, e.g., for their efficiency on tabular data. Learning accurate DTs is, however, complicated, especially for oblique DTs, and takes significant training time. Further, DTs suffer from overfitting, e.g., they proverbially "do not generalize" in regression tasks. Recently, some works proposed ways to make (oblique) DTs differentiable. This enables highly efficient gradient-descent algorithms to be used to learn DTs. It also enables generalizing capabilities by learning regressors at the leaves simultaneously with the decisions in the tree. Prior approaches to making DTs differentiable rely either on probabilistic approximations at the tree's internal nodes (soft DTs) or on approximations in gradient computation at the internal nodes (quantized gradient descent). In this work, we propose \textit{DTSemNet}, a novel \textit{sem}antically equivalent and invertible encoding for (hard, oblique) DTs as Neural \textit{Net}works (NNs), that uses standard vanilla gradient descent. Experiments across various classification and regression benchmarks show that oblique DTs learned using \textit{DTSemNet} are more accurate than oblique DTs of similar size learned using state-of-the-art techniques. Further, DT training time is significantly reduced. We also experimentally demonstrate that \textit{DTSemNet} can learn DT policies as efficiently as NN policies in the Reinforcement Learning (RL) setup with physical inputs (dimensions $\leq32$).
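To make the object being learned concrete: in a hard oblique DT, each internal node tests a hyperplane (a linear combination of all features) rather than a single feature, and the input is routed left or right until it reaches a leaf. The sketch below shows this forward pass for a complete binary tree; the function and variable names are illustrative, not the paper's implementation.

```python
import numpy as np

def oblique_dt_predict(x, weights, biases, leaves):
    """Route input x through a complete binary oblique DT.

    weights[i], biases[i] parameterize internal node i in level-order
    indexing (children of node i are 2i+1 and 2i+2); `leaves` holds
    the leaf predictions. Illustrative sketch, not the paper's code.
    """
    n_internal = len(weights)
    i = 0
    while i < n_internal:
        # Oblique (hard) split: a hyperplane test over all features.
        go_right = np.dot(weights[i], x) + biases[i] > 0
        i = 2 * i + 1 + int(go_right)
    return leaves[i - n_internal]

# Depth-2 tree: 3 internal nodes, 4 leaves.
W = [np.array([1.0, -1.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])]
b = [0.0, 0.0, -1.0]
leaf_labels = ["A", "B", "C", "D"]
print(oblique_dt_predict(np.array([2.0, 1.0]), W, b, leaf_labels))  # → C
```

Because each routing decision is a hard threshold, this forward pass has zero gradient almost everywhere, which is exactly the obstacle that soft DTs, quantized gradients, and the DTSemNet encoding address in different ways.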

Results on Classification Datasets with Small DTs




Results on Classification Datasets with Large DTs




Results on Regression Datasets




Results on Reinforcement Learning Environments




Loss Landscape


The loss landscapes of DTSemNet and DGT on MNIST, obtained by perturbing the trained parameters along two random directions. The flatter the loss landscape, the better the generalization.
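The visualization above follows a standard recipe: pick two random directions in parameter space, and evaluate the loss on a 2-D grid of perturbations of the trained parameters along those directions. A minimal sketch of that procedure (names and the toy quadratic loss are illustrative, not from the paper):

```python
import numpy as np

def loss_surface_2d(loss_fn, theta, span=1.0, steps=5, seed=0):
    """Evaluate loss_fn on a 2-D slice around trained parameters theta,
    spanned by two random directions. Illustrative sketch."""
    rng = np.random.default_rng(seed)
    d1 = rng.standard_normal(theta.shape)
    d2 = rng.standard_normal(theta.shape)
    # Normalize so the two axes have comparable scale.
    d1 /= np.linalg.norm(d1)
    d2 /= np.linalg.norm(d2)
    coords = np.linspace(-span, span, steps)
    surface = np.array([[loss_fn(theta + a * d1 + c * d2)
                         for c in coords] for a in coords])
    return coords, surface

# Toy quadratic loss whose minimum is at theta_star.
theta_star = np.array([1.0, -2.0, 0.5])
loss = lambda th: float(np.sum((th - theta_star) ** 2))
coords, S = loss_surface_2d(loss, theta_star, span=1.0, steps=5)
print(S[2, 2])  # center of the grid = loss at the trained parameters → 0.0
```

The resulting `surface` array is what gets rendered as a contour or 3-D plot; a flatter bowl around the center corresponds to a flatter minimum, commonly associated with better generalization.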



Paper, code and other details

Subrat Panda, Blaise Genest, Arvind Easwaran and Ponnuthurai Suganthan
Vanilla Gradient Descent for Oblique Decision Trees
European Conference on Artificial Intelligence (ECAI), 2024
[PDF] [Code] [Bibtex] [Poster]