XGBoost Explained

Looking Back to Look Forward

Neville
1 min read · Aug 30, 2023

XGBoost[1] and the later LightGBM[2] remain the de facto industry standard for tackling many real-life machine learning problems on tabular data, such as CTR prediction, weather forecasting, and fraud detection, even in the era of deep learning.

Efficient algorithms and careful system design share the credit for XGBoost's remarkable performance.
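Concretely, [1] trains an additive ensemble of trees against a regularized objective. Writing $f_k$ for the $k$-th tree, $T$ for its number of leaves, and $w$ for its leaf weights:

$$\mathcal{L} = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$

The $\Omega$ term is what point 1a below refers to: it penalizes both the number of leaves and the magnitude of the leaf weights.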

1. Algorithm Side

a. stronger regularization to limit the complexity of the learned trees

b. shrinkage: the weights of each newly added tree are scaled by a factor η (the learning rate) at every boosting step

c. column sampling to further mitigate overfitting

d. approximate split finding: candidate split points are proposed per feature from (weighted) quantiles, either globally (proposed once per tree) or locally (re-proposed after each split)

e. sparsity-aware split finding, which learns a default direction for missing values at each node (all five points are illustrated in the sketch after this list)
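As a minimal sketch of how these five knobs surface in XGBoost's Python API; the data here is synthetic and the parameter values are illustrative assumptions, not tuned recommendations:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # inject missing values to exercise (e)

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)

params = {
    "objective": "binary:logistic",
    # (a) regularization limiting tree complexity
    "lambda": 1.0,            # L2 penalty on leaf weights
    "alpha": 0.1,             # L1 penalty on leaf weights
    "gamma": 0.5,             # minimum loss reduction required to split
    "max_depth": 4,
    # (b) shrinkage: each new tree's output is scaled by eta
    "eta": 0.1,
    # (c) column sampling to further mitigate overfitting
    "colsample_bytree": 0.8,
    # (d) approximate split finding via quantile-based proposals
    "tree_method": "approx",
}

# (e) sparsity-aware split finding needs no flag: a default direction
# is learned at every node, and NaN values follow it.
booster = xgb.train(params, dtrain, num_boost_round=100)
```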

2. System Design

a. storing data in in-memory blocks in compressed column (CSC) format, with each column pre-sorted by feature value

b. cache-aware access, using a prefetch buffer to reduce cache misses when accumulating gradient statistics

c. out-of-core computation using block compression and block sharding, which makes training on billions of rows feasible on a single machine (see the CSC sketch below)
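To see why the CSC layout helps, here is a rough sketch using scipy.sparse as a stand-in for XGBoost's internal blocks (the per-row gradients and the prefix-sum scan are toy stand-ins for the real gradient statistics):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = rng.normal(size=(8, 4))
dense[dense < 0] = 0.0          # zero out entries to make the matrix sparse
csc = sparse.csc_matrix(dense)

# In CSC, all nonzeros of column j sit in one contiguous slice, which is
# exactly the access pattern of per-feature split enumeration.
j = 2
start, end = csc.indptr[j], csc.indptr[j + 1]
col_values = csc.data[start:end]    # nonzero feature values in column j
col_rows = csc.indices[start:end]   # the rows those values belong to

# XGBoost's blocks additionally keep each column pre-sorted by value, so
# gradient statistics accumulate in one linear pass over the column.
order = np.argsort(col_values)
grad = rng.normal(size=dense.shape[0])       # toy per-row gradients
g_left = np.cumsum(grad[col_rows[order]])    # running G_left per candidate split
```

XGBoost's Python API also accepts scipy sparse matrices directly (e.g. xgb.DMatrix(csc)), so sparse tabular data never needs to be densified.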

References

[1] Chen, Tianqi, and Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

[2] Ke, Guolin, et al. "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." Advances in Neural Information Processing Systems 30 (2017).
