# Hunt the papertiger from boosting to XGBoost, intuitively, mathematically, implementably

We will use this simple data set [1] in all our tutorials. If we use the 4th column as the label, the 3rd column will be the feature, and vice versa.

 (1)

## Boosting

The loss function of gradient boosting is defined as

 (2)

Caveat: To avoid heavy notation, we omit the summation symbol.

For a binary classification problem, we can define the odds as

 (3)

and the probability as

 (4)
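To make the relationship between odds and probability concrete, here is a minimal sketch. It assumes the standard definitions, odds = p / (1 - p) and p = e^F / (1 + e^F) with F = log(odds); the function names are illustrative and not taken from the tutorial.

```python
import math


def odds_from_probability(p: float) -> float:
    """Odds of the positive class, assuming odds = p / (1 - p)."""
    return p / (1.0 - p)


def probability_from_log_odds(log_odds: float) -> float:
    """Invert F = log(odds): p = e^F / (1 + e^F), i.e. the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


p = 0.8
odds = odds_from_probability(p)               # about 4: four-to-one in favour
log_odds = math.log(odds)                     # the quantity boosting will update
p_back = probability_from_log_odds(log_odds)  # recovers the original probability
```

Working in log-odds keeps the model output unbounded, like a regression target, while the logistic function maps it back into (0, 1).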

You might wonder why we define it this way. As the derivation below shows, this definition makes the result consistent with the regression case.

With some simple algebra,

 (5)

We can define our loss function as cross entropy, such that

 (6)

in which

 (7)
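The cross-entropy loss can be checked numerically. The sketch below assumes the usual binary form, L = -[y log p + (1 - y) log(1 - p)], with p obtained from the log-odds F through the logistic function. Under that assumption the loss simplifies to -yF + log(1 + e^F), and the code compares the two forms. All names are illustrative.

```python
import math


def sigmoid(f: float) -> float:
    """Probability of the positive class for log-odds f."""
    return 1.0 / (1.0 + math.exp(-f))


def cross_entropy(y: int, f: float) -> float:
    """Binary cross entropy expressed through the probability p."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def cross_entropy_log_odds(y: int, f: float) -> float:
    """The same loss after substituting p = e^f / (1 + e^f)."""
    return -y * f + math.log(1.0 + math.exp(f))


# Largest disagreement between the two forms over a few (label, log-odds) pairs.
max_gap = max(
    abs(cross_entropy(y, f) - cross_entropy_log_odds(y, f))
    for y in (0, 1)
    for f in (-2.0, 0.0, 1.5)
)
```

The log-odds form is convenient because it is a smooth function of a single unbounded variable, which is what the later derivatives are taken against.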

We want to find the value that minimizes the loss; in symbols,

 (8)

We could work directly on Equation 6 with gradient descent or a closed-form solution, such that

 (9)

However, this will be quite complex.

We can use a Taylor series to approximate the loss function; you should convince yourself that this makes things simpler, such that

 (10)

Caveat: Two kinds of derivatives appear here, each taken with respect to a different variable.

With Equation 9, setting

 (11)

we can solve to obtain

 (12)
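The idea of replacing the loss with a second-order Taylor expansion and then minimizing the resulting quadratic can be sketched as follows. This is an illustration under assumptions, not the tutorial's exact derivation: the loss is taken to be cross entropy in terms of the log-odds `f`, the update is a single scalar `gamma` for one sample, and the derivatives are estimated by finite differences so the sketch does not depend on any closed form.

```python
import math


def loss(y: int, f: float) -> float:
    """Cross entropy written in terms of the log-odds f (assumed form)."""
    return -y * f + math.log(1.0 + math.exp(f))


def derivatives(y: int, f: float, eps: float = 1e-4):
    """Central finite differences for dL/df and d2L/df2."""
    g = (loss(y, f + eps) - loss(y, f - eps)) / (2.0 * eps)
    h = (loss(y, f + eps) - 2.0 * loss(y, f) + loss(y, f - eps)) / eps**2
    return g, h


def taylor_loss(y: int, f: float, gamma: float) -> float:
    """Second-order approximation: L(f + gamma) ~ L(f) + g*gamma + 0.5*h*gamma^2."""
    g, h = derivatives(y, f)
    return loss(y, f) + g * gamma + 0.5 * h * gamma**2


def newton_step(y: int, f: float) -> float:
    """Setting d/dgamma of the quadratic to zero gives gamma = -g / h."""
    g, h = derivatives(y, f)
    return -g / h


y, f = 1, 0.0
gamma = newton_step(y, f)       # step that minimizes the quadratic approximation
improved = loss(y, f + gamma)   # true loss after taking that step
```

The quadratic has a closed-form minimizer, which is what makes the approximation simpler than minimizing the original loss directly.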

With Equation 6, the first order derivative with respect to can be calculated as

 (13)

with some illustration

The second derivative of with respect to is

 (14)
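The two derivatives can be sanity-checked numerically. The sketch below assumes they are taken with respect to the log-odds F and that the loss is binary cross entropy, in which case the standard results are p - y for the first derivative and p(1 - p) for the second. The function names are illustrative.

```python
import math


def sigmoid(f: float) -> float:
    return 1.0 / (1.0 + math.exp(-f))


def loss(y: int, f: float) -> float:
    """Cross entropy in terms of the log-odds f (assumed form)."""
    return -y * f + math.log(1.0 + math.exp(f))


def gradient(y: int, f: float) -> float:
    """Assumed closed form of dL/df: predicted probability minus label."""
    return sigmoid(f) - y


def hessian(y: int, f: float) -> float:
    """Assumed closed form of d2L/df2: p * (1 - p), independent of the label."""
    p = sigmoid(f)
    return p * (1.0 - p)


def numeric_gradient(y: int, f: float, eps: float = 1e-5) -> float:
    return (loss(y, f + eps) - loss(y, f - eps)) / (2.0 * eps)


def numeric_hessian(y: int, f: float, eps: float = 1e-4) -> float:
    return (loss(y, f + eps) - 2.0 * loss(y, f) + loss(y, f - eps)) / eps**2


# Worst-case disagreement between closed forms and finite differences.
points = [(y, f) for y in (0, 1) for f in (-1.0, 0.5, 2.0)]
grad_gap = max(abs(gradient(y, f) - numeric_gradient(y, f)) for y, f in points)
hess_gap = max(abs(hessian(y, f) - numeric_hessian(y, f)) for y, f in points)
```

Note that the first derivative is just the negative residual y - p, which mirrors the regression case where the gradient of squared error is also a residual.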

## Reference

1. Dana D. Sleep Data: Personal Sleep Data from Sleep Cycle iOS App. Kaggle. https://www.kaggle.com/danagerous/sleep-data#