Wednesday, December 23, 2015

XGBoost Linear Regression output incorrect

I am a newbie to XGBoost, so pardon my ignorance. Here is the Python code:
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'x':[1,2,3], 'y':[10,20,30]})
X_train = df.drop('y',axis=1)
Y_train = df['y']
T_train_xgb = xgb.DMatrix(X_train, Y_train)

params = {"objective": "reg:linear"}
gbm = xgb.train(dtrain=T_train_xgb,params=params)
Y_pred = gbm.predict(xgb.DMatrix(pd.DataFrame({'x':[4,5]})))
print(Y_pred)
The output is:
[ 24.126194  24.126194]
As you can see, the input data is simply a straight line, so the output I expect is [40, 50]. What am I doing wrong here?

Answer:

It seems that XGBoost uses regression trees as base learners by default. XGBoost (and gradient boosting in general) works by combining an ensemble of these base learners. Regression trees cannot extrapolate patterns beyond the training data, so in your case any input above 3 or below 1 will not be predicted correctly. Your model is trained to predict outputs for inputs in the interval [1, 3]: an input higher than 3 will be given the same output as 3, and an input less than 1 will be given the same output as 1.
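You can see this clamping directly by scoring inputs well outside the training range. A quick check, assuming the gbm model trained above is still in scope (the x values here are just illustrative):

test = xgb.DMatrix(pd.DataFrame({'x': [0, 4, 40, 400]}))
# Every x >= 3 follows the same path through every tree, so 4, 40 and 400
# all get the same prediction; x=0 gets the same prediction as x=1.
print(gbm.predict(test))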
Additionally, regression trees do not really see your data as a straight line. They are non-parametric models, which means they can in principle fit any shape, including shapes far more complicated than a straight line. Roughly speaking, a regression tree works by assigning a new input to one of the regions of training points it saw during training, and producing an output based on the points in that region.
This is in contrast to parametric regressors (like linear regression), which look for the best parameters of a hyperplane (a straight line in your case) to fit your data. Linear regression does see your data as a straight line, with a slope and an intercept.
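For comparison, here is a minimal sketch with scikit-learn's LinearRegression (not part of the original question) fitted on the same three points. Because it fits a slope and an intercept, it extrapolates exactly as you expected:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})
lr = LinearRegression().fit(df[['x']], df['y'])
# A parametric model recovers y = 10x and happily extrapolates beyond x=3:
print(lr.predict(pd.DataFrame({'x': [4, 5]})))  # approximately [40. 50.]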
You can change the base learner of your XGBoost model to a GLM (generalised linear model) by adding a booster parameter to your model params:
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'x':[1,2,3], 'y':[10,20,30]})
X_train = df.drop('y',axis=1)
Y_train = df['y']
T_train_xgb = xgb.DMatrix(X_train, Y_train)

params = {"objective": "reg:linear", "booster":"gblinear"}
gbm = xgb.train(dtrain=T_train_xgb,params=params)
Y_pred = gbm.predict(xgb.DMatrix(pd.DataFrame({'x':[4,5]})))
print(Y_pred)
In general, to debug why your XGBoost model is behaving in a particular way, inspect the trained model's dump:
gbm.get_dump()
If your base learner is a linear model, the get_dump output is:
['bias:\n4.49469\nweight:\n7.85942\n']
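If I read this dump correctly, the whole model is a single straight line: a new input x is scored as roughly bias + weight * x (plus the global base_score, which defaults to 0.5), so x = 4 gives about 0.5 + 4.49469 + 7.85942 * 4 ≈ 36.4. That is much closer to the expected 40, and more boosting rounds should push it closer still.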
In your code above, since you used tree base learners, the output will be:
['0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=2.85\n\t\t4:leaf=5.85\n\t2:leaf=8.85\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=1.995\n\t\t4:leaf=4.095\n\t2:leaf=6.195\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=1.3965\n\t\t4:leaf=2.8665\n\t2:leaf=4.3365\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.97755\n\t\t4:leaf=2.00655\n\t2:leaf=3.03555\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.684285\n\t\t4:leaf=1.40458\n\t2:leaf=2.12489\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.478999\n\t\t4:leaf=0.983209\n\t2:leaf=1.48742\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.3353\n\t\t4:leaf=0.688247\n\t2:leaf=1.04119\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.23471\n\t\t4:leaf=0.481773\n\t2:leaf=0.728836\n',
 '0:[x<3] yes=1,no=2,missing=1\n\t1:[x<2] yes=3,no=4,missing=3\n\t\t3:leaf=0.164297\n\t\t4:leaf=0.337241\n\t2:leaf=0.510185\n',
 '0:[x<2] yes=1,no=2,missing=1\n\t1:leaf=0.115008\n\t2:[x<3] yes=3,no=4,missing=3\n\t\t3:leaf=0.236069\n\t\t4:leaf=0.357129\n']
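Reading any one of these trees for x = 4: the root test [x<3] is false, so the input falls to node 2 (or its right subtree, in the last tree) and receives that leaf's value. The final prediction is the sum of these leaf values across all ten trees, plus the base score. Since every input with x >= 3 follows exactly the same paths, every such input produces exactly the same prediction, which is why you saw 24.126194 twice.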
Tip: I actually prefer to use the xgb.XGBRegressor or xgb.XGBClassifier classes, since they follow the scikit-learn API. And because scikit-learn has so many machine learning algorithm implementations, using XGBoost through its scikit-learn interface keeps it from disturbing my workflow.
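Here is the same gblinear model written against that wrapper (a minimal sketch; the parameters are passed through to the booster just like in xgb.train):

import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# Same model as above, but with scikit-learn fit/predict semantics:
model = xgb.XGBRegressor(objective="reg:linear", booster="gblinear")
model.fit(df[['x']], df['y'])
print(model.predict(pd.DataFrame({'x': [4, 5]})))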
