XGBoost Cross Validation

The Python wrapper around XGBoost implements a scikit-learn interface, and this interface, more or less, supports the scikit-learn cross validation system. In addition, XGBoost has its own cross validation system, and the Python wrapper supports it too. In other words, we have two cross validation systems. They are partially supported, and the functionalities supported for XGBoost are not the same as for LightGBM. Currently, it's a puzzle.
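
To situate the two entry points, here is a minimal sketch of what each system looks like with plain xgboost and scikit-learn, outside of pysptools and on made-up data (the names and values are illustrative):

import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)

# 1) XGBoost's own system: xgb.cv() works on a DMatrix and returns a
#    frame of train/test metrics, one row per boosting round.
dtrain = xgb.DMatrix(X, label=y)
cv_results = xgb.cv({'max_depth': 3, 'learning_rate': 0.2}, dtrain,
                    num_boost_round=50, nfold=3)

# 2) scikit-learn's system: GridSearchCV drives the scikit-learn
#    wrapper (XGBClassifier) like any other estimator.
grid = GridSearchCV(xgb.XGBClassifier(n_estimators=50),
                    {'max_depth': [3, 5]}, cv=3)
grid.fit(X, y)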

The example presented covers both cases. The first, step_GradientBoostingCV, calls the XGBoost cross validation. The second, step_GridSearchCV, calls the scikit-learn cross validation.

The data preparation is the same as for the nbex_xgb_model.ipynb example. We take only two images to speed up the process.

The 'Tune' class manages everything. The step_GradientBoostingCV method calls the XGBoost cv() function. The step_GridSearchCV method calls the scikit-learn GridSearchCV() function.

Note that this is in development and that changes can be significant.

In [4]:
%matplotlib inline

from __future__ import print_function

import os
import os.path as osp
import numpy as np

import pysptools.ml as ml
import pysptools.skl as skl

from sklearn.model_selection import train_test_split


home_path = '/mnt'
source_path = osp.join(home_path, 'dev-data/CZ_hsdb')
result_path = None


def print_step_header(step_id, title):
    print('================================================================')
    print('{}: {}'.format(step_id, title))
    print('================================================================')
    print()


# img1
img1_scaled, img1_cmap = ml.get_scaled_img_and_class_map(source_path, result_path, 'img1', 
                          [['Snow',{'rec':(41,79,49,100)}]],
                          skl.HyperGaussianNB, None,
                          display=False)
# img2
img2_scaled, img2_cmap = ml.get_scaled_img_and_class_map(source_path, result_path, 'img2', 
                          [['Snow',{'rec':(83,50,100,79)},{'rec':(107,151,111,164)}]],
                          skl.HyperLogisticRegression, {'class_weight':{0:1.0,1:5}},
                          display=False)


def step_GradientBoostingCV(tune, update, cv_params, verbose):
    print_step_header('Step', 'GradientBoosting cross validation')
    tune.print_params('input')
    tune.step_GradientBoostingCV(update, cv_params, verbose)


def step_GridSearchCV(tune, params, title, verbose):
    print_step_header('Step', 'scikit-learn cross-validation')
    tune.print_params('input')
    tune.step_GridSearchCV(params, title, verbose)
    tune.print_params('output')

The X_train and y_train sets are built.

The Tune class is created with the HyperXGBClassifier estimator. It is then ready for cross validation; we can call the Tune methods repeatedly with different cv hypotheses.

In [5]:
verbose = False
n_shrink = 3

snow_fname = ['img1','img2']
nosnow_fname = ['imga1','imgb1','imgb6','imga7']
all_fname = snow_fname + nosnow_fname

snow_img = [img1_scaled,img2_scaled]
nosnow_img = ml.batch_load(source_path, nosnow_fname, n_shrink)

snow_cmap = [img1_cmap,img2_cmap]

M = snow_img[0]
bkg_cmap = np.zeros((M.shape[0],M.shape[1]))
    
X,y = skl.shape_to_XY(snow_img+nosnow_img, 
                      snow_cmap+[bkg_cmap,bkg_cmap,bkg_cmap,bkg_cmap])

seed = 5
train_size = 0.25
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size,
                                                    random_state=seed)

start_param = {'max_depth':10,
               'min_child_weight':1,
               'gamma':0,
               'subsample':0.8,
               'colsample_bytree':0.5,
               'scale_pos_weight':1.5}
      
# Tune can be called with HyperXGBClassifier or HyperLGBMClassifier,
# but the hyperparameters and cv parameters are different
t = ml.Tune(ml.HyperXGBClassifier, start_param, X_train, y_train)

We set a hypothesis and call the Gradient Boosting cross validation.

In [6]:
# Step 1: Fix learning rate and number of estimators for tuning tree-based parameters
step_GradientBoostingCV(t, {'learning_rate':0.2,'n_estimators':500,'silent':1},
                        {'verbose_eval':False},
                        True)
# After reading the cross validation results we manually set n_estimators
t.p_update({'n_estimators':9})
t.print_params('output')
================================================================
Step: GradientBoosting cross validation
================================================================

----------------------------------------------------------------
input parameters:

parameter           value
----------------  -------
colsample_bytree      0.5
gamma                 0
max_depth            10
min_child_weight      1
scale_pos_weight      1.5
subsample             0.8

----------------------------------------------------------------
XGBoost cross-validation tail

   test-rmse-mean  test-rmse-std  train-rmse-mean  train-rmse-std
5        0.149833       0.002526         0.138030        0.000989
6        0.128094       0.003152         0.112691        0.001023
7        0.111777       0.003751         0.092579        0.000973
8        0.100013       0.004289         0.076613        0.001092
9        0.091255       0.004772         0.063826        0.001418

----------------------------------------------------------------
output parameters:

parameter           value
----------------  -------
colsample_bytree      0.5
gamma                 0
learning_rate         0.2
max_depth            10
min_child_weight      1
n_estimators          9
scale_pos_weight      1.5
silent                1
subsample             0.8
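
As an aside, when calling xgboost's own cv() directly, the number of boosting rounds can also be picked automatically with early stopping instead of reading the tail by eye; a minimal sketch with plain xgboost, assuming the same X_train and y_train (the parameter values mirror the current hypothesis):

import xgboost as xgb

params = {'max_depth': 10, 'min_child_weight': 1, 'gamma': 0,
          'subsample': 0.8, 'colsample_bytree': 0.5,
          'scale_pos_weight': 1.5, 'learning_rate': 0.2}
dtrain = xgb.DMatrix(X_train, label=y_train)
cv_results = xgb.cv(params, dtrain, num_boost_round=500, nfold=3,
                    early_stopping_rounds=10, verbose_eval=False)
# cv() stops adding rounds once the test metric no longer improves;
# the number of rows returned is a candidate for n_estimators.
print(len(cv_results))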

The same, but this time we call the scikit-learn cross validation.

In [7]:
# Step 2: Tune max_depth and min_child_weight
step_GridSearchCV(t, {'max_depth':[24,25, 26], 'min_child_weight':[1]}, 'Step 2', True)
================================================================
Step: scikit-learn cross-validation
================================================================

----------------------------------------------------------------
input parameters:

parameter           value
----------------  -------
colsample_bytree      0.5
gamma                 0
learning_rate         0.2
max_depth            10
min_child_weight      1
n_estimators          9
scale_pos_weight      1.5
silent                1
subsample             0.8

----------------------------------------------------------------
Cross validation inputs:

n splits: 2
Shuffle: True

Parameters grid:

----------------  ------------
max_depth         [24, 25, 26]
min_child_weight  [1]
----------------  ------------

----------------------------------------------------------------
Cross validation results:

Best score: 0.992160330091

Best parameters:

----------------  --
max_depth         24
min_child_weight   1
----------------  --

All scores:

  max_depth    min_child_weight    score         std
-----------  ------------------  -------  ----------
         24                   1  0.99216  0.00011789
         25                   1  0.99216  0.00011789
         26                   1  0.99216  0.00011789

----------------------------------------------------------------
output parameters:

parameter           value
----------------  -------
colsample_bytree      0.5
gamma                 0
learning_rate         0.2
max_depth            24
min_child_weight      1
n_estimators          9
scale_pos_weight      1.5
silent                1
subsample             0.8
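
A full tuning session would typically continue with further grid-search steps for the remaining parameters, using the same helper; a sketch of such steps, not run in this notebook (the grids are illustrative):

# Step 3: Tune gamma
step_GridSearchCV(t, {'gamma': [0, 0.1, 0.2]}, 'Step 3', True)
# Step 4: Tune subsample and colsample_bytree
step_GridSearchCV(t, {'subsample': [0.6, 0.8, 1.0],
                      'colsample_bytree': [0.5, 0.7, 0.9]}, 'Step 4', True)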

Finally, the result:

In [8]:
print(t.get_p_current())
{'gamma': 0, 'subsample': 0.8, 'colsample_bytree': 0.5, 'scale_pos_weight': 1.5, 'learning_rate': 0.2, 'n_estimators': 9, 'silent': 1, 'max_depth': 24, 'min_child_weight': 1}
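
The tuned dictionary can then be fed back into the estimator for a final fit; a minimal sketch, assuming HyperXGBClassifier follows the usual scikit-learn fit/predict convention:

from sklearn.metrics import accuracy_score

# Hypothetical final fit with the tuned parameters (not part of Tune)
clf = ml.HyperXGBClassifier(**t.get_p_current())
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))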