
XGBoost with GridSearchCV, Scaling, PCA, and Early Stopping in an sklearn Pipeline

I want to combine an XGBoost model with input scaling and feature space reduction by PCA. In addition, the hyperparameters of the model as well as the number of components used in the PCA should be tuned with GridSearchCV, and early stopping should be used to prevent overfitting.

Solution 1:

The problem is that the fit method requires an evaluation set that is created externally, but we cannot create one before the pipeline has applied its transformations.

This is a bit hacky, but the idea is to create a thin wrapper around the XGBoost regressor/classifier that prepares the evaluation set internally.

from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor, XGBClassifier

class XGBoostWithEarlyStop(BaseEstimator):
    def __init__(self, early_stopping_rounds=5, test_size=0.1,
                 eval_metric='mae', **estimator_params):
        self.early_stopping_rounds = early_stopping_rounds
        self.test_size = test_size
        self.eval_metric = eval_metric
        if self.estimator is not None:
            self.set_params(**estimator_params)

    def set_params(self, **params):
        # Delegate parameter handling to the wrapped estimator so that
        # GridSearchCV can tune its hyperparameters directly.
        return self.estimator.set_params(**params)

    def get_params(self, **params):
        return self.estimator.get_params()

    def fit(self, X, y):
        # Split off a validation set *after* the upstream pipeline transforms,
        # so the evaluation data lives in the same feature space as the
        # training data.
        x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=self.test_size)
        self.estimator.fit(x_train, y_train,
                           early_stopping_rounds=self.early_stopping_rounds,
                           eval_metric=self.eval_metric, eval_set=[(x_val, y_val)])
        return self

    def predict(self, X):
        return self.estimator.predict(X)

class XGBoostRegressorWithEarlyStop(XGBoostWithEarlyStop):
    def __init__(self, *args, **kwargs):
        self.estimator = XGBRegressor()
        super(XGBoostRegressorWithEarlyStop, self).__init__(*args, **kwargs)

class XGBoostClassifierWithEarlyStop(XGBoostWithEarlyStop):
    def __init__(self, *args, **kwargs):
        self.estimator = XGBClassifier()
        super(XGBoostClassifierWithEarlyStop, self).__init__(*args, **kwargs)
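
Before wiring the wrapper into a pipeline, it can be sanity-checked on its own. The sketch below is illustrative only: the best_iteration and best_score attributes live on the underlying estimator, and their exact names and availability can vary between XGBoost versions.

from sklearn.datasets import load_diabetes

x, y = load_diabetes(return_X_y=True)

# Quick standalone check that early stopping kicks in on an internal
# validation split (attribute names may differ by XGBoost version).
model = XGBoostRegressorWithEarlyStop(n_estimators=200)
model.fit(x, y)
print(model.estimator.best_iteration)  # iteration chosen by early stopping
print(model.estimator.best_score)      # validation MAE at that iteration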

Below is a test.

from sklearn.datasets import load_diabetes
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

x, y = load_diabetes(return_X_y=True)
print(x.shape, y.shape)
# (442, 10) (442,)

pipe = Pipeline([
    ('pca', PCA(5)),
    ('xgb', XGBoostRegressorWithEarlyStop())
])

param_grid = {
    'pca__n_components': [3, 5, 7],
    'xgb__n_estimators': [10, 20, 30, 50]
}

grid = GridSearchCV(pipe, param_grid, scoring='neg_mean_absolute_error')
grid.fit(x, y)
print(grid.best_params_)
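
Since the question also asks for input scaling, a scaler can simply be prepended as another pipeline step. A minimal sketch, assuming StandardScaler (any sklearn scaler works the same way) and reusing the param_grid from above:

from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scale', StandardScaler()),   # scale inputs before PCA and XGBoost
    ('pca', PCA(5)),
    ('xgb', XGBoostRegressorWithEarlyStop())
])

grid = GridSearchCV(pipe, param_grid, scoring='neg_mean_absolute_error')
grid.fit(x, y)
print(grid.best_params_)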

If you want to send a feature request to the developers, the easiest extension to make would be to allow XGBRegressor to create the evaluation set internally when one is not provided. This way, no extension to scikit-learn is necessary (I guess).
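
As an aside, newer XGBoost releases (roughly 1.6 and later, if I recall correctly) moved early_stopping_rounds and eval_metric from fit() onto the estimator itself, so the wrapper's fit would need a small adjustment there. A hedged sketch under that assumption, using a hypothetical subclass name:

class XGBoostRegressorWithEarlyStopV2(XGBoostRegressorWithEarlyStop):
    # Sketch assuming XGBoost >= 1.6, where early_stopping_rounds and
    # eval_metric are estimator parameters rather than fit() arguments.
    def fit(self, X, y):
        x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=self.test_size)
        self.estimator.set_params(early_stopping_rounds=self.early_stopping_rounds,
                                  eval_metric=self.eval_metric)
        self.estimator.fit(x_train, y_train, eval_set=[(x_val, y_val)])
        return self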
