+3 votes
in Programming Languages by (14.6k points)
I am using XGBoost on a dataset with class imbalance, i.e. the number of class 0 records is more than 100 times the number of class 1 records. Without using scale_pos_weight, I am getting poor results for true positives (TP). What value of scale_pos_weight should I use?

1 Answer

+3 votes
by (48.9k points)
edited by

The short answer to this question is "it depends on the data". No single value is suitable for all datasets.

According to XGBoost's documentation, for a binary classification problem:

scale_pos_weight = number of negative (majority) class records / number of positive (minority) class records.

In your case, scale_pos_weight = number of class 0 records / number of class 1 records.

However, if your data is highly imbalanced, the above formula might not give you the best results. Sometimes sqrt(number of class 0 records / number of class 1 records) provides better results, because it damps the weight when the ratio is extreme.
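As a quick sketch of the two candidate values (the label array and class counts below are made up for illustration, with class 1 as the minority/positive class):

```python
import numpy as np

# Hypothetical labels: 1000 negatives (class 0), 8 positives (class 1)
y = np.array([0] * 1000 + [1] * 8)

neg, pos = np.bincount(y)        # counts of class 0 and class 1
ratio = neg / pos                # the documented heuristic: 1000 / 8 = 125.0
sqrt_ratio = np.sqrt(neg / pos)  # damped alternative for extreme imbalance

print(ratio)       # 125.0
print(sqrt_ratio)  # about 11.18
```

In a case this imbalanced, 125 may over-weight the positive class, which is why trying the square-root value (or searching the range between them) can be worthwhile.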

In my opinion, one should run a grid search to find the optimal value of scale_pos_weight. Without scale_pos_weight, when the number of class 0 records is very high compared to the number of class 1 records, you get poor recall [tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives]. So, in GridSearchCV, use recall as the scoring parameter; the grid search will then find the value of scale_pos_weight that returns the best recall.

Here is a template for the grid-search code:

import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# data: feature matrix; label: binary target (0 = majority, 1 = minority)
max_spw = int(np.sum(label == 0) / np.sum(label == 1))

model = xgb.XGBClassifier()
xgb_grid_params = {
    'scale_pos_weight': list(range(1, max_spw, 5))
}
gs = GridSearchCV(model, param_grid=xgb_grid_params,
                  scoring="recall", cv=5, verbose=1)
gs.fit(data, label)
print(gs.best_params_)


...