Comparing performance of different models and choosing the best one

Hello there folks. Let’s talk about some Machine Learning today, Supervised Learning to be precise. A couple of months back, I enrolled in the Udacity ML Advanced Nanodegree course. As part of that course, we had to choose a capstone project, and I picked the problem of predicting whether a star is a pulsar or not. In a nutshell, it’s a binary classification problem. The Kaggle dataset for this problem can be found here. In the labelled dataset, only about 10% of the stars are pulsars. Since the proportion of pulsars is so low, we don’t want the model to mistakenly classify actual pulsars as non-pulsars in the real world. So we use the F-β score as our performance metric and make it recall-heavy rather than biased towards precision; β=2 seems like a good starting point here.
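
To make the recall emphasis concrete: the F-β score is (1 + β²) · precision · recall / (β² · precision + recall), so with β=2 recall is weighted roughly twice as heavily as precision. Here’s a tiny sanity check using scikit-learn’s fbeta_score; the labels below are made up purely for illustration, not taken from the pulsar dataset.

from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy labels: 1 = pulsar, 0 = non-pulsar (illustrative only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # misses two pulsars, one false alarm

print(precision_score(y_true, y_pred))      # 0.667
print(recall_score(y_true, y_pred))         # 0.5
print(fbeta_score(y_true, y_pred, beta=2))  # ~0.526, pulled towards recall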

I chose 4 different classifiers, ranging from simple to more complex ones.

  • Logistic Regression
  • Support Vector Machine
  • AdaBoost
  • XGBoost

Here, for each classifier, we set up a hyper-parameter grid. The idea is simple: train the models by performing a grid search over the hyper-parameters, and then choose the model (with the set of hyper-parameters) that gives the best F-β score. I’ve put the relevant code snippet below. If you’re interested, you can find the complete IPython notebook in my Git Repository.

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
import warnings

warnings.filterwarnings(action='ignore', category=DeprecationWarning)

# Initialize the classifiers with a fixed random state for reproducibility
clfr_A = LogisticRegression(random_state=128)
clfr_B = SVC(random_state=128)
clfr_C = AdaBoostClassifier(random_state=128)
clfr_D = XGBClassifier()

lr_param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
svc_param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}
adb_param_grid = {'n_estimators': [50, 100, 150, 200, 250, 500], 'learning_rate': [.5, .75, 1.0, 1.25, 1.5, 1.75, 2.0]}
xgb_param_grid = {'colsample_bylevel': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0], 'learning_rate': [.5, .75, 1.0, 1.25, 1.5, 1.75, 2.0]}

# Perform a grid search for each classifier, scoring with accuracy, precision,
# recall and the recall-heavy F-beta (beta=2), and refit on the best F-beta score
clfrs = [clfr_A, clfr_B, clfr_C, clfr_D]
names = ["LogisticRegression", "SVM", "AdaBoost", "XGBoost"]
params = [lr_param_grid, svc_param_grid, adb_param_grid, xgb_param_grid]
scorer = {'acc': 'accuracy', 'F-beta': make_scorer(fbeta_score, beta=2),
          'prec_macro': 'precision_macro', 'rec_macro': 'recall_macro'}

stat_list = []

for clfr, param, name in zip(clfrs, params, names):
    grid_obj = GridSearchCV(clfr, param, cv=5, scoring=scorer, refit='F-beta', n_jobs=-1)
    # features_minmax_transform and target_raw come from the preprocessing
    # steps earlier in the notebook
    grid_fit = grid_obj.fit(features_minmax_transform, target_raw)
    best_clf = grid_fit.best_estimator_

    # Collect the mean cross-validated metrics for this classifier
    stats = {}
    stats['model'] = name
    stats['mean_acc'] = grid_fit.cv_results_["mean_test_acc"].mean()
    stats['mean_precision'] = grid_fit.cv_results_["mean_test_prec_macro"].mean()
    stats['mean_recall'] = grid_fit.cv_results_["mean_test_rec_macro"].mean()
    stats['mean_f-beta'] = grid_fit.cv_results_["mean_test_F-beta"].mean()
    stat_list.append(stats)
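
Once the loop finishes, stat_list holds one row of mean cross-validated metrics per model, so picking the winner is just a matter of sorting by the F-β score. Here’s a quick sketch of how the comparison can be summarised, assuming pandas is available:

import pandas as pd

# One row per classifier with its mean cross-validated metrics
stats_df = pd.DataFrame(stat_list).set_index('model')

# Rank the models by the recall-heavy F-beta score
stats_df = stats_df.sort_values('mean_f-beta', ascending=False)
print(stats_df)
print("Best model by mean F-beta:", stats_df.index[0])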
