Hyperparameter optimization across multiple models in scikit-learn

I found myself, from time to time, always bumping into a piece of code (written by someone else) to perform grid search across different models in scikit-learn and always adapting it to suit my needs, and fixing it, since it contained some already deprecated calls. I finally decided to post it here in my blog, so I can quickly find it and also to share it with whoever needs it.

The idea is pretty simple, you pass two dictionaries to a helper class: the models and the the parameters; then you call the fit method, wait until everything runs, and after you call the summary() method to have a nice DataFrame with the report for each model instance, according to the parameters.

The credit for the code below goes to Panagiotis Katsaroumpas who initially wrote it, I just fix it, since it was breaking with newer versions of scikit-learn, and also failed in Python 3. The original version is on this blog post.

import pandas as pd
import numpy as np

from sklearn.model_selection import GridSearchCV

class EstimatorSelectionHelper:

    def __init__(self, models, params):
        if not set(models.keys()).issubset(set(params.keys())):
            missing_params = list(set(models.keys()) - set(params.keys()))
            raise ValueError("Some estimators are missing parameters: %s" % missing_params)
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv=3, n_jobs=3, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print("Running GridSearchCV for %s." % key)
            model = self.models[key]
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring, refit=refit,
                              return_train_score=True)
            gs.fit(X,y)
            self.grid_searches[key] = gs    

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            print(k)
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns]

The code above defines the helper class, now you need to pass it a dictionary of models and a dictionary of parameters for each of the models.

from sklearn import datasets

breast_cancer = datasets.load_breast_cancer()
X_cancer = breast_cancer.data
y_cancer = breast_cancer.target

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC

models1 = {
    'ExtraTreesClassifier': ExtraTreesClassifier(),
    'RandomForestClassifier': RandomForestClassifier(),
    'AdaBoostClassifier': AdaBoostClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
    'SVC': SVC()
}

params1 = {
    'ExtraTreesClassifier': { 'n_estimators': [16, 32] },
    'RandomForestClassifier': { 'n_estimators': [16, 32] },
    'AdaBoostClassifier':  { 'n_estimators': [16, 32] },
    'GradientBoostingClassifier': { 'n_estimators': [16, 32], 'learning_rate': [0.8, 1.0] },
    'SVC': [
        {'kernel': ['linear'], 'C': [1, 10]},
        {'kernel': ['rbf'], 'C': [1, 10], 'gamma': [0.001, 0.0001]},
    ]
}

You create a EstimatorSelectionHelper by passing the models and the parameters, and then call the fit() function, which as signature similar to the original GridSearchCV object.

helper1 = EstimatorSelectionHelper(models1, params1)
helper1.fit(X_cancer, y_cancer, scoring='f1', n_jobs=2)

Running GridSearchCV for ExtraTreesClassifier.
Fitting 3 folds for each of 2 candidates, totalling 6 fits

Running GridSearchCV for RandomForestClassifier.
Fitting 3 folds for each of 2 candidates, totalling 6 fits

Running GridSearchCV for GradientBoostingClassifier.
Fitting 3 folds for each of 4 candidates, totalling 12 fits

Running GridSearchCV for AdaBoostClassifier.
Fitting 3 folds for each of 2 candidates, totalling 6 fits

Running GridSearchCV for SVC.
Fitting 3 folds for each of 6 candidates, totalling 18 fits

After the experiments has ran, you can inspect the results of each model and each parameters by calling the score_summary method.

helper1.score_summary(sort_by='max_score')

	estimator	min_score	mean_score	max_score	std_score	C	gamma	kernel	learning_rate	n_estimators
5	AdaBoostClassifier	0.962343	0.974907	0.991667	0.0123335	NaN	NaN	NaN	NaN	32
1	ExtraTreesClassifier	0.966387	0.973627	0.987552	0.00984908	NaN	NaN	NaN	NaN	32
4	AdaBoostClassifier	0.95279	0.966463	0.983333	0.0126727	NaN	NaN	NaN	NaN	16
3	RandomForestClassifier	0.958678	0.966758	0.979253	0.00896123	NaN	NaN	NaN	NaN	32
6	GradientBoostingClassifier	0.917031	0.947595	0.979253	0.025414	NaN	NaN	NaN	0.8	16
9	GradientBoostingClassifier	0.950413	0.962373	0.979079	0.0121747	NaN	NaN	NaN	1	32
7	GradientBoostingClassifier	0.95279	0.966317	0.975207	0.00972142	NaN	NaN	NaN	0.8	32
8	GradientBoostingClassifier	0.950413	0.962548	0.975207	0.0101286	NaN	NaN	NaN	1	16
10	SVC	0.95122	0.961108	0.975207	0.0102354	1	NaN	linear	NaN	NaN
2	RandomForestClassifier	0.953191	0.960593	0.975	0.0101888	NaN	NaN	NaN	NaN	16
0	ExtraTreesClassifier	0.958678	0.96666	0.974359	0.00640498	NaN	NaN	NaN	NaN	16
11	SVC	0.961373	0.963747	0.967213	0.00250593	10	NaN	linear	NaN	NaN
15	SVC	0.935484	0.945366	0.955466	0.00815896	10	0.0001	rbf	NaN	NaN
13	SVC	0.934959	0.946564	0.954733	0.00843008	1	0.0001	rbf	NaN	NaN
12	SVC	0.926407	0.936624	0.94958	0.00965657	1	0.001	rbf	NaN	NaN
14	SVC	0.918455	0.929334	0.940678	0.00907845	10	0.001	rbf	NaN	NaN

The full code for this blog post is available in this notebook.