tuning hyperparameters#

Four approaches to hyperparameter tuning:

  1. Manual tuning

  2. Grid search: build a grid of hyperparameter combinations and try each one in turn; inefficient

  3. Random search: sample combinations at random from the hyperparameter grid

  4. Automated tuning: Bayesian optimization, etc.

Tuning is often an extremely time-consuming process.

We split the data into training and validation sets to evaluate each set of parameters.
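The random search idea above can be sketched in a few lines: pick one value at random for each hyperparameter and collect a handful of candidate configurations to score. A minimal sketch with a hypothetical search space (the grid values here are illustrative, not the ones tuned below); in practice each candidate would be scored with `lgb.cv`:

```python
import random

# Hypothetical search space for illustration
param_grid = {
    'num_leaves': [15, 31, 63, 127],
    'learning_rate': [0.005, 0.01, 0.05, 0.1],
    'min_child_samples': [10, 20, 50],
}

random.seed(50)

def sample_params(grid):
    # Pick one value at random for each hyperparameter
    return {name: random.choice(values) for name, values in grid.items()}

# Draw 5 random candidate configurations
candidates = [sample_params(param_grid) for _ in range(5)]
```

Unlike grid search, the cost is fixed by the number of draws rather than the size of the grid, which is why random search scales better as the number of hyperparameters grows.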

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

N_FOLDS = 5
MAX_EVALS = 5

# Sample 16,000 rows and keep numeric columns only
features = pd.read_csv('data/application_train.csv')
features = features.sample(n=16000)
features = features.select_dtypes('number')

# Extract the label, then drop the ID and label columns from the features
labels = np.array(features['TARGET'].astype(np.int32))
features = features.drop(columns=['SK_ID_CURR', 'TARGET'])

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=6000, random_state=50)

CV#

train_set = lgb.Dataset(data=train_features, label=train_labels)
test_set = lgb.Dataset(data=test_features, label=test_labels)
model = lgb.LGBMClassifier()
default_params = model.get_params()
default_params
{'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'importance_type': 'split',
 'learning_rate': 0.1,
 'max_depth': -1,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_split_gain': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'num_leaves': 31,
 'objective': None,
 'random_state': None,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0}
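Note that `get_params()` returns arguments of the scikit-learn wrapper class, some of which the core `lgb.cv` engine does not recognize; this is what triggers the `Unknown parameter: importance_type` warnings in the output below. A minimal sketch of filtering such keys out before passing the dict on (the dict literal here is an abbreviated copy of the output above):

```python
# Abbreviated copy of LGBMClassifier().get_params() shown above
default_params = {
    'boosting_type': 'gbdt',
    'importance_type': 'split',  # sklearn-wrapper-only argument
    'learning_rate': 0.1,
    'n_estimators': 100,
    'num_leaves': 31,
}

# Drop the wrapper-only key before handing the dict to lgb.cv
cv_params = {k: v for k, v in default_params.items() if k != 'importance_type'}
```

The warnings are harmless (LightGBM simply ignores the unknown key), so this filtering is optional tidiness rather than a required fix.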
help(lgb.cv)
Help on function cv in module lightgbm.engine:

cv(params: Dict[str, Any], train_set: lightgbm.basic.Dataset, num_boost_round: int = 100, folds: Union[Iterable[Tuple[numpy.ndarray, numpy.ndarray]], sklearn.model_selection._split.BaseCrossValidator, NoneType] = None, nfold: int = 5, stratified: bool = True, shuffle: bool = True, metrics: Union[str, List[str], NoneType] = None, feval: Union[Callable[[numpy.ndarray, lightgbm.basic.Dataset], Tuple[str, float, bool]], Callable[[numpy.ndarray, lightgbm.basic.Dataset], List[Tuple[str, float, bool]]], List[Union[Callable[[numpy.ndarray, lightgbm.basic.Dataset], Tuple[str, float, bool]], Callable[[numpy.ndarray, lightgbm.basic.Dataset], List[Tuple[str, float, bool]]]]], NoneType] = None, init_model: Union[str, pathlib.Path, lightgbm.basic.Booster, NoneType] = None, fpreproc: Optional[Callable[[lightgbm.basic.Dataset, lightgbm.basic.Dataset, Dict[str, Any]], Tuple[lightgbm.basic.Dataset, lightgbm.basic.Dataset, Dict[str, Any]]]] = None, seed: int = 0, callbacks: Optional[List[Callable]] = None, eval_train_metric: bool = False, return_cvbooster: bool = False) -> Dict[str, Union[List[float], lightgbm.engine.CVBooster]]
    Perform the cross-validation with given parameters.
    
    Parameters
    ----------
    params : dict
        Parameters for training. Values passed through ``params`` take precedence over those
        supplied via arguments.
    train_set : Dataset
        Data to be trained on.
    num_boost_round : int, optional (default=100)
        Number of boosting iterations.
    folds : generator or iterator of (train_idx, test_idx) tuples, scikit-learn splitter object or None, optional (default=None)
        If generator or iterator, it should yield the train and test indices for each fold.
        If object, it should be one of the scikit-learn splitter classes
        (https://scikit-learn.org/stable/modules/classes.html#splitter-classes)
        and have ``split`` method.
        This argument has highest priority over other data split arguments.
    nfold : int, optional (default=5)
        Number of folds in CV.
    stratified : bool, optional (default=True)
        Whether to perform stratified sampling.
    shuffle : bool, optional (default=True)
        Whether to shuffle before splitting data.
    metrics : str, list of str, or None, optional (default=None)
        Evaluation metrics to be monitored while CV.
        If not None, the metric in ``params`` will be overridden.
    feval : callable, list of callable, or None, optional (default=None)
        Customized evaluation function.
        Each evaluation function should accept two parameters: preds, eval_data,
        and return (eval_name, eval_result, is_higher_better) or list of such tuples.
    
            preds : numpy 1-D array or numpy 2-D array (for multi-class task)
                The predicted values.
                For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes].
                If custom objective function is used, predicted values are returned before any transformation,
                e.g. they are raw margin instead of probability of positive class for binary task in this case.
            eval_data : Dataset
                A ``Dataset`` to evaluate.
            eval_name : str
                The name of evaluation function (without whitespace).
            eval_result : float
                The eval result.
            is_higher_better : bool
                Is eval result higher better, e.g. AUC is ``is_higher_better``.
    
        To ignore the default metric corresponding to the used objective,
        set ``metrics`` to the string ``"None"``.
    init_model : str, pathlib.Path, Booster or None, optional (default=None)
        Filename of LightGBM model or Booster instance used for continue training.
    fpreproc : callable or None, optional (default=None)
        Preprocessing function that takes (dtrain, dtest, params)
        and returns transformed versions of those.
    seed : int, optional (default=0)
        Seed used to generate the folds (passed to numpy.random.seed).
    callbacks : list of callable, or None, optional (default=None)
        List of callback functions that are applied at each iteration.
        See Callbacks in Python API for more information.
    eval_train_metric : bool, optional (default=False)
        Whether to display the train metric in progress.
        The score of the metric is calculated again after each training step, so there is some impact on performance.
    return_cvbooster : bool, optional (default=False)
        Whether to return Booster models trained on each fold through ``CVBooster``.
    
    Note
    ----
    A custom objective function can be provided for the ``objective`` parameter.
    It should accept two parameters: preds, train_data and return (grad, hess).
    
        preds : numpy 1-D array or numpy 2-D array (for multi-class task)
            The predicted values.
            Predicted values are returned before any transformation,
            e.g. they are raw margin instead of probability of positive class for binary task.
        train_data : Dataset
            The training dataset.
        grad : numpy 1-D array or numpy 2-D array (for multi-class task)
            The value of the first order derivative (gradient) of the loss
            with respect to the elements of preds for each sample point.
        hess : numpy 1-D array or numpy 2-D array (for multi-class task)
            The value of the second order derivative (Hessian) of the loss
            with respect to the elements of preds for each sample point.
    
    For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes],
    and grad and hess should be returned in the same format.
    
    Returns
    -------
    eval_results : dict
        History of evaluation results of each metric.
        The dictionary has the following format:
        {'valid metric1-mean': [values], 'valid metric1-stdv': [values],
        'valid metric2-mean': [values], 'valid metric2-stdv': [values],
        ...}.
        If ``return_cvbooster=True``, also returns trained boosters wrapped in a ``CVBooster`` object via ``cvbooster`` key.
        If ``eval_train_metric=True``, also returns the train metric history.
        In this case, the dictionary has the following format:
        {'train metric1-mean': [values], 'valid metric1-mean': [values],
        'train metric2-mean': [values], 'valid metric2-mean': [values],
        ...}.
cv_results = lgb.cv(
    default_params,
    train_set,
    metrics='auc',
    num_boost_round=10000,
    nfold=N_FOLDS,
    callbacks=[
        lgb.early_stopping(stopping_rounds=200),  # stop automatically if AUC has not improved for 200 consecutive rounds
    ]
)
[LightGBM] [Warning] Unknown parameter: importance_type
[LightGBM] [Warning] Unknown parameter: importance_type
[LightGBM] [Warning] Unknown parameter: importance_type
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.007028 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9966
[LightGBM] [Info] Number of data points in the train set: 8000, number of used features: 93
[LightGBM] [Info] Start training from score 0.077125
[LightGBM] [Info] Start training from score 0.077125
[LightGBM] [Info] Start training from score 0.077125
[LightGBM] [Info] Start training from score 0.077125
[LightGBM] [Info] Start training from score 0.077000
Training until validation scores don't improve for 200 rounds
Did not meet early stopping. Best iteration is:
[14]	valid's auc: 0.706029 + 0.0223023
cv_results
{'valid auc-mean': [0.65323335141481,
  0.6775583880504371,
  0.6817425581169888,
  0.6873531227080611,
  0.6898062112949138,
  0.6939104626960922,
  0.6957445684624461,
  0.6966996317241883,
  0.699616249256568,
  0.6992604166394905,
  0.7009808396041705,
  0.7013666166946605,
  0.7032957304365313,
  0.7060287134596595],
 'valid auc-stdv': [0.024272781101888612,
  0.01812150742035603,
  0.024102039428268655,
  0.019996301858114098,
  0.021825004316099888,
  0.020519177494757675,
  0.025953258778170885,
  0.021477480378160406,
  0.02144838863473038,
  0.018759257870372713,
  0.01978071914621832,
  0.019988042613973512,
  0.0220677550289575,
  0.022302328855071645]}
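Since `cv_results` holds one mean AUC per boosting round, the best cross-validated score and its round number can be read off directly. A minimal sketch, using a truncated copy of the `'valid auc-mean'` values above:

```python
# Truncated copy of cv_results['valid auc-mean']: one mean AUC per boosting round
auc_means = [0.653, 0.678, 0.682, 0.706]

best_score = max(auc_means)
best_round = auc_means.index(best_score) + 1  # boosting rounds are 1-indexed
```

On the full output above this recovers the same best iteration (14) and AUC (about 0.706) that the early-stopping callback reported.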