Tuning hyperparameters#
Three approaches to hyperparameter tuning:
Manual tuning
Grid search: build a grid of hyperparameter combinations and try each one in turn; inefficient
Random search: sample random combinations of values from the hyperparameter grid
Automated tuning: Bayesian optimization and similar methods
Tuning is often an extremely time-consuming process.
We split the data into training and validation sets in order to evaluate each set of parameters.
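The difference between the first two strategies can be sketched with scikit-learn's `ParameterGrid` and `ParameterSampler` (the parameter names and values below are just illustrative, not the grid used later in this notebook):

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {
    'num_leaves': [15, 31, 63],
    'learning_rate': [0.01, 0.05, 0.1],
}

# Grid search enumerates every combination: 3 x 3 = 9 candidates to evaluate.
grid_candidates = list(ParameterGrid(param_grid))
print(len(grid_candidates))  # 9

# Random search draws only a fixed number of combinations from the same grid,
# so its cost is controlled by n_iter rather than by the grid size.
random_candidates = list(ParameterSampler(param_grid, n_iter=4, random_state=50))
print(len(random_candidates))  # 4
```

With more hyperparameters the grid size grows multiplicatively, which is why random search usually scales better.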
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

N_FOLDS = 5
MAX_EVALS = 5

# Load a 16,000-row sample and keep only the numeric columns
features = pd.read_csv('data/application_train.csv')
features = features.sample(n=16000)
features = features.select_dtypes('number')

# Extract the binary target, then drop the ID and target columns from the features
labels = np.array(features['TARGET'].astype(np.int32))
features = features.drop(columns=['SK_ID_CURR', 'TARGET'])

# Hold out 6,000 rows as a test set
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=6000, random_state=50)
CV#
train_set = lgb.Dataset(data=train_features, label=train_labels)
test_set = lgb.Dataset(data=test_features, label=test_labels)
model = lgb.LGBMClassifier()
default_params = model.get_params()
default_params
{'boosting_type': 'gbdt',
'class_weight': None,
'colsample_bytree': 1.0,
'importance_type': 'split',
'learning_rate': 0.1,
'max_depth': -1,
'min_child_samples': 20,
'min_child_weight': 0.001,
'min_split_gain': 0.0,
'n_estimators': 100,
'n_jobs': None,
'num_leaves': 31,
'objective': None,
'random_state': None,
'reg_alpha': 0.0,
'reg_lambda': 0.0,
'subsample': 1.0,
'subsample_for_bin': 200000,
'subsample_freq': 0}
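Note that `get_params()` comes from the sklearn wrapper, so the dict contains keys the native `lgb.cv` API does not understand; that is what produces the `Unknown parameter: importance_type` warnings in the log below. A small cleanup sketch (using an abbreviated copy of the dict printed above; `n_estimators` is dropped because `lgb.cv` controls rounds via `num_boost_round`):

```python
# Abbreviated version of the default_params dict shown above
default_params = {
    'boosting_type': 'gbdt', 'learning_rate': 0.1, 'n_estimators': 100,
    'num_leaves': 31, 'importance_type': 'split', 'objective': None,
}

# Remove sklearn-wrapper-only keys before passing the dict to lgb.cv
cv_params = {k: v for k, v in default_params.items()
             if k not in ('importance_type', 'n_estimators')}
print(sorted(cv_params))
```

The warnings are harmless (LightGBM simply ignores unknown keys), so this step is optional.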
help(lgb.cv)
Help on function cv in module lightgbm.engine:
cv(params: Dict[str, Any], train_set: lightgbm.basic.Dataset, num_boost_round: int = 100, folds: Union[Iterable[Tuple[numpy.ndarray, numpy.ndarray]], sklearn.model_selection._split.BaseCrossValidator, NoneType] = None, nfold: int = 5, stratified: bool = True, shuffle: bool = True, metrics: Union[str, List[str], NoneType] = None, feval: Union[Callable[[numpy.ndarray, lightgbm.basic.Dataset], Tuple[str, float, bool]], Callable[[numpy.ndarray, lightgbm.basic.Dataset], List[Tuple[str, float, bool]]], List[Union[Callable[[numpy.ndarray, lightgbm.basic.Dataset], Tuple[str, float, bool]], Callable[[numpy.ndarray, lightgbm.basic.Dataset], List[Tuple[str, float, bool]]]]], NoneType] = None, init_model: Union[str, pathlib.Path, lightgbm.basic.Booster, NoneType] = None, fpreproc: Optional[Callable[[lightgbm.basic.Dataset, lightgbm.basic.Dataset, Dict[str, Any]], Tuple[lightgbm.basic.Dataset, lightgbm.basic.Dataset, Dict[str, Any]]]] = None, seed: int = 0, callbacks: Optional[List[Callable]] = None, eval_train_metric: bool = False, return_cvbooster: bool = False) -> Dict[str, Union[List[float], lightgbm.engine.CVBooster]]
Perform the cross-validation with given parameters.
Parameters
----------
params : dict
Parameters for training. Values passed through ``params`` take precedence over those
supplied via arguments.
train_set : Dataset
Data to be trained on.
num_boost_round : int, optional (default=100)
Number of boosting iterations.
folds : generator or iterator of (train_idx, test_idx) tuples, scikit-learn splitter object or None, optional (default=None)
If generator or iterator, it should yield the train and test indices for each fold.
If object, it should be one of the scikit-learn splitter classes
(https://scikit-learn.org/stable/modules/classes.html#splitter-classes)
and have ``split`` method.
This argument has highest priority over other data split arguments.
nfold : int, optional (default=5)
Number of folds in CV.
stratified : bool, optional (default=True)
Whether to perform stratified sampling.
shuffle : bool, optional (default=True)
Whether to shuffle before splitting data.
metrics : str, list of str, or None, optional (default=None)
Evaluation metrics to be monitored while CV.
If not None, the metric in ``params`` will be overridden.
feval : callable, list of callable, or None, optional (default=None)
Customized evaluation function.
Each evaluation function should accept two parameters: preds, eval_data,
and return (eval_name, eval_result, is_higher_better) or list of such tuples.
preds : numpy 1-D array or numpy 2-D array (for multi-class task)
The predicted values.
For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes].
If custom objective function is used, predicted values are returned before any transformation,
e.g. they are raw margin instead of probability of positive class for binary task in this case.
eval_data : Dataset
A ``Dataset`` to evaluate.
eval_name : str
The name of evaluation function (without whitespace).
eval_result : float
The eval result.
is_higher_better : bool
Is eval result higher better, e.g. AUC is ``is_higher_better``.
To ignore the default metric corresponding to the used objective,
set ``metrics`` to the string ``"None"``.
init_model : str, pathlib.Path, Booster or None, optional (default=None)
Filename of LightGBM model or Booster instance used for continue training.
fpreproc : callable or None, optional (default=None)
Preprocessing function that takes (dtrain, dtest, params)
and returns transformed versions of those.
seed : int, optional (default=0)
Seed used to generate the folds (passed to numpy.random.seed).
callbacks : list of callable, or None, optional (default=None)
List of callback functions that are applied at each iteration.
See Callbacks in Python API for more information.
eval_train_metric : bool, optional (default=False)
Whether to display the train metric in progress.
The score of the metric is calculated again after each training step, so there is some impact on performance.
return_cvbooster : bool, optional (default=False)
Whether to return Booster models trained on each fold through ``CVBooster``.
Note
----
A custom objective function can be provided for the ``objective`` parameter.
It should accept two parameters: preds, train_data and return (grad, hess).
preds : numpy 1-D array or numpy 2-D array (for multi-class task)
The predicted values.
Predicted values are returned before any transformation,
e.g. they are raw margin instead of probability of positive class for binary task.
train_data : Dataset
The training dataset.
grad : numpy 1-D array or numpy 2-D array (for multi-class task)
The value of the first order derivative (gradient) of the loss
with respect to the elements of preds for each sample point.
hess : numpy 1-D array or numpy 2-D array (for multi-class task)
The value of the second order derivative (Hessian) of the loss
with respect to the elements of preds for each sample point.
For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes],
and grad and hess should be returned in the same format.
Returns
-------
eval_results : dict
History of evaluation results of each metric.
The dictionary has the following format:
{'valid metric1-mean': [values], 'valid metric1-stdv': [values],
'valid metric2-mean': [values], 'valid metric2-stdv': [values],
...}.
If ``return_cvbooster=True``, also returns trained boosters wrapped in a ``CVBooster`` object via ``cvbooster`` key.
If ``eval_train_metric=True``, also returns the train metric history.
In this case, the dictionary has the following format:
{'train metric1-mean': [values], 'valid metric1-mean': [values],
'train metric2-mean': [values], 'valid metric2-mean': [values],
...}.
cv_results = lgb.cv(
default_params,
train_set,
metrics = 'auc',
num_boost_round=10000,
nfold=N_FOLDS,
callbacks=[
lgb.early_stopping(stopping_rounds=200), # stop automatically if AUC has not improved for 200 consecutive rounds
]
)
[LightGBM] [Warning] Unknown parameter: importance_type
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.007028 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9966
[LightGBM] [Info] Number of data points in the train set: 8000, number of used features: 93
... (the same log block repeats for each of the 5 CV folds)
[LightGBM] [Info] Start training from score 0.077125
[LightGBM] [Info] Start training from score 0.077125
[LightGBM] [Info] Start training from score 0.077125
[LightGBM] [Info] Start training from score 0.077125
[LightGBM] [Info] Start training from score 0.077000
Training until validation scores don't improve for 200 rounds
Did not meet early stopping. Best iteration is:
[14] valid's auc: 0.706029 + 0.0223023
cv_results
{'valid auc-mean': [0.65323335141481,
0.6775583880504371,
0.6817425581169888,
0.6873531227080611,
0.6898062112949138,
0.6939104626960922,
0.6957445684624461,
0.6966996317241883,
0.699616249256568,
0.6992604166394905,
0.7009808396041705,
0.7013666166946605,
0.7032957304365313,
0.7060287134596595],
'valid auc-stdv': [0.024272781101888612,
0.01812150742035603,
0.024102039428268655,
0.019996301858114098,
0.021825004316099888,
0.020519177494757675,
0.025953258778170885,
0.021477480378160406,
0.02144838863473038,
0.018759257870372713,
0.01978071914621832,
0.019988042613973512,
0.0220677550289575,
0.022302328855071645]}
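The best score and its boosting round can be read straight off this dict. A minimal sketch, using an abbreviated stand-in for the full `valid auc-mean` history printed above:

```python
# Abbreviated cv_results, standing in for the full output above
cv_results = {
    'valid auc-mean': [0.653, 0.678, 0.706],
    'valid auc-stdv': [0.024, 0.018, 0.022],
}

# AUC is higher-better, so the best score is the maximum of the mean history;
# its (1-based) position is the boosting round to reuse as n_estimators later.
best_auc = max(cv_results['valid auc-mean'])
best_iteration = cv_results['valid auc-mean'].index(best_auc) + 1
print(best_auc, best_iteration)  # 0.706 3
```

On the real run above this recovers the values reported by early stopping: best iteration 14 with a mean validation AUC of about 0.706.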