Automated feature engineering#

In menu_feature_enginer.ipynb we performed the feature engineering by hand.

In this notebook we automate that process with featuretools.

Some notes on featuretools:

  • woodwork assigns logical types and semantic tags to columns:

    • for example, int64 columns can be distinguished more specifically as postal codes, phone numbers, and so on

  • featuretools then runs automated feature generation on the woodwork-typed tables

featuretools

woodwork

This notebook does not use woodwork to assign semantic types explicitly; everything is left to automatic inference.

Imports#

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import featuretools as ft
import warnings
warnings.filterwarnings('ignore')

import woodwork as ww
import gc
gc.enable()

print('ft', ft.__version__)
print('np', np.__version__)
print('pd', pd.__version__)
ft 1.31.0
np 1.26.4
pd 2.3.1
application_train = pd.read_csv('data/application_train.csv')
application_test = pd.read_csv('data/application_test.csv')
bureau = pd.read_csv('data/bureau.csv')
bureau_balance = pd.read_csv('data/bureau_balance.csv')
credit_card_balance = pd.read_csv('data/credit_card_balance.csv')
installments_payments = pd.read_csv('data/installments_payments.csv')
previous_application = pd.read_csv('data/previous_application.csv')
pos_cash_balance = pd.read_csv('data/POS_CASH_balance.csv')
application_train['set'] = 'train'
application_test['set'] = 'test'
application_test['TARGET'] = np.nan
print(application_train.shape, application_test.shape)
(307511, 123) (48744, 123)
app = pd.concat([application_train, application_test], ignore_index=True)

We run feature engineering on the training and test sets together, which is valid here since no target information is used.

Some columns should not take part in feature engineering; we set them aside and merge them back in at the end.

app_target = app[['SK_ID_CURR', 'TARGET']]
app_set = app[['SK_ID_CURR', 'set']]
app_for_es = app.drop(columns=['set', 'TARGET']) 
app_for_es.shape
(356255, 121)

Creating an EntitySet#

  • A dataframe (entity) represents a single table; it must have a column of unique values to serve as its index, e.g. SK_ID_CURR

  • An EntitySet represents a collection of tables plus the relationships between them

add_dataframe converts a pandas DataFrame into a woodwork-typed table

app_for_es['SK_ID_CURR'].unique()
array([100002, 100003, 100004, ..., 456223, 456224, 456250], dtype=int64)
es = ft.EntitySet(id='clients')

# these tables have a unique primary-key column
es = es.add_dataframe(dataframe_name="app", dataframe=app_for_es, index="SK_ID_CURR")
es = es.add_dataframe(dataframe_name="bureau", dataframe=bureau, index="SK_ID_BUREAU")
es = es.add_dataframe(dataframe_name="previous", dataframe=previous_application, index="SK_ID_PREV")

# these tables have no unique primary-key column; make_index creates one
es = es.add_dataframe(dataframe_name="bureau_balance", dataframe=bureau_balance, make_index = True, index = 'bureaubalance_index')
es = es.add_dataframe(dataframe_name="credit", dataframe=credit_card_balance, make_index=True, index='credit_index')
es = es.add_dataframe(dataframe_name="installments", dataframe=installments_payments, make_index=True, index='installments_index')
es = es.add_dataframe(dataframe_name="cash", dataframe=pos_cash_balance,make_index=True, index="cash_index")
es
Entityset: clients
  DataFrames:
    app [Rows: 356255, Columns: 121]
    bureau [Rows: 1716428, Columns: 17]
    previous [Rows: 1670214, Columns: 37]
    bureau_balance [Rows: 27299925, Columns: 4]
    credit [Rows: 3840312, Columns: 24]
    installments [Rows: 13605401, Columns: 9]
    cash [Rows: 10001358, Columns: 9]
  Relationships:
    No relationships

Adding relationships to the EntitySet#

We now link the tables with one-to-many (parent–child) relationships:

# parent dataframe name, parent column; child dataframe name, child column
es = es.add_relationship("app", "SK_ID_CURR", "bureau", "SK_ID_CURR")
es = es.add_relationship("bureau", "SK_ID_BUREAU", "bureau_balance", "SK_ID_BUREAU")

es = es.add_relationship("app", "SK_ID_CURR", "previous", "SK_ID_CURR")
es = es.add_relationship("previous", "SK_ID_PREV", "cash", "SK_ID_PREV")
es = es.add_relationship("previous", "SK_ID_PREV", "installments", "SK_ID_PREV")
es = es.add_relationship("previous", "SK_ID_PREV", "credit", "SK_ID_PREV")
es
Entityset: clients
  DataFrames:
    app [Rows: 356255, Columns: 121]
    bureau [Rows: 1716428, Columns: 17]
    previous [Rows: 1670214, Columns: 37]
    bureau_balance [Rows: 27299925, Columns: 4]
    credit [Rows: 3840312, Columns: 24]
    installments [Rows: 13605401, Columns: 9]
    cash [Rows: 10001358, Columns: 9]
  Relationships:
    bureau.SK_ID_CURR -> app.SK_ID_CURR
    bureau_balance.SK_ID_BUREAU -> bureau.SK_ID_BUREAU
    previous.SK_ID_CURR -> app.SK_ID_CURR
    cash.SK_ID_PREV -> previous.SK_ID_PREV
    installments.SK_ID_PREV -> previous.SK_ID_PREV
    credit.SK_ID_PREV -> previous.SK_ID_PREV

The EntitySet is now ready.

feature primitives#

Primitives are the operations applied to an EntitySet; they correspond to the manual feature-engineering operations we did before. There are two kinds:

  • aggregation: collapses a group of values into a single value

  • transform: returns one value per row
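As a plain-pandas analogy (a sketch on made-up data, not featuretools itself): an aggregation primitive such as MEAN collapses each child group to one value per parent, while a transform primitive such as MONTH returns one value per row.

```python
import pandas as pd

# Toy child table (hypothetical values, for illustration only)
child = pd.DataFrame({
    'parent_id': [1, 1, 2],
    'amount': [10.0, 20.0, 5.0],
    'date': pd.to_datetime(['2020-01-15', '2020-02-01', '2020-03-10']),
})

# aggregation: one value per parent group (what MEAN produces)
agg = child.groupby('parent_id')['amount'].mean()
# transform: one value per row (what MONTH produces)
trans = child['date'].dt.month

print(agg.tolist())    # [15.0, 5.0]
print(trans.tolist())  # [1, 2, 3]
```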

# built-in primitives
ft.list_primitives().head()
name type description valid_inputs return_type
0 max_min_delta aggregation Determines the difference between the max and ... <ColumnSchema (Semantic Tags = ['numeric'])> <ColumnSchema (Semantic Tags = ['numeric'])>
1 max_consecutive_negatives aggregation Determines the maximum number of consecutive n... <ColumnSchema (Logical Type = Double)>, <Colum... <ColumnSchema (Logical Type = Integer) (Semant...
2 any aggregation Determines if any value is 'True' in a list. <ColumnSchema (Logical Type = BooleanNullable)... <ColumnSchema (Logical Type = Boolean)>
3 count_inside_nth_std aggregation Determines the count of observations that lie ... <ColumnSchema (Semantic Tags = ['numeric'])> <ColumnSchema (Logical Type = Integer) (Semant...
4 trend aggregation Calculates the trend of a column over time. <ColumnSchema (Semantic Tags = ['numeric'])>, ... <ColumnSchema (Semantic Tags = ['numeric'])>

Warning

We must be clear about which primitives apply to which columns! Here are some common ones.

ft.list_primitives()[ft.list_primitives()['name'] == 'mean']
name type description valid_inputs return_type
37 mean aggregation Computes the average for a list of values. <ColumnSchema (Semantic Tags = ['numeric'])> <ColumnSchema (Semantic Tags = ['numeric'])>
ft.list_primitives()[ft.list_primitives()['name'] == 'percent_true']
name type description valid_inputs return_type
62 percent_true aggregation Determines the percent of `True` values. <ColumnSchema (Logical Type = BooleanNullable)... <ColumnSchema (Logical Type = Double) (Semanti...
ft.list_primitives()[ft.list_primitives()['name'] == 'mode']
name type description valid_inputs return_type
58 mode aggregation Determines the most commonly repeated value. <ColumnSchema (Semantic Tags = ['category'])> None
ft.list_primitives()[ft.list_primitives()['name'] == 'count']
name type description valid_inputs return_type
31 count aggregation Determines the total number of values, excludi... <ColumnSchema (Semantic Tags = ['index'])> <ColumnSchema (Logical Type = IntegerNullable)...
ft.list_primitives()[ft.list_primitives()['name'] == 'month']
name type description valid_inputs return_type
181 month transform Determines the month value of a datetime. <ColumnSchema (Logical Type = Datetime)> <ColumnSchema (Logical Type = Ordinal: [1, 2, ...
ft.list_primitives()[ft.list_primitives()['name'] == 'std']
name type description valid_inputs return_type
2 std aggregation Computes the dispersion relative to the mean v... <ColumnSchema (Semantic Tags = ['numeric'])> <ColumnSchema (Semantic Tags = ['numeric'])>

deep feature synthesis#

Deep feature synthesis (DFS) stacks aggregations across multiple levels of table relationships, producing features such as MAX(previous.MEAN(installments.payment)) that are often hard for a human to come up with by hand.

This is the same idea as the manual multi-level aggregation of the balance tables up to app in menu_feature_enginer.ipynb.
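What DFS does at depth 2 can be sketched in plain pandas on made-up data (the column names follow the real schema, but the values and tiny tables below are invented):

```python
import pandas as pd

# Hypothetical mini versions of installments and previous
installments = pd.DataFrame({
    'SK_ID_PREV': [1, 1, 2, 2],
    'AMT_PAYMENT': [100.0, 200.0, 50.0, 150.0],
})
previous = pd.DataFrame({'SK_ID_PREV': [1, 2], 'SK_ID_CURR': [10, 10]})

# depth 1: MEAN(installments.AMT_PAYMENT) per previous loan
mean_per_prev = installments.groupby('SK_ID_PREV')['AMT_PAYMENT'].mean()

# depth 2: MAX of those means per client,
# i.e. MAX(previous.MEAN(installments.AMT_PAYMENT))
prev = previous.assign(mean_payment=previous['SK_ID_PREV'].map(mean_per_prev))
max_per_client = prev.groupby('SK_ID_CURR')['mean_payment'].max()

print(max_per_client.loc[10])  # 150.0
```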

We should not chase an ever-larger set of primitives; that causes a combinatorial explosion of features (the curse of dimensionality).

By default we do not need to separate categorical and numeric columns ourselves: featuretools infers each column's type and applies the appropriate aggregations and transforms.

Consequently, this step involves no encoding either.

For categorical columns: mode (the most frequent value) and num_unique (the number of distinct values).

%%time
default_agg_primitives = ["count", "mean", "max", 'sum', 'std', "mode", "num_unique"]
default_trans_primitives =  ["month", "weekday"]

# returns the feature matrix and the feature definitions
feature_matrix, features = ft.dfs(
    entityset = es,
    target_dataframe_name = 'app', # everything is ultimately aggregated onto this table
    agg_primitives= default_agg_primitives,
    trans_primitives=default_trans_primitives,
    max_depth=2,
)
CPU times: total: 28min 23s
Wall time: 29min 6s

This is by far the most expensive step, taking roughly 30 minutes.

  • feature_matrix is an ordinary DataFrame

  • each element of features is a feature object that records how that feature was generated

    • call .get_name() to get the feature's name

Feature selection#

featuretools also provides utilities for several kinds of feature selection:

  • remove columns that are mostly null: ft.selection.remove_highly_null_features

  • remove features with only a single unique value: ft.selection.remove_single_value_features

  • remove highly correlated features: ft.selection.remove_highly_correlated_features

from featuretools import selection
feature_matrix, features = selection.remove_highly_null_features(feature_matrix, features)
print(feature_matrix.shape)
(356255, 1216)
feature_matrix, features  = selection.remove_single_value_features(feature_matrix, features)
print(feature_matrix.shape)
(356255, 1169)
%%time
feature_matrix, features = selection.remove_highly_correlated_features(feature_matrix, features)
CPU times: total: 36min 12s
Wall time: 35min 41s

Warning

Performance warning: computing the correlation matrix is an extremely expensive operation.

A further note on correlation analysis:

The Pearson correlation coefficient is only meaningful for measuring linear relationships between continuous variables; for nonlinear relationships or discrete features its results can be misleading. For example, we should not run correlation analysis on one-hot encoded categorical variables, since they are exactly linearly dependent (e.g. $\text{type}_a + \text{type}_b = 1$).
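A quick numeric check of this point on made-up one-hot data: two mutually exclusive dummy columns satisfy type_a + type_b = 1, so their Pearson correlation is exactly -1.

```python
import numpy as np

type_a = np.array([1, 0, 1, 1, 0, 0])
type_b = 1 - type_a  # exact linear dependence: type_a + type_b = 1

# Pearson correlation of the two dummy columns
corr = np.corrcoef(type_a, type_b)[0, 1]
print(corr)  # -1.0
```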

print(feature_matrix.shape)
(356255, 589)
feature_matrix.head()
NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY NAME_TYPE_SUITE NAME_INCOME_TYPE ... STD(credit.previous.NFLAG_INSURED_ON_APPROVAL) STD(credit.previous.SELLERPLACE_AREA) SUM(credit.previous.AMT_ANNUITY) SUM(credit.previous.AMT_APPLICATION) SUM(credit.previous.AMT_DOWN_PAYMENT) SUM(credit.previous.DAYS_FIRST_DRAWING) SUM(credit.previous.DAYS_LAST_DUE) SUM(credit.previous.DAYS_TERMINATION) SUM(credit.previous.RATE_DOWN_PAYMENT) SUM(credit.previous.SELLERPLACE_AREA)
SK_ID_CURR
100002 Cash loans M False True 0 202500.0 406597.5 24700.5 Unaccompanied Working ... NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
100003 Cash loans F False False 0 270000.0 1293502.5 35698.5 Family State servant ... NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
100004 Revolving loans M True True 0 67500.0 135000.0 6750.0 Unaccompanied Working ... NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
100006 Cash loans F False True 0 135000.0 312682.5 29686.5 Unaccompanied Working ... 0.0 0.0 81000.0 1620000.0 0.0 2191458.0 2191458.0 2191458.0 0.0 -6.0
100007 Cash loans M False True 0 121500.0 513000.0 21865.5 Unaccompanied Working ... NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 589 columns

feature_matrix.to_parquet("ft_feature_matrix.parquet")
ft.save_features(features, "ft_feature_definitions.json")

The computation above was expensive, so we save the feature matrix to disk.

feature_matrix.shape
(356255, 1233)
feature_matrix.head()
FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION ... MODE(credit.previous.PRODUCT_COMBINATION)_POS mobile without interest MODE(credit.previous.PRODUCT_COMBINATION)_POS other with interest MODE(credit.previous.PRODUCT_COMBINATION)_POS others without interest MODE(credit.previous.WEEKDAY_APPR_PROCESS_START)_FRIDAY MODE(credit.previous.WEEKDAY_APPR_PROCESS_START)_MONDAY MODE(credit.previous.WEEKDAY_APPR_PROCESS_START)_SATURDAY MODE(credit.previous.WEEKDAY_APPR_PROCESS_START)_SUNDAY MODE(credit.previous.WEEKDAY_APPR_PROCESS_START)_THURSDAY MODE(credit.previous.WEEKDAY_APPR_PROCESS_START)_TUESDAY MODE(credit.previous.WEEKDAY_APPR_PROCESS_START)_WEDNESDAY
SK_ID_CURR
100002 False True 0 202500.0 406597.5 24700.5 0.018801 -9461 -637 -3648.0 ... False False False False False False False False False False
100003 False False 0 270000.0 1293502.5 35698.5 0.003541 -16765 -1188 -1186.0 ... False False False False False False False False False False
100004 True True 0 67500.0 135000.0 6750.0 0.010032 -19046 -225 -4260.0 ... False False False False False False False False False False
100006 False True 0 135000.0 312682.5 29686.5 0.008019 -19005 -3039 -9833.0 ... False False False False False False False True False False
100007 False True 0 121500.0 513000.0 21865.5 0.028663 -19932 -3038 -4311.0 ... False False False False False False False False False False

5 rows × 1233 columns

Note

Is feature selection (high-null, single-value, high-correlation) always necessary?

  • For tree models (e.g. LightGBM, XGBoost), strict feature selection is not essential: trees are naturally robust to collinear features, missing values, and redundant features.

  • For linear models (e.g. logistic regression, ridge), feature selection is critical, especially removing highly correlated (collinear) features.

model#

Loading the data#

After feature selection, we can split train and test back apart.

feature_matrix = pd.read_parquet("ft_feature_matrix.parquet")

None of the steps above involved any encoding, so we encode just before modeling; this also saves some storage.

feature_matrix = pd.get_dummies(feature_matrix)
final_fm = feature_matrix.reset_index()
final_fm = pd.merge(final_fm, app_target, on='SK_ID_CURR', how='left')
final_fm = pd.merge(final_fm, app_set, on='SK_ID_CURR', how='left')

train = final_fm[final_fm['set'] == 'train']
test = final_fm[final_fm['set'] == 'test']

train, test = train.align(test, join = 'inner', axis = 1)
train = train.drop(columns=['set'])
test = test.drop(columns = ['TARGET', 'set'])
print(train.shape, test.shape)
(307511, 1235) (48744, 1234)
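align(join='inner', axis=1) keeps only the columns present in both frames, which is what drops dummy columns that appear in only one of train/test. A tiny sketch with hypothetical columns:

```python
import pandas as pd

a = pd.DataFrame({'x': [1], 'y': [2]})     # has a "train-only" column 'x'
b = pd.DataFrame({'y': [3], 'z': [4]})     # has a "test-only" column 'z'

# keep only the shared columns, in both frames
a2, b2 = a.align(b, join='inner', axis=1)
print(list(a2.columns), list(b2.columns))  # ['y'] ['y']
```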
train.to_feather('checkpoints/05_train_merged_ft.feather')
test.to_feather('checkpoints/05_test_merged_ft.feather')
del final_fm,feature_matrix,features,es
gc.collect()
0
train = pd.read_feather('checkpoints/05_train_merged_ft.feather')
test = pd.read_feather('checkpoints/05_test_merged_ft.feather')

train_labels = train['TARGET']
train_ids = train['SK_ID_CURR']
test_ids = test['SK_ID_CURR']
train_features = train.drop(columns=['TARGET', 'SK_ID_CURR'])
test_features = test.drop(columns=['SK_ID_CURR'])
print(train.shape, test.shape)
(307511, 1235) (48744, 1234)

hgbm#

%%time
from sklearn.ensemble import HistGradientBoostingClassifier

hist_gradient_boost_model = HistGradientBoostingClassifier(
    max_iter = 100, # number of boosting iterations (trees)
    learning_rate = 0.1,
    max_depth = 5,
)
hist_gradient_boost_model.fit(train_features, train_labels)
CPU times: total: 9min 6s
Wall time: 1min 10s
HistGradientBoostingClassifier(max_depth=5)
from sklearn.metrics import  roc_curve, roc_auc_score

train_prob = hist_gradient_boost_model.predict_proba(train_features)

fpr, tpr, thresholds = roc_curve(train_labels, train_prob[:, 1])
auc = roc_auc_score(train_labels, train_prob[:, 1])

plt.figure(figsize=(3,3))
plt.plot(fpr, tpr, color='blue', lw=2)
plt.title(f'hist gb Roc curve, auc={auc:.3f}')
Text(0.5, 1.0, 'hist gb Roc curve, auc=0.812')
../../_images/59f374f15bd99e1215542aa6ef45a5c91b5809d203e467222231a9d86fc39978.png
import time
import os

def submit(ids, pred, name, feature_count=None):
    """
    ids: SK_ID_CURR values of the test set
    pred: predicted probabilities from the model
    name: experiment tag (e.g. 'lgb_v1', 'baseline')
    feature_count: optional, number of features the model used
    """
    # 1. build the submission DataFrame
    submit_df = pd.DataFrame({
        'SK_ID_CURR': ids,
        'TARGET': pred
    })

    # 2. timestamp (format: 0213_1530)
    timestamp = time.strftime("%m%d_%H%M")
    
    # 3. construct the file name
    # format: 0213_1530_lgb_v1_f542.csv
    f_str = f"_f{feature_count}" if feature_count else ""
    filename = f"{timestamp}_{name}{f_str}.csv"
    
    # 4. make sure the output directory exists (optional)
    if not os.path.exists('submissions'):
        os.makedirs('submissions')
    
    save_path = os.path.join('submissions', filename)
    
    # 5. save and return
    submit_df.to_csv(save_path, index=False)
    
    return submit_df
hist_gradient_boost_model_pred = hist_gradient_boost_model.predict_proba(test_features)
submit_df = submit(test['SK_ID_CURR'], hist_gradient_boost_model_pred[:, 1], 
    name='hgbm_baseline',
    feature_count=train_features.shape[1]
    )
submit_df.head()
SK_ID_CURR TARGET
307511 100001 0.047845
307512 100005 0.114776
307513 100013 0.023606
307514 100028 0.038234
307515 100038 0.167165

The submission scores about 0.76, close to the manual feature-engineering result, with a similar number of features.

lgbm#

LightGBM rejects feature names containing special JSON characters (spaces, parentheses, etc.), so we sanitize the column names first.

import re
# 1. define the cleaning function
def clean_names(df):
    # replace every run of characters that is not a letter, digit, or
    # underscore with an underscore; the regex [^A-Za-z0-9_] matches spaces,
    # slashes, parentheses, and all other special characters
    df.columns = [re.sub(r'[^A-Za-z0-9_]+', '_', col) for col in df.columns]
    # collapse repeated underscores (e.g. __) and strip leading/trailing ones
    df.columns = [re.sub(r'_+', '_', col).strip('_') for col in df.columns]
    return df
train_features = clean_names(train_features)
test_features = clean_names(test_features)
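A quick sanity check of the cleaning logic on hypothetical generated names (the function is restated here so the snippet is self-contained):

```python
import re
import pandas as pd

def clean_names(df):
    # same logic as above: squash special characters, collapse underscores
    df.columns = [re.sub(r'[^A-Za-z0-9_]+', '_', col) for col in df.columns]
    df.columns = [re.sub(r'_+', '_', col).strip('_') for col in df.columns]
    return df

# hypothetical DFS-style column names
demo = pd.DataFrame(columns=['MEAN(bureau.DAYS_CREDIT)',
                             'MODE(previous.NAME_CONTRACT_TYPE)_Cash loans'])
print(list(clean_names(demo).columns))
# ['MEAN_bureau_DAYS_CREDIT', 'MODE_previous_NAME_CONTRACT_TYPE_Cash_loans']
```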

from lightgbm import LGBMClassifier
lgbm_model = LGBMClassifier(
    n_estimators=100,      # number of trees (counterpart of max_iter above)
    learning_rate=0.1,     # learning rate
    max_depth=3,           # maximum tree depth
    random_state=42,       # reproducibility
    n_jobs=-1              # use all CPU cores
)
lgbm_model.fit(train_features, train_labels)
[LightGBM] [Info] Number of positive: 24825, number of negative: 282686
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.679221 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 90425
[LightGBM] [Info] Number of data points in the train set: 307511, number of used features: 1088
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080729 -> initscore=-2.432486
[LightGBM] [Info] Start training from score -2.432486
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
LGBMClassifier(max_depth=3, n_jobs=-1, random_state=42)
features_importance = pd.DataFrame(
    {
        'importance': lgbm_model.feature_importances_,
        'feature': lgbm_model.feature_name_
    }
)
features_importance_plot = features_importance.sort_values(by='importance', ascending=False).head(20)

plt.figure(figsize=(8, 6), dpi=100) 
sns.barplot(data=features_importance_plot, x='importance', y='feature')

plt.yticks(fontsize=7) # fine-tune the tick label size
plt.title('Feature Importance', fontsize=14)
plt.tight_layout()
../../_images/a733ed47e9dca88db427aa90b4f8842b15e6db4e90d0e8be7140f5a8700d627f.png

As we can see, the result is almost identical to the manual feature engineering, which is to be expected.

The reasoning is essentially the same; one approach is automated, the other manual.