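Manual Feature Engineering#

A unified process for feature aggregation and feature selection

Introduction#

Manual feature engineering is tedious and usually depends on domain knowledge. Our approach is to feed in as many features as possible and let the model work out which ones matter. We also use some automated feature tools, as well as dimensionality reduction such as PCA.

The first notebook built its features from the application dataset alone. To improve the score, we now bring in features from the other tables.

This notebook describes a unified process for feature aggregation and feature selection that requires little domain knowledge.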

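We rely heavily on the following pandas operations:

  • groupby: split the rows into groups

  • agg: compute statistics for each group

  • merge: join the aggregated features back onto another table

  • rename: rename columns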
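The four operations above usually appear together. A minimal sketch on made-up data follows; the `loans` and `clients` frames and all of their column names are hypothetical and are not taken from the competition tables.

```python
import pandas as pd

# Hypothetical child table: several loans per client.
loans = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 2],
    "loan_id": [10, 11, 12, 13, 14],
    "amount": [100.0, 300.0, 50.0, 150.0, 100.0],
})
# Hypothetical parent table: one row per client (client 3 has no loans).
clients = pd.DataFrame({"client_id": [1, 2, 3]})

# groupby + count, then rename the counted column.
loan_counts = (
    loans.groupby("client_id", as_index=False)["loan_id"]
    .count()
    .rename(columns={"loan_id": "loan_count"})
)

# groupby + agg with several statistics, then prefix the statistic names.
amount_stats = loans.groupby("client_id")["amount"].agg(["min", "mean", "max", "sum"])
amount_stats.columns = [f"amount_{stat}" for stat in amount_stats.columns]
amount_stats = amount_stats.reset_index()

# merge: a left join keeps every client, with NaN where no loans exist.
features = (
    clients.merge(loan_counts, on="client_id", how="left")
    .merge(amount_stats, on="client_id", how="left")
)
print(features)
```

The left join is what keeps clients without any child rows in the result; their aggregated columns are simply missing.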

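The work passes through many feature stages, so we have to save intermediate results. We store them as feather files and tell them apart by name, using the pattern [stage]_[source table]_[processing state]:

  • 01_app_train_raw.feather: stage 01, the raw main table; only one-hot encoding of categorical columns has been applied

  • 02_prev_app_agg.feather: stage 02, an aggregated child table; contains the agg features

  • 03_bureau_cleaned.feather: stage 03, a child table after feature selection; columns with many missing values or high correlation have been removed

  • 04_merged_v1.feather: stage 04, child tables merged into the main table

  • 04_main_bureau_prev_combined.feather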
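A small helper can keep such names consistent. The sketch below only builds the path; the function name `checkpoint_path` is hypothetical, and the `checkpoints` directory is assumed from the paths used in the code cells of this notebook.

```python
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")  # assumed directory name

def checkpoint_path(stage: int, table: str, state: str) -> Path:
    """Build '[stage]_[table]_[state].feather' with a zero-padded stage number."""
    return CHECKPOINT_DIR / f"{stage:02d}_{table}_{state}.feather"

# Example: the aggregated bureau table from stage 02.
print(checkpoint_path(2, "bureau", "agg").name)
```

A DataFrame could then be written with `df.to_feather(checkpoint_path(2, "bureau", "agg"))`, provided the directory already exists.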

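Because this involves a lot of memory copying and the machine's capacity is limited, large variables need to be released as soon as they are no longer used.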
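A minimal sketch of that pattern, using a made-up frame: measure the size, keep only the small result that is still needed, then drop the large object and ask the garbage collector to run. The helper name `memory_mb` is hypothetical.

```python
import gc

import numpy as np
import pandas as pd

def memory_mb(df: pd.DataFrame) -> float:
    """Return the in-memory size of a DataFrame in megabytes."""
    return df.memory_usage(deep=True).sum() / 1024 ** 2

# Made-up large intermediate table: 100,000 rows x 10 float64 columns.
big = pd.DataFrame(np.zeros((100_000, 10)))
size_before = memory_mb(big)

# Keep only the small result we still need.
summary = big.mean()

# Drop the reference to the large frame and trigger a collection.
del big
gc.collect()

print(f"released roughly {size_before:.1f} MB")
```

Memory is only reclaimed once no other variable still refers to the same data, so every alias of the large frame has to be deleted as well.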

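In Part 1, we bring in the bureau and bureau_balance tables and walk through the unified feature process in detail:

  • Adding features: aggregate ID, categorical, and numeric columns, including multi-level aggregation

  • Removing features: drop features with a high share of missing values or a high correlation with other features.

After processing, we merge the result with the application_train/test main tables and save the features for later use.

In Part 2, we bring in the remaining four tables. For this unified process we define functions such as agg_numeric, agg_categorical, get_high_corr_columns, get_high_missing_columns, and feature_select, which speeds up the work.

Once the features of all tables have been processed, we combine them and train a model to see what the new features contribute. Many of the new features do end up among the important features for the classifier, and the score improves as a result.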

Imports#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gc
import warnings
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')

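Part 1#

Part 1 does two things:

  • Brings in the bureau and bureau_balance features, which describe the client's previous credits at other financial institutions as reported to the credit bureau

  • Describes the general workflow of manual feature engineering

    • Categorical features: one-hot encode, then aggregate with sum and mean

    • ID features: count

    • Numeric features: aggregate with statistics such as min, mean, max, and sum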
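Of the three cases, the categorical one is the least obvious, so here is a minimal sketch on made-up data. The `credits` frame and its column names are hypothetical. After one-hot encoding, the per-client sum is the number of credits in a category and the per-client mean is the share of credits in it.

```python
import pandas as pd

# Hypothetical child table: one row per previous credit.
credits = pd.DataFrame({
    "client_id": [1, 1, 1, 2],
    "status": ["Active", "Closed", "Closed", "Active"],
})

# One-hot encode the categorical column, then put the grouping key back.
dummies = pd.get_dummies(credits[["status"]])
dummies["client_id"] = credits["client_id"]

# sum = number of credits in each category; mean = share of credits in it.
cat_agg = dummies.groupby("client_id").agg(["sum", "mean"])

# Flatten the two-level column index into names like status_Closed_sum.
cat_agg.columns = [f"{col}_{stat}" for col, stat in cat_agg.columns]
cat_agg = cat_agg.reset_index()
print(cat_agg)
```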

bureau#

  • Produces bureau_agg

train = pd.read_feather('checkpoints/01_train_app_base.feather')
test = pd.read_feather('checkpoints/01_test_app_base.feather')
print(train.shape, test.shape)
(307511, 243) (48744, 242)
bureau = pd.read_csv('data/bureau.csv')
bureau.shape
(1716428, 17)

SK_ID_BUREAU count#

  • Produces the bureau_previous_loan_counts feature set

bureau_previous_loan_counts = bureau.groupby(by='SK_ID_CURR', as_index= False)['SK_ID_BUREAU'].count().rename(columns = {
    'SK_ID_BUREAU': 'bureau_previous_loan_counts'
})
bureau_previous_loan_counts.sort_values(by='bureau_previous_loan_counts', ascending=False)
SK_ID_CURR bureau_previous_loan_counts
17942 120860 116
59911 169704 94
187259 318065 78
130327 251643 61
279405 425396 60
... ... ...
19 100025 1
49 100061 1
39 100048 1
185447 315945 1
305809 456254 1

305811 rows × 2 columns

bureau_previous_loan_counts.shape
(305811, 2)

Assessing Usefulness of New Variable with r value
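For reference, the r value used here is the Pearson correlation coefficient. Writing $x_i$ for the new feature and $y_i$ for TARGET on row $i$ of $n$ training rows (these symbols are introduced only for this note), it is:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Because TARGET is binary, this coincides with the point-biserial correlation. Values near 0 indicate little linear association between the feature and the label.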

train[['SK_ID_CURR','TARGET']]
SK_ID_CURR TARGET
0 100002 1
1 100003 0
2 100004 0
3 100006 0
4 100007 0
... ... ...
307506 456251 0
307507 456252 0
307508 456253 0
307509 456254 1
307510 456255 0

307511 rows × 2 columns

bureau_previous_loan_counts_plot = pd.merge(train[['SK_ID_CURR','TARGET']], bureau_previous_loan_counts, on='SK_ID_CURR', how='left')
bureau_previous_loan_counts_plot = bureau_previous_loan_counts_plot.fillna(0)
bureau_previous_loan_counts_plot
SK_ID_CURR TARGET bureau_previous_loan_counts
0 100002 1 8.0
1 100003 0 4.0
2 100004 0 2.0
3 100006 0 0.0
4 100007 0 1.0
... ... ... ...
307506 456251 0 0.0
307507 456252 0 0.0
307508 456253 0 4.0
307509 456254 1 1.0
307510 456255 0 11.0

307511 rows × 3 columns

sns.kdeplot(
    data= bureau_previous_loan_counts_plot,
    x = 'bureau_previous_loan_counts',
    hue = 'TARGET',
    common_norm = False
)
plt.title('bureau_previous_loan_counts distribution')
Text(0.5, 1.0, 'bureau_previous_loan_counts distribution')
../../_images/adccf21ab2f2d26346070a5503d33be6bf971f8460d12cbb98e42ce15d02f475.png

The two distributions are almost identical; there is hardly any difference.

corr = np.corrcoef(bureau_previous_loan_counts_plot['bureau_previous_loan_counts'],
     bureau_previous_loan_counts_plot['TARGET'])[0, 1]
print(f'correlations bureau_previous_loan_counts and TARGET is {corr}')
correlations bureau_previous_loan_counts and TARGET is -0.010019715670684074
del bureau_previous_loan_counts_plot
gc.collect()
182

The correlation is very low and would not have made the earlier top 5, so this feature tells us little.

Aggregate the numeric columns#

  • Produces the new feature set bureau_numeric_agg

bureau.dtypes
SK_ID_CURR                  int64
SK_ID_BUREAU                int64
CREDIT_ACTIVE                 str
CREDIT_CURRENCY               str
DAYS_CREDIT                 int64
CREDIT_DAY_OVERDUE          int64
DAYS_CREDIT_ENDDATE       float64
DAYS_ENDDATE_FACT         float64
AMT_CREDIT_MAX_OVERDUE    float64
CNT_CREDIT_PROLONG          int64
AMT_CREDIT_SUM            float64
AMT_CREDIT_SUM_DEBT       float64
AMT_CREDIT_SUM_LIMIT      float64
AMT_CREDIT_SUM_OVERDUE    float64
CREDIT_TYPE                   str
DAYS_CREDIT_UPDATE          int64
AMT_ANNUITY               float64
dtype: object
bureau_numeric = bureau.select_dtypes(exclude=['str'])
bureau_numeric_agg = bureau_numeric.drop(columns=['SK_ID_BUREAU']).groupby(by='SK_ID_CURR', as_index=False).agg(
    ['min','mean',  'max', 'sum']
)
bureau_numeric_agg
SK_ID_CURR DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE ... AMT_CREDIT_SUM_OVERDUE DAYS_CREDIT_UPDATE AMT_ANNUITY
min mean max sum min mean max sum min ... max sum min mean max sum min mean max sum
0 100001 -1572 -735.000000 -49 -5145 0 0.0 0 0 -1329.0 ... 0.0 0.0 -155 -93.142857 -6 -652 0.000 3545.357143 10822.5 24817.500
1 100002 -1437 -874.000000 -103 -6992 0 0.0 0 0 -1072.0 ... 0.0 0.0 -1185 -499.875000 -7 -3999 0.000 0.000000 0.0 0.000
2 100003 -2586 -1400.750000 -606 -5603 0 0.0 0 0 -2434.0 ... 0.0 0.0 -2131 -816.000000 -43 -3264 NaN NaN NaN 0.000
3 100004 -1326 -867.000000 -408 -1734 0 0.0 0 0 -595.0 ... 0.0 0.0 -682 -532.000000 -382 -1064 NaN NaN NaN 0.000
4 100005 -373 -190.666667 -62 -572 0 0.0 0 0 -128.0 ... 0.0 0.0 -121 -54.333333 -11 -163 0.000 1420.500000 4261.5 4261.500
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
305806 456249 -2713 -1667.076923 -483 -21672 0 0.0 0 0 -2499.0 ... 0.0 0.0 -2498 -1064.538462 -12 -13839 NaN NaN NaN 0.000
305807 456250 -1002 -862.000000 -760 -2586 0 0.0 0 0 -272.0 ... 0.0 0.0 -127 -60.333333 -23 -181 27757.395 154567.965000 384147.0 463703.895
305808 456253 -919 -867.500000 -713 -3470 0 0.0 0 0 -189.0 ... 0.0 0.0 -701 -253.250000 -5 -1013 58369.500 58369.500000 58369.5 175108.500
305809 456254 -1104 -1104.000000 -1104 -1104 0 0.0 0 0 -859.0 ... 0.0 0.0 -401 -401.000000 -401 -401 0.000 0.000000 0.0 0.000
305810 456255 -2337 -1089.454545 -363 -11984 0 0.0 0 0 -1243.0 ... 0.0 0.0 -1621 -531.090909 -8 -5842 0.000 1081.500000 3244.5 9733.500

305811 rows × 49 columns

We flatten the multi-level column index:

bureau_numeric_agg.columns = [
    f"bureau_{col[0]}_{col[1]}" if col[1] != "" else col[0] 
    for col in bureau_numeric_agg.columns.values
]
bureau_numeric_agg
SK_ID_CURR bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_min bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_sum bureau_DAYS_CREDIT_ENDDATE_min ... bureau_AMT_CREDIT_SUM_OVERDUE_max bureau_AMT_CREDIT_SUM_OVERDUE_sum bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_sum
0 100001 -1572 -735.000000 -49 -5145 0 0.0 0 0 -1329.0 ... 0.0 0.0 -155 -93.142857 -6 -652 0.000 3545.357143 10822.5 24817.500
1 100002 -1437 -874.000000 -103 -6992 0 0.0 0 0 -1072.0 ... 0.0 0.0 -1185 -499.875000 -7 -3999 0.000 0.000000 0.0 0.000
2 100003 -2586 -1400.750000 -606 -5603 0 0.0 0 0 -2434.0 ... 0.0 0.0 -2131 -816.000000 -43 -3264 NaN NaN NaN 0.000
3 100004 -1326 -867.000000 -408 -1734 0 0.0 0 0 -595.0 ... 0.0 0.0 -682 -532.000000 -382 -1064 NaN NaN NaN 0.000
4 100005 -373 -190.666667 -62 -572 0 0.0 0 0 -128.0 ... 0.0 0.0 -121 -54.333333 -11 -163 0.000 1420.500000 4261.5 4261.500
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
305806 456249 -2713 -1667.076923 -483 -21672 0 0.0 0 0 -2499.0 ... 0.0 0.0 -2498 -1064.538462 -12 -13839 NaN NaN NaN 0.000
305807 456250 -1002 -862.000000 -760 -2586 0 0.0 0 0 -272.0 ... 0.0 0.0 -127 -60.333333 -23 -181 27757.395 154567.965000 384147.0 463703.895
305808 456253 -919 -867.500000 -713 -3470 0 0.0 0 0 -189.0 ... 0.0 0.0 -701 -253.250000 -5 -1013 58369.500 58369.500000 58369.5 175108.500
305809 456254 -1104 -1104.000000 -1104 -1104 0 0.0 0 0 -859.0 ... 0.0 0.0 -401 -401.000000 -401 -401 0.000 0.000000 0.0 0.000
305810 456255 -2337 -1089.454545 -363 -11984 0 0.0 0 0 -1243.0 ... 0.0 0.0 -1621 -531.090909 -8 -5842 0.000 1081.500000 3244.5 9733.500

305811 rows × 49 columns

bureau_numeric_agg.shape
(305811, 49)

Let's compute the correlations with TARGET.

new_columns = list(bureau_numeric_agg.columns)
new_columns.remove('SK_ID_CURR')
new_columns
['bureau_DAYS_CREDIT_min',
 'bureau_DAYS_CREDIT_mean',
 'bureau_DAYS_CREDIT_max',
 'bureau_DAYS_CREDIT_sum',
 'bureau_CREDIT_DAY_OVERDUE_min',
 'bureau_CREDIT_DAY_OVERDUE_mean',
 'bureau_CREDIT_DAY_OVERDUE_max',
 'bureau_CREDIT_DAY_OVERDUE_sum',
 'bureau_DAYS_CREDIT_ENDDATE_min',
 'bureau_DAYS_CREDIT_ENDDATE_mean',
 'bureau_DAYS_CREDIT_ENDDATE_max',
 'bureau_DAYS_CREDIT_ENDDATE_sum',
 'bureau_DAYS_ENDDATE_FACT_min',
 'bureau_DAYS_ENDDATE_FACT_mean',
 'bureau_DAYS_ENDDATE_FACT_max',
 'bureau_DAYS_ENDDATE_FACT_sum',
 'bureau_AMT_CREDIT_MAX_OVERDUE_min',
 'bureau_AMT_CREDIT_MAX_OVERDUE_mean',
 'bureau_AMT_CREDIT_MAX_OVERDUE_max',
 'bureau_AMT_CREDIT_MAX_OVERDUE_sum',
 'bureau_CNT_CREDIT_PROLONG_min',
 'bureau_CNT_CREDIT_PROLONG_mean',
 'bureau_CNT_CREDIT_PROLONG_max',
 'bureau_CNT_CREDIT_PROLONG_sum',
 'bureau_AMT_CREDIT_SUM_min',
 'bureau_AMT_CREDIT_SUM_mean',
 'bureau_AMT_CREDIT_SUM_max',
 'bureau_AMT_CREDIT_SUM_sum',
 'bureau_AMT_CREDIT_SUM_DEBT_min',
 'bureau_AMT_CREDIT_SUM_DEBT_mean',
 'bureau_AMT_CREDIT_SUM_DEBT_max',
 'bureau_AMT_CREDIT_SUM_DEBT_sum',
 'bureau_AMT_CREDIT_SUM_LIMIT_min',
 'bureau_AMT_CREDIT_SUM_LIMIT_mean',
 'bureau_AMT_CREDIT_SUM_LIMIT_max',
 'bureau_AMT_CREDIT_SUM_LIMIT_sum',
 'bureau_AMT_CREDIT_SUM_OVERDUE_min',
 'bureau_AMT_CREDIT_SUM_OVERDUE_mean',
 'bureau_AMT_CREDIT_SUM_OVERDUE_max',
 'bureau_AMT_CREDIT_SUM_OVERDUE_sum',
 'bureau_DAYS_CREDIT_UPDATE_min',
 'bureau_DAYS_CREDIT_UPDATE_mean',
 'bureau_DAYS_CREDIT_UPDATE_max',
 'bureau_DAYS_CREDIT_UPDATE_sum',
 'bureau_AMT_ANNUITY_min',
 'bureau_AMT_ANNUITY_mean',
 'bureau_AMT_ANNUITY_max',
 'bureau_AMT_ANNUITY_sum']
corrs = bureau_numeric_agg[new_columns].corrwith(train['TARGET']).sort_values(ascending=False)
corrs.head()
bureau_AMT_CREDIT_MAX_OVERDUE_min     0.007180
bureau_AMT_CREDIT_MAX_OVERDUE_mean    0.006485
bureau_AMT_CREDIT_SUM_OVERDUE_max     0.005342
bureau_AMT_CREDIT_MAX_OVERDUE_max     0.005038
bureau_AMT_CREDIT_SUM_OVERDUE_sum     0.004478
dtype: float64
corrs.tail()
bureau_AMT_CREDIT_SUM_DEBT_sum   -0.001602
bureau_AMT_ANNUITY_mean          -0.001616
bureau_AMT_CREDIT_SUM_DEBT_max   -0.002066
bureau_DAYS_CREDIT_max           -0.002341
bureau_DAYS_ENDDATE_FACT_max     -0.002621
dtype: float64

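Taken at face value, none of these features looks useful: every correlation is below 0.01 in absolute value. These numbers should not be trusted, though. corrwith matches rows by index label, and the rows of bureau_numeric_agg (one per SK_ID_CURR present in bureau) are not in the same order as the rows of train, so each feature value is paired with another client's TARGET. Once the features are merged on SK_ID_CURR, as in the feature selection section, bureau_DAYS_CREDIT_mean reaches a correlation of about 0.09 with TARGET.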
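A minimal sketch of how row alignment affects `corrwith`, on made-up data. The frames `labels` and `feats` are hypothetical; the point is only that merging on the key first pairs each feature value with the right label.

```python
import pandas as pd

# Hypothetical label table: one row per client.
labels = pd.DataFrame({"SK_ID_CURR": [1, 2, 3, 4], "TARGET": [0, 1, 0, 1]})

# Hypothetical aggregated features: different row order, client 3 missing.
feats = pd.DataFrame({"SK_ID_CURR": [4, 2, 1], "feat": [10.0, 9.0, 1.0]})

# Correct: merge on the key, then correlate columns of the same frame.
merged = labels.merge(feats, on="SK_ID_CURR", how="left")
aligned_corr = merged[["feat"]].corrwith(merged["TARGET"])

# Misleading: rows are matched by index label 0, 1, 2, not by SK_ID_CURR.
positional_corr = feats[["feat"]].corrwith(labels["TARGET"])

print(aligned_corr["feat"], positional_corr["feat"])
```

The two results differ substantially even on four rows, which is why correlations should be computed only after the join.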

Next, let's plot one of the features.

print(train.shape, bureau_numeric_agg.shape)
(307511, 243) (305811, 49)
bureau_numeric_agg_plot = pd.merge(train[['SK_ID_CURR', 'TARGET']], 
    bureau_numeric_agg[['SK_ID_CURR', 'bureau_DAYS_CREDIT_mean']],
    on = 'SK_ID_CURR',
    how = 'left'
    )
plt.figure(figsize=(5,3))
sns.kdeplot(
    data = bureau_numeric_agg_plot,
    x = 'bureau_DAYS_CREDIT_mean',
    hue = 'TARGET',
    common_norm=False
)
<Axes: xlabel='bureau_DAYS_CREDIT_mean', ylabel='Density'>
../../_images/328d9f56951a31fca5e57f66d7c241c628e87d93263d34c46c5b65aa5dc4fbd6.png

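The two distributions do differ somewhat. DAYS_CREDIT counts the days between a bureau credit application and the current application as a negative number, so clients whose bureau_DAYS_CREDIT_mean is closer to zero, meaning their previous credits are more recent on average, appear more likely to default.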

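Categorical columns#

  • Produces bureau_categorical_agg

For categorical variables, we one-hot encode each category and then compute, per client, the count (sum) and the share (mean) of each category.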

bureau_categorical = pd.get_dummies(bureau.select_dtypes(include='str'))
bureau_categorical['SK_ID_CURR'] = bureau['SK_ID_CURR']
bureau_categorical
CREDIT_ACTIVE_Active CREDIT_ACTIVE_Bad debt CREDIT_ACTIVE_Closed CREDIT_ACTIVE_Sold CREDIT_CURRENCY_currency 1 CREDIT_CURRENCY_currency 2 CREDIT_CURRENCY_currency 3 CREDIT_CURRENCY_currency 4 CREDIT_TYPE_Another type of loan CREDIT_TYPE_Car loan ... CREDIT_TYPE_Loan for business development CREDIT_TYPE_Loan for purchase of shares (margin lending) CREDIT_TYPE_Loan for the purchase of equipment CREDIT_TYPE_Loan for working capital replenishment CREDIT_TYPE_Microloan CREDIT_TYPE_Mobile operator loan CREDIT_TYPE_Mortgage CREDIT_TYPE_Real estate loan CREDIT_TYPE_Unknown type of loan SK_ID_CURR
0 False False True False True False False False False False ... False False False False False False False False False 215354
1 True False False False True False False False False False ... False False False False False False False False False 215354
2 True False False False True False False False False False ... False False False False False False False False False 215354
3 True False False False True False False False False False ... False False False False False False False False False 215354
4 True False False False True False False False False False ... False False False False False False False False False 215354
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1716423 True False False False True False False False False False ... False False False False True False False False False 259355
1716424 False False True False True False False False False False ... False False False False False False False False False 100044
1716425 False False True False True False False False False False ... False False False False False False False False False 100044
1716426 False False True False True False False False False False ... False False False False False False False False False 246829
1716427 False False True False True False False False False False ... False False False False True False False False False 246829

1716428 rows × 24 columns

bureau_categorical_agg = bureau_categorical.groupby(by='SK_ID_CURR').agg(
    ['sum', 'mean']
)
bureau_categorical_agg.head()
CREDIT_ACTIVE_Active CREDIT_ACTIVE_Bad debt CREDIT_ACTIVE_Closed CREDIT_ACTIVE_Sold CREDIT_CURRENCY_currency 1 ... CREDIT_TYPE_Microloan CREDIT_TYPE_Mobile operator loan CREDIT_TYPE_Mortgage CREDIT_TYPE_Real estate loan CREDIT_TYPE_Unknown type of loan
sum mean sum mean sum mean sum mean sum mean ... sum mean sum mean sum mean sum mean sum mean
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

5 rows × 46 columns

bureau_categorical_agg.columns[0]
('CREDIT_ACTIVE_Active', 'sum')
bureau_categorical_agg.columns = [ f'bureau_{col[0]}_{col[1]}' for col in bureau_categorical_agg.columns]
bureau_categorical_agg.head()
bureau_CREDIT_ACTIVE_Active_sum bureau_CREDIT_ACTIVE_Active_mean bureau_CREDIT_ACTIVE_Bad debt_sum bureau_CREDIT_ACTIVE_Bad debt_mean bureau_CREDIT_ACTIVE_Closed_sum bureau_CREDIT_ACTIVE_Closed_mean bureau_CREDIT_ACTIVE_Sold_sum bureau_CREDIT_ACTIVE_Sold_mean bureau_CREDIT_CURRENCY_currency 1_sum bureau_CREDIT_CURRENCY_currency 1_mean ... bureau_CREDIT_TYPE_Microloan_sum bureau_CREDIT_TYPE_Microloan_mean bureau_CREDIT_TYPE_Mobile operator loan_sum bureau_CREDIT_TYPE_Mobile operator loan_mean bureau_CREDIT_TYPE_Mortgage_sum bureau_CREDIT_TYPE_Mortgage_mean bureau_CREDIT_TYPE_Real estate loan_sum bureau_CREDIT_TYPE_Real estate loan_mean bureau_CREDIT_TYPE_Unknown type of loan_sum bureau_CREDIT_TYPE_Unknown type of loan_mean
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

5 rows × 46 columns

bureau_categorical_agg = bureau_categorical_agg.reset_index()
bureau_categorical_agg.head()
SK_ID_CURR bureau_CREDIT_ACTIVE_Active_sum bureau_CREDIT_ACTIVE_Active_mean bureau_CREDIT_ACTIVE_Bad debt_sum bureau_CREDIT_ACTIVE_Bad debt_mean bureau_CREDIT_ACTIVE_Closed_sum bureau_CREDIT_ACTIVE_Closed_mean bureau_CREDIT_ACTIVE_Sold_sum bureau_CREDIT_ACTIVE_Sold_mean bureau_CREDIT_CURRENCY_currency 1_sum ... bureau_CREDIT_TYPE_Microloan_sum bureau_CREDIT_TYPE_Microloan_mean bureau_CREDIT_TYPE_Mobile operator loan_sum bureau_CREDIT_TYPE_Mobile operator loan_mean bureau_CREDIT_TYPE_Mortgage_sum bureau_CREDIT_TYPE_Mortgage_mean bureau_CREDIT_TYPE_Real estate loan_sum bureau_CREDIT_TYPE_Real estate loan_mean bureau_CREDIT_TYPE_Unknown type of loan_sum bureau_CREDIT_TYPE_Unknown type of loan_mean
0 100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
1 100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
2 100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
3 100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
4 100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

5 rows × 47 columns

Now we combine the feature sets above.

print(bureau_categorical_agg.shape, bureau_previous_loan_counts.shape, bureau_numeric_agg.shape)
(305811, 47) (305811, 2) (305811, 49)
dfs = [df.set_index('SK_ID_CURR') for df in [bureau_categorical_agg, bureau_previous_loan_counts, bureau_numeric_agg]]
bureau_agg = pd.concat(dfs, axis=1)
bureau_agg = bureau_agg.reset_index()
bureau_agg.head()
SK_ID_CURR bureau_CREDIT_ACTIVE_Active_sum bureau_CREDIT_ACTIVE_Active_mean bureau_CREDIT_ACTIVE_Bad debt_sum bureau_CREDIT_ACTIVE_Bad debt_mean bureau_CREDIT_ACTIVE_Closed_sum bureau_CREDIT_ACTIVE_Closed_mean bureau_CREDIT_ACTIVE_Sold_sum bureau_CREDIT_ACTIVE_Sold_mean bureau_CREDIT_CURRENCY_currency 1_sum ... bureau_AMT_CREDIT_SUM_OVERDUE_max bureau_AMT_CREDIT_SUM_OVERDUE_sum bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_sum
0 100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 ... 0.0 0.0 -155 -93.142857 -6 -652 0.0 3545.357143 10822.5 24817.5
1 100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 ... 0.0 0.0 -1185 -499.875000 -7 -3999 0.0 0.000000 0.0 0.0
2 100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 ... 0.0 0.0 -2131 -816.000000 -43 -3264 NaN NaN NaN 0.0
3 100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 ... 0.0 0.0 -682 -532.000000 -382 -1064 NaN NaN NaN 0.0
4 100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 ... 0.0 0.0 -121 -54.333333 -11 -163 0.0 1420.500000 4261.5 4261.5

5 rows × 96 columns

bureau_agg.shape
(305811, 96)
bureau_agg.to_feather('checkpoints/02_bureau_agg.feather')
del bureau_categorical_agg, bureau_previous_loan_counts, bureau_numeric_agg,bureau_agg
gc.collect()
132

bureau_balance#

  • Produces bureau_balance_by_client_agg

train = pd.read_feather('checkpoints/01_train_app_base.feather')
test = pd.read_feather('checkpoints/01_test_app_base.feather')
bureau_balance = pd.read_csv('data/bureau_balance.csv')
bureau_balance.shape
(27299925, 3)
bureau_balance.head()
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C
bureau_balance.dtypes
SK_ID_BUREAU      int64
MONTHS_BALANCE    int64
STATUS              str
dtype: object

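MONTHS_BALANCE can therefore be aggregated as a numeric feature, and STATUS as a categorical one.

Much of this repeats the steps used for bureau, so we write two functions for it: agg_numeric and agg_categorical.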

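def agg_numeric(df : pd.DataFrame, group_column, df_name = '', exclude_columns = None):
    """ Aggregate numeric features with the usual statistics: ['min', 'max', 'mean', 'sum'].
    group_column: column to group by
    df_name: optional prefix for the new column names
    exclude_columns: columns to leave out of the aggregation, such as id columns; usually not needed.
    """
    # Avoid a mutable default argument.
    exclude_columns = [] if exclude_columns is None else exclude_columns
    numeric_df = df.select_dtypes('number')
    numeric_df = numeric_df.drop(columns = exclude_columns)

    # Put the grouping key back in case it was excluded above.
    numeric_df[group_column] = df[group_column]
    numeric_df_agg = numeric_df.groupby(by = group_column).agg(
        ['min', 'max', 'mean', 'sum']
    )
    prefix = f'{df_name}_' if df_name != '' else ''
    # Flatten the two-level columns into upper-case names such as PREFIX_COLUMN_STAT.
    numeric_df_agg.columns = [
        f"{prefix}{col[0]}_{col[1]}".upper()
        for col in numeric_df_agg.columns.values
    ]
    numeric_df_agg = numeric_df_agg.reset_index()
    return numeric_df_agg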
def agg_categorical(df : pd.DataFrame, group_column, df_name=''):
    """ Aggregate categorical features: one-hot encode, then take ['mean', 'sum'] per group.
    group_column: column to group by
    df_name: optional prefix for the new column names
    """
    categorical_df = pd.get_dummies(df.select_dtypes(include = ['str', 'object', 'category']))
    categorical_df[group_column] = df[group_column]
    categorical_df_agg = categorical_df.groupby(by = group_column).agg(
        ['mean', 'sum']
    )
    # Same prefix rule as agg_numeric, so an empty df_name adds no leading underscore.
    prefix = f'{df_name}_' if df_name != '' else ''
    categorical_df_agg.columns =  [
        f"{prefix}{col[0]}_{col[1]}"
        for col in categorical_df_agg.columns.values
    ]
    categorical_df_agg = categorical_df_agg.reset_index()
    return categorical_df_agg
bureau_balance_numeric_agg = agg_numeric(bureau_balance, 'SK_ID_BUREAU', 'bureau', exclude_columns=['SK_ID_BUREAU'])
bureau_balance_numeric_agg.head()
SK_ID_BUREAU BUREAU_MONTHS_BALANCE_MIN BUREAU_MONTHS_BALANCE_MAX BUREAU_MONTHS_BALANCE_MEAN BUREAU_MONTHS_BALANCE_SUM
0 5001709 -96 0 -48.0 -4656
1 5001710 -82 0 -41.0 -3403
2 5001711 -3 0 -1.5 -6
3 5001712 -18 0 -9.0 -171
4 5001713 -21 0 -10.5 -231
bureau_balance_numeric_agg.shape
(817395, 5)
bureau_balance_categorical_agg = agg_categorical(bureau_balance, 'SK_ID_BUREAU', 'bureau')
bureau_balance_categorical_agg.head()
SK_ID_BUREAU bureau_STATUS_0_mean bureau_STATUS_0_sum bureau_STATUS_1_mean bureau_STATUS_1_sum bureau_STATUS_2_mean bureau_STATUS_2_sum bureau_STATUS_3_mean bureau_STATUS_3_sum bureau_STATUS_4_mean bureau_STATUS_4_sum bureau_STATUS_5_mean bureau_STATUS_5_sum bureau_STATUS_C_mean bureau_STATUS_C_sum bureau_STATUS_X_mean bureau_STATUS_X_sum
0 5001709 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.886598 86 0.113402 11
1 5001710 0.060241 5 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.578313 48 0.361446 30
2 5001711 0.750000 3 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.250000 1
3 5001712 0.526316 10 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.473684 9 0.000000 0
4 5001713 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 1.000000 22
bureau_balance_categorical_agg.shape
(817395, 17)
bureau_balance_agg = pd.merge(bureau_balance_numeric_agg, bureau_balance_categorical_agg, on='SK_ID_BUREAU', how='left')
bureau_balance_agg.head()
SK_ID_BUREAU BUREAU_MONTHS_BALANCE_MIN BUREAU_MONTHS_BALANCE_MAX BUREAU_MONTHS_BALANCE_MEAN BUREAU_MONTHS_BALANCE_SUM bureau_STATUS_0_mean bureau_STATUS_0_sum bureau_STATUS_1_mean bureau_STATUS_1_sum bureau_STATUS_2_mean ... bureau_STATUS_3_mean bureau_STATUS_3_sum bureau_STATUS_4_mean bureau_STATUS_4_sum bureau_STATUS_5_mean bureau_STATUS_5_sum bureau_STATUS_C_mean bureau_STATUS_C_sum bureau_STATUS_X_mean bureau_STATUS_X_sum
0 5001709 -96 0 -48.0 -4656 0.000000 0 0.0 0 0.0 ... 0.0 0 0.0 0 0.0 0 0.886598 86 0.113402 11
1 5001710 -82 0 -41.0 -3403 0.060241 5 0.0 0 0.0 ... 0.0 0 0.0 0 0.0 0 0.578313 48 0.361446 30
2 5001711 -3 0 -1.5 -6 0.750000 3 0.0 0 0.0 ... 0.0 0 0.0 0 0.0 0 0.000000 0 0.250000 1
3 5001712 -18 0 -9.0 -171 0.526316 10 0.0 0 0.0 ... 0.0 0 0.0 0 0.0 0 0.473684 9 0.000000 0
4 5001713 -21 0 -10.5 -231 0.000000 0 0.0 0 0.0 ... 0.0 0 0.0 0 0.0 0 0.000000 0 1.000000 22

5 rows × 21 columns

Next we aggregate further, up to the SK_ID_CURR level.

bureau[['SK_ID_CURR', 'SK_ID_BUREAU']]
SK_ID_CURR SK_ID_BUREAU
0 215354 5714462
1 215354 5714463
2 215354 5714464
3 215354 5714465
4 215354 5714466
... ... ...
1716423 259355 5057750
1716424 100044 5057754
1716425 100044 5057762
1716426 246829 5057770
1716427 246829 5057778

1716428 rows × 2 columns

bureau_balance_by_client = pd.merge(
    bureau_balance_agg,
    bureau[['SK_ID_CURR', 'SK_ID_BUREAU']], on='SK_ID_BUREAU', how='right'
    )
bureau_balance_by_client.head()
SK_ID_BUREAU BUREAU_MONTHS_BALANCE_MIN BUREAU_MONTHS_BALANCE_MAX BUREAU_MONTHS_BALANCE_MEAN BUREAU_MONTHS_BALANCE_SUM bureau_STATUS_0_mean bureau_STATUS_0_sum bureau_STATUS_1_mean bureau_STATUS_1_sum bureau_STATUS_2_mean ... bureau_STATUS_3_sum bureau_STATUS_4_mean bureau_STATUS_4_sum bureau_STATUS_5_mean bureau_STATUS_5_sum bureau_STATUS_C_mean bureau_STATUS_C_sum bureau_STATUS_X_mean bureau_STATUS_X_sum SK_ID_CURR
0 5714462 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 215354
1 5714463 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 215354
2 5714464 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 215354
3 5714465 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 215354
4 5714466 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 215354

5 rows × 22 columns

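Each SK_ID_CURR appears with several SK_ID_BUREAU values here, because a client can have several previous credit records, so we aggregate once more.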

bureau_balance_by_client_agg = agg_numeric(bureau_balance_by_client, 'SK_ID_CURR', '', exclude_columns=['SK_ID_CURR', 'SK_ID_BUREAU'])
bureau_balance_by_client_agg.head()
SK_ID_CURR BUREAU_MONTHS_BALANCE_MIN_MIN BUREAU_MONTHS_BALANCE_MIN_MAX BUREAU_MONTHS_BALANCE_MIN_MEAN BUREAU_MONTHS_BALANCE_MIN_SUM BUREAU_MONTHS_BALANCE_MAX_MIN BUREAU_MONTHS_BALANCE_MAX_MAX BUREAU_MONTHS_BALANCE_MAX_MEAN BUREAU_MONTHS_BALANCE_MAX_SUM BUREAU_MONTHS_BALANCE_MEAN_MIN ... BUREAU_STATUS_C_SUM_MEAN BUREAU_STATUS_C_SUM_SUM BUREAU_STATUS_X_MEAN_MIN BUREAU_STATUS_X_MEAN_MAX BUREAU_STATUS_X_MEAN_MEAN BUREAU_STATUS_X_MEAN_SUM BUREAU_STATUS_X_SUM_MIN BUREAU_STATUS_X_SUM_MAX BUREAU_STATUS_X_SUM_MEAN BUREAU_STATUS_X_SUM_SUM
0 100001 -51.0 -1.0 -23.571429 -165.0 0.0 0.0 0.0 0.0 -25.5 ... 15.714286 110.0 0.0 0.500000 0.214590 1.502129 0.0 9.0 4.285714 30.0
1 100002 -47.0 -3.0 -28.250000 -226.0 -32.0 0.0 -15.5 -124.0 -39.5 ... 2.875000 23.0 0.0 0.500000 0.161932 1.295455 0.0 3.0 1.875000 15.0
2 100003 NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN ... NaN 0.0 NaN NaN NaN 0.000000 NaN NaN NaN 0.0
3 100004 NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN ... NaN 0.0 NaN NaN NaN 0.000000 NaN NaN NaN 0.0
4 100005 -12.0 -2.0 -6.000000 -18.0 0.0 0.0 0.0 0.0 -6.0 ... 1.666667 5.0 0.0 0.333333 0.136752 0.410256 0.0 1.0 0.666667 2.0

5 rows × 81 columns

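Because the relationship is one-to-many, the merged table has more rows than bureau_balance_agg: one per bureau credit. Grouping by SK_ID_CURR then returns one row per client.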
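The two-level aggregation can be sketched on made-up data as follows. The frames `monthly` and `links` and their column names are hypothetical stand-ins for a monthly-record table and a credit-to-client link table.

```python
import pandas as pd

# Hypothetical monthly records: several rows per credit.
monthly = pd.DataFrame({
    "credit_id": [10, 10, 11, 11, 11],
    "months_balance": [0, -1, 0, -1, -2],
})
# Hypothetical link table: several credits per client; credit 12 has no monthly rows.
links = pd.DataFrame({
    "client_id": [1, 1, 2],
    "credit_id": [10, 11, 12],
})

# Level 1: one row per credit.
per_credit = monthly.groupby("credit_id")["months_balance"].agg(["min", "mean"])
per_credit.columns = [f"months_balance_{stat}" for stat in per_credit.columns]
per_credit = per_credit.reset_index()

# Attach the client id; how="right" keeps credits without monthly rows (as NaN).
per_credit = per_credit.merge(links, on="credit_id", how="right")

# Level 2: one row per client, aggregating the per-credit statistics.
per_client = (
    per_credit.drop(columns="credit_id")
    .groupby("client_id")
    .agg(["min", "mean"])
)
per_client.columns = [f"{col}_{stat}" for col, stat in per_client.columns]
per_client = per_client.reset_index()
print(per_client)
```

Each level-1 statistic is aggregated again at level 2, which is why the column count multiplies and names such as `months_balance_min_mean` appear.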

bureau_balance_by_client_agg.to_feather('checkpoints/02_bureau_balance_agg.feather')

Feature selection#

bureau_balance_agg = pd.read_feather('checkpoints/02_bureau_balance_agg.feather')
bureau_agg = pd.read_feather('checkpoints/02_bureau_agg.feather')

train = pd.read_feather('checkpoints/01_train_app_base.feather')
test = pd.read_feather('checkpoints/01_test_app_base.feather')
print(train.shape, test.shape)

train = pd.merge(train, bureau_agg, on='SK_ID_CURR', how='left')
train = pd.merge(train, bureau_balance_agg, on='SK_ID_CURR', how='left')
test = pd.merge(test, bureau_agg, on='SK_ID_CURR', how='left')
test = pd.merge(test, bureau_balance_agg, on='SK_ID_CURR', how='left')

print(train.shape, test.shape)
(307511, 243) (48744, 242)
(307511, 418) (48744, 417)

Missing values#

def missing_values_table(df):
    """ Summarize missing values: count and percentage per column, sorted by percentage.
    """
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(columns= {0:'Missing Values', 1:'% of total values'})
    mis_val_table_ren_columns = mis_val_table_ren_columns.sort_values(by='% of total values', ascending = False)
    mis_val_table_ren_columns = mis_val_table_ren_columns.loc[mis_val_table_ren_columns['% of total values'] != 0, :]

    return mis_val_table_ren_columns
missing_train = missing_values_table(train)
missing_train.head(10)
Missing Values % of total values
bureau_AMT_ANNUITY_max 227502 73.981744
bureau_AMT_ANNUITY_mean 227502 73.981744
bureau_AMT_ANNUITY_min 227502 73.981744
BUREAU_STATUS_3_SUM_MIN 215280 70.007252
BUREAU_STATUS_3_MEAN_MAX 215280 70.007252
BUREAU_STATUS_2_SUM_MIN 215280 70.007252
BUREAU_STATUS_2_MEAN_MIN 215280 70.007252
BUREAU_STATUS_2_MEAN_MEAN 215280 70.007252
BUREAU_STATUS_2_MEAN_MAX 215280 70.007252
BUREAU_STATUS_2_SUM_MEAN 215280 70.007252

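Columns with too many missing values are candidates for removal, for example those missing in more than 90% of rows. Here no column exceeds that threshold.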

missing_90_columns = missing_train.index[missing_train['% of total values'] > 90]
missing_90_columns
Index([], dtype='str')
missing_values_table(test)
Missing Values % of total values
COMMONAREA_AVG 33495 68.716150
COMMONAREA_MODE 33495 68.716150
COMMONAREA_MEDI 33495 68.716150
NONLIVINGAPARTMENTS_MODE 33347 68.412523
NONLIVINGAPARTMENTS_MEDI 33347 68.412523
... ... ...
OBS_60_CNT_SOCIAL_CIRCLE 29 0.059495
DEF_60_CNT_SOCIAL_CIRCLE 29 0.059495
DEF_30_CNT_SOCIAL_CIRCLE 29 0.059495
AMT_ANNUITY 24 0.049237
EXT_SOURCE_2 8 0.016412

233 rows × 2 columns
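Dropping by missing share can be wrapped in a small function. The sketch below is an assumption about how such a step might look, not the notebook's own get_high_missing_columns; the name `high_missing_columns` and the toy frames are hypothetical. The columns are chosen from the training data only and the same list is then dropped from both sets, so the two stay aligned.

```python
import numpy as np
import pandas as pd

def high_missing_columns(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return the columns whose share of missing values exceeds `threshold`."""
    missing_share = df.isnull().mean()
    return missing_share[missing_share > threshold].index.tolist()

# Toy data: column "b" is entirely missing in the training set.
toy_train = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan] * 4,
    "TARGET": [0, 1, 0, 1],
})
toy_test = pd.DataFrame({"a": [5.0, np.nan], "b": [np.nan, 1.0]})

# Decide on the training set, apply to both.
to_drop = high_missing_columns(toy_train, threshold=0.9)
toy_train = toy_train.drop(columns=to_drop)
toy_test = toy_test.drop(columns=to_drop)
print(to_drop)
```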

Align train and test columns#

print(train.shape, test.shape)
(307511, 418) (48744, 417)
train_labels = train['TARGET']
train, test = train.align(test, join='inner', axis=1)
train['TARGET'] = train_labels
print(train.shape, test.shape)
(307511, 418) (48744, 417)

Correlation matrix#

%%time
corrs = train.corr()
CPU times: total: 2min
Wall time: 2min 1s
corrs_sorted = corrs.sort_values(by='TARGET', ascending=False)
corrs_sorted['TARGET'].head(10)
TARGET                              1.000000
bureau_DAYS_CREDIT_mean             0.089729
BUREAU_MONTHS_BALANCE_MIN_MEAN      0.089038
DAYS_BIRTH                          0.078239
bureau_CREDIT_ACTIVE_Active_mean    0.077356
BUREAU_MONTHS_BALANCE_MEAN_MEAN     0.076424
bureau_DAYS_CREDIT_min              0.075248
BUREAU_MONTHS_BALANCE_MIN_MIN       0.073225
BUREAU_MONTHS_BALANCE_SUM_MEAN      0.072606
bureau_DAYS_CREDIT_UPDATE_mean      0.068927
Name: TARGET, dtype: float64
corrs_sorted['TARGET'].tail(10)
NAME_INCOME_TYPE_Pensioner             -0.046209
CODE_GENDER_F                          -0.054704
BUREAU_STATUS_C_MEAN_MEAN              -0.055936
NAME_EDUCATION_TYPE_Higher education   -0.056593
BUREAU_STATUS_C_SUM_MAX                -0.061083
BUREAU_STATUS_C_SUM_MEAN               -0.062954
bureau_CREDIT_ACTIVE_Closed_mean       -0.079369
EXT_SOURCE_1                           -0.155317
EXT_SOURCE_2                           -0.160472
EXT_SOURCE_3                           -0.178919
Name: TARGET, dtype: float64

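Some of our new features do correlate more strongly with TARGET than most of the original ones: bureau_DAYS_CREDIT_mean and BUREAU_MONTHS_BALANCE_MIN_MEAN now lead the positive side, ahead of DAYS_BIRTH.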

sns.kdeplot(
    data = train,
    x = 'bureau_CREDIT_ACTIVE_Active_mean',
    hue = 'TARGET',
    common_norm = False
)
<Axes: xlabel='bureau_CREDIT_ACTIVE_Active_mean', ylabel='Density'>
../../_images/d4d4b77316d0ae9a70ceb99dc1b223cd2b1d7eeb9c0a4adbfb9cdf26c779528a.png

Well, judging from this plot, it does not look very useful so far.

We can also remove some highly correlated variables.
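One common way to turn a correlation matrix into a drop list is to look only at the strict upper triangle, so that each pair is inspected once. The sketch below is an assumption about how such a step might look, not the notebook's own get_high_corr_columns; the function name `high_corr_columns` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def high_corr_columns(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """For each pair with |corr| above `threshold`, return the column listed later."""
    corr_abs = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    # and a column is never compared with itself.
    mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
    upper = corr_abs.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_scaled": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with x
    "z": [5.0, 1.0, 4.0, 2.0, 3.0],          # weakly correlated with x
})
to_drop = high_corr_columns(toy, threshold=0.8)
print(to_drop)
```

Excluding the diagonal by position rather than by value means that a pair of distinct columns with a correlation of exactly 1 is still detected.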

corrs.head()
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION ... BUREAU_STATUS_C_SUM_SUM BUREAU_STATUS_X_MEAN_MIN BUREAU_STATUS_X_MEAN_MAX BUREAU_STATUS_X_MEAN_MEAN BUREAU_STATUS_X_MEAN_SUM BUREAU_STATUS_X_SUM_MIN BUREAU_STATUS_X_SUM_MAX BUREAU_STATUS_X_SUM_MEAN BUREAU_STATUS_X_SUM_SUM TARGET
SK_ID_CURR 1.000000 -0.001129 -0.001820 -0.000343 -0.000433 -0.000232 0.000849 -0.001500 0.001366 -0.000973 ... 0.000252 -0.003102 0.003164 0.000689 0.001937 -0.003786 -0.002939 -0.003459 -0.000683 -0.002108
CNT_CHILDREN -0.001129 1.000000 0.012882 0.002145 0.021374 -0.001827 -0.025573 0.330938 -0.239818 0.183395 ... -0.005527 -0.001161 0.005020 0.002323 -0.001398 -0.003052 -0.003205 -0.004988 -0.003957 0.019187
AMT_INCOME_TOTAL -0.001820 0.012882 1.000000 0.156870 0.191657 0.159610 0.074796 0.027261 -0.064223 0.027805 ... 0.024610 -0.031615 0.074149 0.020127 0.026904 -0.021876 0.060925 0.022379 0.024900 -0.003982
AMT_CREDIT -0.000343 0.002145 0.156870 1.000000 0.770138 0.986968 0.099738 -0.055436 -0.066838 0.009621 ... 0.023609 0.003360 0.023895 0.019604 0.015181 0.014507 0.038554 0.036641 0.024139 -0.030369
AMT_ANNUITY -0.000433 0.021374 0.191657 0.770138 1.000000 0.775109 0.118429 0.009445 -0.104332 0.038514 ... 0.101982 -0.006178 0.019453 0.007627 0.076965 0.009223 0.036626 0.030085 0.077655 -0.012817

5 rows × 418 columns

corr_abs = corrs.abs()
threshold = 0.8
high_corr = {}
for col in corr_abs:
    high_corr[col] = list(corr_abs.index[(corr_abs[col] > threshold) & (corr_abs[col] != 1)])
high_corr
{'SK_ID_CURR': [],
 'CNT_CHILDREN': ['CNT_FAM_MEMBERS'],
 'AMT_INCOME_TOTAL': [],
 'AMT_CREDIT': ['AMT_GOODS_PRICE'],
 'AMT_ANNUITY': [],
 'AMT_GOODS_PRICE': ['AMT_CREDIT'],
 'REGION_POPULATION_RELATIVE': [],
 'DAYS_BIRTH': [],
 'DAYS_EMPLOYED': ['FLAG_EMP_PHONE',
  'NAME_INCOME_TYPE_Pensioner',
  'ORGANIZATION_TYPE_XNA'],
 'DAYS_REGISTRATION': [],
 'DAYS_ID_PUBLISH': [],
 'OWN_CAR_AGE': [],
 'FLAG_MOBIL': [],
 'FLAG_EMP_PHONE': ['DAYS_EMPLOYED',
  'NAME_INCOME_TYPE_Pensioner',
  'ORGANIZATION_TYPE_XNA'],
 'FLAG_WORK_PHONE': [],
 'FLAG_CONT_MOBILE': [],
 'FLAG_PHONE': [],
 'FLAG_EMAIL': [],
 'CNT_FAM_MEMBERS': ['CNT_CHILDREN'],
 'REGION_RATING_CLIENT': ['REGION_RATING_CLIENT_W_CITY'],
 'REGION_RATING_CLIENT_W_CITY': ['REGION_RATING_CLIENT'],
 'HOUR_APPR_PROCESS_START': [],
 'REG_REGION_NOT_LIVE_REGION': [],
 'REG_REGION_NOT_WORK_REGION': ['LIVE_REGION_NOT_WORK_REGION'],
 'LIVE_REGION_NOT_WORK_REGION': ['REG_REGION_NOT_WORK_REGION'],
 'REG_CITY_NOT_LIVE_CITY': [],
 'REG_CITY_NOT_WORK_CITY': ['LIVE_CITY_NOT_WORK_CITY'],
 'LIVE_CITY_NOT_WORK_CITY': ['REG_CITY_NOT_WORK_CITY'],
 'EXT_SOURCE_1': [],
 'EXT_SOURCE_2': [],
 'EXT_SOURCE_3': [],
 'APARTMENTS_AVG': ['ELEVATORS_AVG',
  'LIVINGAPARTMENTS_AVG',
  'LIVINGAREA_AVG',
  'APARTMENTS_MODE',
  'ELEVATORS_MODE',
  'LIVINGAPARTMENTS_MODE',
  'LIVINGAREA_MODE',
  'APARTMENTS_MEDI',
  'ELEVATORS_MEDI',
  'LIVINGAPARTMENTS_MEDI',
  'LIVINGAREA_MEDI',
  'TOTALAREA_MODE'],
 'BASEMENTAREA_AVG': ['BASEMENTAREA_MODE', 'BASEMENTAREA_MEDI'],
 'YEARS_BEGINEXPLUATATION_AVG': ['YEARS_BEGINEXPLUATATION_MODE',
  'YEARS_BEGINEXPLUATATION_MEDI'],
 'YEARS_BUILD_AVG': ['YEARS_BUILD_MODE', 'YEARS_BUILD_MEDI'],
 'COMMONAREA_AVG': ['COMMONAREA_MODE', 'COMMONAREA_MEDI'],
 'ELEVATORS_AVG': ['APARTMENTS_AVG',
  'LIVINGAPARTMENTS_AVG',
  'LIVINGAREA_AVG',
  'APARTMENTS_MODE',
  'ELEVATORS_MODE',
  'LIVINGAREA_MODE',
  'APARTMENTS_MEDI',
  'ELEVATORS_MEDI',
  'LIVINGAPARTMENTS_MEDI',
  'LIVINGAREA_MEDI',
  'TOTALAREA_MODE'],
 'ENTRANCES_AVG': ['ENTRANCES_MODE', 'ENTRANCES_MEDI'],
 'FLOORSMAX_AVG': ['FLOORSMAX_MODE', 'FLOORSMAX_MEDI'],
 'FLOORSMIN_AVG': ['FLOORSMIN_MODE', 'FLOORSMIN_MEDI'],
 'LANDAREA_AVG': ['LANDAREA_MODE', 'LANDAREA_MEDI'],
 'LIVINGAPARTMENTS_AVG': ['APARTMENTS_AVG',
  'ELEVATORS_AVG',
  'LIVINGAREA_AVG',
  'APARTMENTS_MODE',
  'LIVINGAPARTMENTS_MODE',
  'LIVINGAREA_MODE',
  'APARTMENTS_MEDI',
  'ELEVATORS_MEDI',
  'LIVINGAPARTMENTS_MEDI',
  'LIVINGAREA_MEDI',
  'TOTALAREA_MODE'],
 'LIVINGAREA_AVG': ['APARTMENTS_AVG',
  'ELEVATORS_AVG',
  'LIVINGAPARTMENTS_AVG',
  'APARTMENTS_MODE',
  'ELEVATORS_MODE',
  'LIVINGAPARTMENTS_MODE',
  'LIVINGAREA_MODE',
  'APARTMENTS_MEDI',
  'ELEVATORS_MEDI',
  'LIVINGAPARTMENTS_MEDI',
  'LIVINGAREA_MEDI',
  'TOTALAREA_MODE'],
 'NONLIVINGAPARTMENTS_AVG': ['NONLIVINGAPARTMENTS_MODE',
  'NONLIVINGAPARTMENTS_MEDI'],
 'NONLIVINGAREA_AVG': ['NONLIVINGAREA_MODE', 'NONLIVINGAREA_MEDI'],
 'APARTMENTS_MODE': ['APARTMENTS_AVG',
  'ELEVATORS_AVG',
  'LIVINGAPARTMENTS_AVG',
  'LIVINGAREA_AVG',
  'ELEVATORS_MODE',
  'LIVINGAPARTMENTS_MODE',
  'LIVINGAREA_MODE',
  'APARTMENTS_MEDI',
  'ELEVATORS_MEDI',
  'LIVINGAPARTMENTS_MEDI',
  'LIVINGAREA_MEDI',
  'TOTALAREA_MODE'],
 'BASEMENTAREA_MODE': ['BASEMENTAREA_AVG', 'BASEMENTAREA_MEDI'],
 'YEARS_BEGINEXPLUATATION_MODE': ['YEARS_BEGINEXPLUATATION_AVG',
  'YEARS_BEGINEXPLUATATION_MEDI'],
 'YEARS_BUILD_MODE': ['YEARS_BUILD_AVG', 'YEARS_BUILD_MEDI'],
 'COMMONAREA_MODE': ['COMMONAREA_AVG', 'COMMONAREA_MEDI'],
 'ELEVATORS_MODE': ['APARTMENTS_AVG',
  'ELEVATORS_AVG',
  'LIVINGAREA_AVG',
  'APARTMENTS_MODE',
  'LIVINGAPARTMENTS_MODE',
  'LIVINGAREA_MODE',
  'APARTMENTS_MEDI',
  'ELEVATORS_MEDI',
  'LIVINGAREA_MEDI',
  'TOTALAREA_MODE'],
 'ENTRANCES_MODE': ['ENTRANCES_AVG', 'ENTRANCES_MEDI'],
 'FLOORSMAX_MODE': ['FLOORSMAX_AVG', 'FLOORSMAX_MEDI'],
 'FLOORSMIN_MODE': ['FLOORSMIN_AVG', 'FLOORSMIN_MEDI'],
 'LANDAREA_MODE': ['LANDAREA_AVG', 'LANDAREA_MEDI'],
 'LIVINGAPARTMENTS_MODE': ['APARTMENTS_AVG',
  'LIVINGAPARTMENTS_AVG',
  'LIVINGAREA_AVG',
  'APARTMENTS_MODE',
  'ELEVATORS_MODE',
  'LIVINGAREA_MODE',
  'APARTMENTS_MEDI',
  'LIVINGAPARTMENTS_MEDI',
  'LIVINGAREA_MEDI',
  'TOTALAREA_MODE'],
 'LIVINGAREA_MODE': ['APARTMENTS_AVG',
  'ELEVATORS_AVG',
  'LIVINGAPARTMENTS_AVG',
  'LIVINGAREA_AVG',
  'APARTMENTS_MODE',
  'ELEVATORS_MODE',
  'LIVINGAPARTMENTS_MODE',
  'APARTMENTS_MEDI',
  'ELEVATORS_MEDI',
  'LIVINGAPARTMENTS_MEDI',
  'LIVINGAREA_MEDI',
  'TOTALAREA_MODE'],
 'NONLIVINGAPARTMENTS_MODE': ['NONLIVINGAPARTMENTS_AVG',
  'NONLIVINGAPARTMENTS_MEDI'],
 'NONLIVINGAREA_MODE': ['NONLIVINGAREA_AVG', 'NONLIVINGAREA_MEDI'],
 'APARTMENTS_MEDI': ['APARTMENTS_AVG',
  'ELEVATORS_AVG',
  'LIVINGAPARTMENTS_AVG',
  'LIVINGAREA_AVG',
  'APARTMENTS_MODE',
  'ELEVATORS_MODE',
  'LIVINGAPARTMENTS_MODE',
  'LIVINGAREA_MODE',
  'ELEVATORS_MEDI',
  'LIVINGAPARTMENTS_MEDI',
  'LIVINGAREA_MEDI',
  'TOTALAREA_MODE'],
 'BASEMENTAREA_MEDI': ['BASEMENTAREA_AVG', 'BASEMENTAREA_MODE'],
 'YEARS_BEGINEXPLUATATION_MEDI': ['YEARS_BEGINEXPLUATATION_AVG',
  'YEARS_BEGINEXPLUATATION_MODE'],
 'YEARS_BUILD_MEDI': ['YEARS_BUILD_AVG', 'YEARS_BUILD_MODE'],
 'COMMONAREA_MEDI': ['COMMONAREA_AVG', 'COMMONAREA_MODE'],
 'ELEVATORS_MEDI': ['APARTMENTS_AVG',
  'ELEVATORS_AVG',
  'LIVINGAPARTMENTS_AVG',
  'LIVINGAREA_AVG',
  'APARTMENTS_MODE',
  'ELEVATORS_MODE',
  'LIVINGAREA_MODE',
  'APARTMENTS_MEDI',
  'LIVINGAPARTMENTS_MEDI',
  'LIVINGAREA_MEDI',
  'TOTALAREA_MODE'],
 'ENTRANCES_MEDI': ['ENTRANCES_AVG', 'ENTRANCES_MODE'],
 'FLOORSMAX_MEDI': ['FLOORSMAX_AVG', 'FLOORSMAX_MODE'],
 'FLOORSMIN_MEDI': ['FLOORSMIN_AVG', 'FLOORSMIN_MODE'],
 'LANDAREA_MEDI': ['LANDAREA_AVG', 'LANDAREA_MODE'],
 'LIVINGAPARTMENTS_MEDI': ['APARTMENTS_AVG',
  'ELEVATORS_AVG',
  'LIVINGAPARTMENTS_AVG',
  'LIVINGAREA_AVG',
  'APARTMENTS_MODE',
  'LIVINGAPARTMENTS_MODE',
  'LIVINGAREA_MODE',
  'APARTMENTS_MEDI',
  'ELEVATORS_MEDI',
  'LIVINGAREA_MEDI',
  'TOTALAREA_MODE'],
 'LIVINGAREA_MEDI': ['APARTMENTS_AVG',
  'ELEVATORS_AVG',
  'LIVINGAPARTMENTS_AVG',
  'LIVINGAREA_AVG',
  'APARTMENTS_MODE',
  'ELEVATORS_MODE',
  'LIVINGAPARTMENTS_MODE',
  'LIVINGAREA_MODE',
  'APARTMENTS_MEDI',
  'ELEVATORS_MEDI',
  'LIVINGAPARTMENTS_MEDI',
  'TOTALAREA_MODE'],
 'NONLIVINGAPARTMENTS_MEDI': ['NONLIVINGAPARTMENTS_AVG',
  'NONLIVINGAPARTMENTS_MODE'],
 'NONLIVINGAREA_MEDI': ['NONLIVINGAREA_AVG', 'NONLIVINGAREA_MODE'],
 'TOTALAREA_MODE': ['APARTMENTS_AVG',
  'ELEVATORS_AVG',
  'LIVINGAPARTMENTS_AVG',
  'LIVINGAREA_AVG',
  'APARTMENTS_MODE',
  'ELEVATORS_MODE',
  'LIVINGAPARTMENTS_MODE',
  'LIVINGAREA_MODE',
  'APARTMENTS_MEDI',
  'ELEVATORS_MEDI',
  'LIVINGAPARTMENTS_MEDI',
  'LIVINGAREA_MEDI'],
 'OBS_30_CNT_SOCIAL_CIRCLE': ['OBS_60_CNT_SOCIAL_CIRCLE'],
 'DEF_30_CNT_SOCIAL_CIRCLE': ['DEF_60_CNT_SOCIAL_CIRCLE'],
 'OBS_60_CNT_SOCIAL_CIRCLE': ['OBS_30_CNT_SOCIAL_CIRCLE'],
 'DEF_60_CNT_SOCIAL_CIRCLE': ['DEF_30_CNT_SOCIAL_CIRCLE'],
 'DAYS_LAST_PHONE_CHANGE': [],
 'FLAG_DOCUMENT_2': [],
 'FLAG_DOCUMENT_3': [],
 'FLAG_DOCUMENT_4': [],
 'FLAG_DOCUMENT_5': [],
 'FLAG_DOCUMENT_6': [],
 'FLAG_DOCUMENT_7': [],
 'FLAG_DOCUMENT_8': [],
 'FLAG_DOCUMENT_9': [],
 'FLAG_DOCUMENT_10': [],
 'FLAG_DOCUMENT_11': [],
 'FLAG_DOCUMENT_12': [],
 'FLAG_DOCUMENT_13': [],
 'FLAG_DOCUMENT_14': [],
 'FLAG_DOCUMENT_15': [],
 'FLAG_DOCUMENT_16': [],
 'FLAG_DOCUMENT_17': [],
 'FLAG_DOCUMENT_18': [],
 'FLAG_DOCUMENT_19': [],
 'FLAG_DOCUMENT_20': [],
 'FLAG_DOCUMENT_21': [],
 'AMT_REQ_CREDIT_BUREAU_HOUR': [],
 'AMT_REQ_CREDIT_BUREAU_DAY': [],
 'AMT_REQ_CREDIT_BUREAU_WEEK': [],
 'AMT_REQ_CREDIT_BUREAU_MON': [],
 'AMT_REQ_CREDIT_BUREAU_QRT': [],
 'AMT_REQ_CREDIT_BUREAU_YEAR': [],
 'NAME_CONTRACT_TYPE_Cash loans': [],
 'NAME_CONTRACT_TYPE_Revolving loans': [],
 'CODE_GENDER_F': ['CODE_GENDER_M'],
 'CODE_GENDER_M': ['CODE_GENDER_F'],
 'FLAG_OWN_CAR_N': [],
 'FLAG_OWN_CAR_Y': [],
 'FLAG_OWN_REALTY_N': ['FLAG_OWN_REALTY_Y'],
 'FLAG_OWN_REALTY_Y': ['FLAG_OWN_REALTY_N'],
 'NAME_TYPE_SUITE_Children': [],
 'NAME_TYPE_SUITE_Family': [],
 'NAME_TYPE_SUITE_Group of people': [],
 'NAME_TYPE_SUITE_Other_A': [],
 'NAME_TYPE_SUITE_Other_B': [],
 'NAME_TYPE_SUITE_Spouse, partner': [],
 'NAME_TYPE_SUITE_Unaccompanied': [],
 'NAME_INCOME_TYPE_Businessman': [],
 'NAME_INCOME_TYPE_Commercial associate': [],
 'NAME_INCOME_TYPE_Pensioner': ['DAYS_EMPLOYED',
  'FLAG_EMP_PHONE',
  'ORGANIZATION_TYPE_XNA'],
 'NAME_INCOME_TYPE_State servant': [],
 'NAME_INCOME_TYPE_Student': [],
 'NAME_INCOME_TYPE_Unemployed': [],
 'NAME_INCOME_TYPE_Working': [],
 'NAME_EDUCATION_TYPE_Academic degree': [],
 'NAME_EDUCATION_TYPE_Higher education': ['NAME_EDUCATION_TYPE_Secondary / secondary special'],
 'NAME_EDUCATION_TYPE_Incomplete higher': [],
 'NAME_EDUCATION_TYPE_Lower secondary': [],
 'NAME_EDUCATION_TYPE_Secondary / secondary special': ['NAME_EDUCATION_TYPE_Higher education'],
 'NAME_FAMILY_STATUS_Civil marriage': [],
 'NAME_FAMILY_STATUS_Married': [],
 'NAME_FAMILY_STATUS_Separated': [],
 'NAME_FAMILY_STATUS_Single / not married': [],
 'NAME_FAMILY_STATUS_Widow': [],
 'NAME_HOUSING_TYPE_Co-op apartment': [],
 'NAME_HOUSING_TYPE_House / apartment': [],
 'NAME_HOUSING_TYPE_Municipal apartment': [],
 'NAME_HOUSING_TYPE_Office apartment': [],
 'NAME_HOUSING_TYPE_Rented apartment': [],
 'NAME_HOUSING_TYPE_With parents': [],
 'OCCUPATION_TYPE_Accountants': [],
 'OCCUPATION_TYPE_Cleaning staff': [],
 'OCCUPATION_TYPE_Cooking staff': [],
 'OCCUPATION_TYPE_Core staff': [],
 'OCCUPATION_TYPE_Drivers': [],
 'OCCUPATION_TYPE_HR staff': [],
 'OCCUPATION_TYPE_High skill tech staff': [],
 'OCCUPATION_TYPE_IT staff': [],
 'OCCUPATION_TYPE_Laborers': [],
 'OCCUPATION_TYPE_Low-skill Laborers': [],
 'OCCUPATION_TYPE_Managers': [],
 'OCCUPATION_TYPE_Medicine staff': [],
 'OCCUPATION_TYPE_Private service staff': [],
 'OCCUPATION_TYPE_Realty agents': [],
 'OCCUPATION_TYPE_Sales staff': [],
 'OCCUPATION_TYPE_Secretaries': [],
 'OCCUPATION_TYPE_Security staff': [],
 'OCCUPATION_TYPE_Waiters/barmen staff': [],
 'WEEKDAY_APPR_PROCESS_START_FRIDAY': [],
 'WEEKDAY_APPR_PROCESS_START_MONDAY': [],
 'WEEKDAY_APPR_PROCESS_START_SATURDAY': [],
 'WEEKDAY_APPR_PROCESS_START_SUNDAY': [],
 'WEEKDAY_APPR_PROCESS_START_THURSDAY': [],
 'WEEKDAY_APPR_PROCESS_START_TUESDAY': [],
 'WEEKDAY_APPR_PROCESS_START_WEDNESDAY': [],
 'ORGANIZATION_TYPE_Advertising': [],
 'ORGANIZATION_TYPE_Agriculture': [],
 'ORGANIZATION_TYPE_Bank': [],
 'ORGANIZATION_TYPE_Business Entity Type 1': [],
 'ORGANIZATION_TYPE_Business Entity Type 2': [],
 'ORGANIZATION_TYPE_Business Entity Type 3': [],
 'ORGANIZATION_TYPE_Cleaning': [],
 'ORGANIZATION_TYPE_Construction': [],
 'ORGANIZATION_TYPE_Culture': [],
 'ORGANIZATION_TYPE_Electricity': [],
 'ORGANIZATION_TYPE_Emergency': [],
 'ORGANIZATION_TYPE_Government': [],
 'ORGANIZATION_TYPE_Hotel': [],
 'ORGANIZATION_TYPE_Housing': [],
 'ORGANIZATION_TYPE_Industry: type 1': [],
 'ORGANIZATION_TYPE_Industry: type 10': [],
 'ORGANIZATION_TYPE_Industry: type 11': [],
 'ORGANIZATION_TYPE_Industry: type 12': [],
 'ORGANIZATION_TYPE_Industry: type 13': [],
 'ORGANIZATION_TYPE_Industry: type 2': [],
 'ORGANIZATION_TYPE_Industry: type 3': [],
 'ORGANIZATION_TYPE_Industry: type 4': [],
 'ORGANIZATION_TYPE_Industry: type 5': [],
 'ORGANIZATION_TYPE_Industry: type 6': [],
 'ORGANIZATION_TYPE_Industry: type 7': [],
 'ORGANIZATION_TYPE_Industry: type 8': [],
 'ORGANIZATION_TYPE_Industry: type 9': [],
 'ORGANIZATION_TYPE_Insurance': [],
 'ORGANIZATION_TYPE_Kindergarten': [],
 'ORGANIZATION_TYPE_Legal Services': [],
 'ORGANIZATION_TYPE_Medicine': [],
 'ORGANIZATION_TYPE_Military': [],
 'ORGANIZATION_TYPE_Mobile': [],
 'ORGANIZATION_TYPE_Other': [],
 'ORGANIZATION_TYPE_Police': [],
 'ORGANIZATION_TYPE_Postal': [],
 'ORGANIZATION_TYPE_Realtor': [],
 'ORGANIZATION_TYPE_Religion': [],
 'ORGANIZATION_TYPE_Restaurant': [],
 'ORGANIZATION_TYPE_School': [],
 'ORGANIZATION_TYPE_Security': [],
 'ORGANIZATION_TYPE_Security Ministries': [],
 'ORGANIZATION_TYPE_Self-employed': [],
 'ORGANIZATION_TYPE_Services': [],
 'ORGANIZATION_TYPE_Telecom': [],
 'ORGANIZATION_TYPE_Trade: type 1': [],
 'ORGANIZATION_TYPE_Trade: type 2': [],
 'ORGANIZATION_TYPE_Trade: type 3': [],
 'ORGANIZATION_TYPE_Trade: type 4': [],
 'ORGANIZATION_TYPE_Trade: type 5': [],
 'ORGANIZATION_TYPE_Trade: type 6': [],
 'ORGANIZATION_TYPE_Trade: type 7': [],
 'ORGANIZATION_TYPE_Transport: type 1': [],
 'ORGANIZATION_TYPE_Transport: type 2': [],
 'ORGANIZATION_TYPE_Transport: type 3': [],
 'ORGANIZATION_TYPE_Transport: type 4': [],
 'ORGANIZATION_TYPE_University': [],
 'ORGANIZATION_TYPE_XNA': ['DAYS_EMPLOYED',
  'FLAG_EMP_PHONE',
  'NAME_INCOME_TYPE_Pensioner'],
 'FONDKAPREMONT_MODE_not specified': [],
 'FONDKAPREMONT_MODE_org spec account': [],
 'FONDKAPREMONT_MODE_reg oper account': [],
 'FONDKAPREMONT_MODE_reg oper spec account': [],
 'HOUSETYPE_MODE_block of flats': ['EMERGENCYSTATE_MODE_No'],
 'HOUSETYPE_MODE_specific housing': [],
 'HOUSETYPE_MODE_terraced house': [],
 'WALLSMATERIAL_MODE_Block': [],
 'WALLSMATERIAL_MODE_Mixed': [],
 'WALLSMATERIAL_MODE_Monolithic': [],
 'WALLSMATERIAL_MODE_Others': [],
 'WALLSMATERIAL_MODE_Panel': [],
 'WALLSMATERIAL_MODE_Stone, brick': [],
 'WALLSMATERIAL_MODE_Wooden': [],
 'EMERGENCYSTATE_MODE_No': ['HOUSETYPE_MODE_block of flats'],
 'EMERGENCYSTATE_MODE_Yes': [],
 'bureau_CREDIT_ACTIVE_Active_sum': [],
 'bureau_CREDIT_ACTIVE_Active_mean': ['bureau_CREDIT_ACTIVE_Closed_mean'],
 'bureau_CREDIT_ACTIVE_Bad debt_sum': ['bureau_CREDIT_ACTIVE_Bad debt_mean'],
 'bureau_CREDIT_ACTIVE_Bad debt_mean': ['bureau_CREDIT_ACTIVE_Bad debt_sum'],
 'bureau_CREDIT_ACTIVE_Closed_sum': ['bureau_CREDIT_CURRENCY_currency 1_sum',
  'bureau_CREDIT_TYPE_Consumer credit_sum',
  'bureau_previous_loan_counts',
  'bureau_DAYS_CREDIT_sum',
  'bureau_DAYS_ENDDATE_FACT_sum',
  'bureau_DAYS_CREDIT_UPDATE_sum'],
 'bureau_CREDIT_ACTIVE_Closed_mean': ['bureau_CREDIT_ACTIVE_Active_mean'],
 'bureau_CREDIT_ACTIVE_Sold_sum': [],
 'bureau_CREDIT_ACTIVE_Sold_mean': [],
 'bureau_CREDIT_CURRENCY_currency 1_sum': ['bureau_CREDIT_ACTIVE_Closed_sum',
  'bureau_CREDIT_TYPE_Consumer credit_sum',
  'bureau_previous_loan_counts',
  'bureau_DAYS_CREDIT_sum',
  'bureau_DAYS_ENDDATE_FACT_sum'],
 'bureau_CREDIT_CURRENCY_currency 1_mean': ['bureau_CREDIT_CURRENCY_currency 2_mean'],
 'bureau_CREDIT_CURRENCY_currency 2_sum': [],
 'bureau_CREDIT_CURRENCY_currency 2_mean': ['bureau_CREDIT_CURRENCY_currency 1_mean'],
 'bureau_CREDIT_CURRENCY_currency 3_sum': [],
 'bureau_CREDIT_CURRENCY_currency 3_mean': [],
 'bureau_CREDIT_CURRENCY_currency 4_sum': ['bureau_CREDIT_CURRENCY_currency 4_mean'],
 'bureau_CREDIT_CURRENCY_currency 4_mean': ['bureau_CREDIT_CURRENCY_currency 4_sum'],
 'bureau_CREDIT_TYPE_Another type of loan_sum': [],
 'bureau_CREDIT_TYPE_Another type of loan_mean': [],
 'bureau_CREDIT_TYPE_Car loan_sum': [],
 'bureau_CREDIT_TYPE_Car loan_mean': [],
 'bureau_CREDIT_TYPE_Cash loan (non-earmarked)_sum': ['bureau_CREDIT_TYPE_Cash loan (non-earmarked)_mean'],
 'bureau_CREDIT_TYPE_Cash loan (non-earmarked)_mean': ['bureau_CREDIT_TYPE_Cash loan (non-earmarked)_sum'],
 'bureau_CREDIT_TYPE_Consumer credit_sum': ['bureau_CREDIT_ACTIVE_Closed_sum',
  'bureau_CREDIT_CURRENCY_currency 1_sum',
  'bureau_previous_loan_counts',
  'bureau_DAYS_CREDIT_sum',
  'bureau_DAYS_ENDDATE_FACT_sum'],
 'bureau_CREDIT_TYPE_Consumer credit_mean': ['bureau_CREDIT_TYPE_Credit card_mean'],
 'bureau_CREDIT_TYPE_Credit card_sum': [],
 'bureau_CREDIT_TYPE_Credit card_mean': ['bureau_CREDIT_TYPE_Consumer credit_mean'],
 'bureau_CREDIT_TYPE_Interbank credit_sum': ['bureau_CREDIT_TYPE_Interbank credit_mean'],
 'bureau_CREDIT_TYPE_Interbank credit_mean': ['bureau_CREDIT_TYPE_Interbank credit_sum'],
 'bureau_CREDIT_TYPE_Loan for business development_sum': [],
 'bureau_CREDIT_TYPE_Loan for business development_mean': [],
 'bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_sum': ['bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_mean'],
 'bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_mean': ['bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_sum'],
 'bureau_CREDIT_TYPE_Loan for the purchase of equipment_sum': [],
 'bureau_CREDIT_TYPE_Loan for the purchase of equipment_mean': [],
 'bureau_CREDIT_TYPE_Loan for working capital replenishment_sum': [],
 'bureau_CREDIT_TYPE_Loan for working capital replenishment_mean': [],
 'bureau_CREDIT_TYPE_Microloan_sum': [],
 'bureau_CREDIT_TYPE_Microloan_mean': [],
 'bureau_CREDIT_TYPE_Mobile operator loan_sum': [],
 'bureau_CREDIT_TYPE_Mobile operator loan_mean': [],
 'bureau_CREDIT_TYPE_Mortgage_sum': [],
 'bureau_CREDIT_TYPE_Mortgage_mean': [],
 'bureau_CREDIT_TYPE_Real estate loan_sum': [],
 'bureau_CREDIT_TYPE_Real estate loan_mean': [],
 'bureau_CREDIT_TYPE_Unknown type of loan_sum': [],
 'bureau_CREDIT_TYPE_Unknown type of loan_mean': [],
 'bureau_previous_loan_counts': ['bureau_CREDIT_ACTIVE_Closed_sum',
  'bureau_CREDIT_CURRENCY_currency 1_sum',
  'bureau_CREDIT_TYPE_Consumer credit_sum',
  'bureau_DAYS_CREDIT_sum',
  'bureau_DAYS_ENDDATE_FACT_sum'],
 'bureau_DAYS_CREDIT_min': ['bureau_DAYS_CREDIT_mean',
  'bureau_DAYS_ENDDATE_FACT_min',
  'BUREAU_MONTHS_BALANCE_MIN_MIN',
  'BUREAU_MONTHS_BALANCE_MEAN_MIN',
  'BUREAU_MONTHS_BALANCE_SUM_MIN'],
 'bureau_DAYS_CREDIT_mean': ['bureau_DAYS_CREDIT_min',
  'BUREAU_MONTHS_BALANCE_MIN_MEAN',
  'BUREAU_MONTHS_BALANCE_MEAN_MEAN',
  'BUREAU_MONTHS_BALANCE_SUM_MEAN'],
 'bureau_DAYS_CREDIT_max': ['BUREAU_MONTHS_BALANCE_MIN_MAX',
  'BUREAU_MONTHS_BALANCE_MEAN_MAX',
  'BUREAU_MONTHS_BALANCE_SUM_MAX'],
 'bureau_DAYS_CREDIT_sum': ['bureau_CREDIT_ACTIVE_Closed_sum',
  'bureau_CREDIT_CURRENCY_currency 1_sum',
  'bureau_CREDIT_TYPE_Consumer credit_sum',
  'bureau_previous_loan_counts',
  'bureau_DAYS_ENDDATE_FACT_sum',
  'bureau_DAYS_CREDIT_UPDATE_sum'],
 'bureau_CREDIT_DAY_OVERDUE_min': [],
 'bureau_CREDIT_DAY_OVERDUE_mean': [],
 'bureau_CREDIT_DAY_OVERDUE_max': ['bureau_CREDIT_DAY_OVERDUE_sum'],
 'bureau_CREDIT_DAY_OVERDUE_sum': ['bureau_CREDIT_DAY_OVERDUE_max'],
 'bureau_DAYS_CREDIT_ENDDATE_min': [],
 'bureau_DAYS_CREDIT_ENDDATE_mean': [],
 'bureau_DAYS_CREDIT_ENDDATE_max': ['bureau_DAYS_CREDIT_ENDDATE_sum'],
 'bureau_DAYS_CREDIT_ENDDATE_sum': ['bureau_DAYS_CREDIT_ENDDATE_max'],
 'bureau_DAYS_ENDDATE_FACT_min': ['bureau_DAYS_CREDIT_min',
  'bureau_DAYS_ENDDATE_FACT_mean',
  'BUREAU_MONTHS_BALANCE_MIN_MIN'],
 'bureau_DAYS_ENDDATE_FACT_mean': ['bureau_DAYS_ENDDATE_FACT_min'],
 'bureau_DAYS_ENDDATE_FACT_max': [],
 'bureau_DAYS_ENDDATE_FACT_sum': ['bureau_CREDIT_ACTIVE_Closed_sum',
  'bureau_CREDIT_CURRENCY_currency 1_sum',
  'bureau_CREDIT_TYPE_Consumer credit_sum',
  'bureau_previous_loan_counts',
  'bureau_DAYS_CREDIT_sum',
  'bureau_DAYS_CREDIT_UPDATE_sum'],
 'bureau_AMT_CREDIT_MAX_OVERDUE_min': ['bureau_AMT_CREDIT_MAX_OVERDUE_mean'],
 'bureau_AMT_CREDIT_MAX_OVERDUE_mean': ['bureau_AMT_CREDIT_MAX_OVERDUE_min',
  'bureau_AMT_CREDIT_MAX_OVERDUE_max',
  'bureau_AMT_CREDIT_MAX_OVERDUE_sum'],
 'bureau_AMT_CREDIT_MAX_OVERDUE_max': ['bureau_AMT_CREDIT_MAX_OVERDUE_mean',
  'bureau_AMT_CREDIT_MAX_OVERDUE_sum'],
 'bureau_AMT_CREDIT_MAX_OVERDUE_sum': ['bureau_AMT_CREDIT_MAX_OVERDUE_mean',
  'bureau_AMT_CREDIT_MAX_OVERDUE_max'],
 'bureau_CNT_CREDIT_PROLONG_min': [],
 'bureau_CNT_CREDIT_PROLONG_mean': [],
 'bureau_CNT_CREDIT_PROLONG_max': ['bureau_CNT_CREDIT_PROLONG_sum'],
 'bureau_CNT_CREDIT_PROLONG_sum': ['bureau_CNT_CREDIT_PROLONG_max'],
 'bureau_AMT_CREDIT_SUM_min': [],
 'bureau_AMT_CREDIT_SUM_mean': ['bureau_AMT_CREDIT_SUM_max'],
 'bureau_AMT_CREDIT_SUM_max': ['bureau_AMT_CREDIT_SUM_mean',
  'bureau_AMT_CREDIT_SUM_sum'],
 'bureau_AMT_CREDIT_SUM_sum': ['bureau_AMT_CREDIT_SUM_max'],
 'bureau_AMT_CREDIT_SUM_DEBT_min': [],
 'bureau_AMT_CREDIT_SUM_DEBT_mean': [],
 'bureau_AMT_CREDIT_SUM_DEBT_max': ['bureau_AMT_CREDIT_SUM_DEBT_sum'],
 'bureau_AMT_CREDIT_SUM_DEBT_sum': ['bureau_AMT_CREDIT_SUM_DEBT_max'],
 'bureau_AMT_CREDIT_SUM_LIMIT_min': [],
 'bureau_AMT_CREDIT_SUM_LIMIT_mean': [],
 'bureau_AMT_CREDIT_SUM_LIMIT_max': ['bureau_AMT_CREDIT_SUM_LIMIT_sum'],
 'bureau_AMT_CREDIT_SUM_LIMIT_sum': ['bureau_AMT_CREDIT_SUM_LIMIT_max'],
 'bureau_AMT_CREDIT_SUM_OVERDUE_min': [],
 'bureau_AMT_CREDIT_SUM_OVERDUE_mean': [],
 'bureau_AMT_CREDIT_SUM_OVERDUE_max': ['bureau_AMT_CREDIT_SUM_OVERDUE_sum'],
 'bureau_AMT_CREDIT_SUM_OVERDUE_sum': ['bureau_AMT_CREDIT_SUM_OVERDUE_max'],
 'bureau_DAYS_CREDIT_UPDATE_min': [],
 'bureau_DAYS_CREDIT_UPDATE_mean': [],
 'bureau_DAYS_CREDIT_UPDATE_max': [],
 'bureau_DAYS_CREDIT_UPDATE_sum': ['bureau_CREDIT_ACTIVE_Closed_sum',
  'bureau_DAYS_CREDIT_sum',
  'bureau_DAYS_ENDDATE_FACT_sum'],
 'bureau_AMT_ANNUITY_min': [],
 'bureau_AMT_ANNUITY_mean': [],
 'bureau_AMT_ANNUITY_max': ['bureau_AMT_ANNUITY_sum'],
 'bureau_AMT_ANNUITY_sum': ['bureau_AMT_ANNUITY_max'],
 'BUREAU_MONTHS_BALANCE_MIN_MIN': ['bureau_DAYS_CREDIT_min',
  'bureau_DAYS_ENDDATE_FACT_min',
  'BUREAU_MONTHS_BALANCE_MEAN_MIN',
  'BUREAU_MONTHS_BALANCE_SUM_MIN'],
 'BUREAU_MONTHS_BALANCE_MIN_MAX': ['bureau_DAYS_CREDIT_max',
  'BUREAU_MONTHS_BALANCE_MEAN_MAX',
  'BUREAU_MONTHS_BALANCE_SUM_MAX'],
 'BUREAU_MONTHS_BALANCE_MIN_MEAN': ['bureau_DAYS_CREDIT_mean',
  'BUREAU_MONTHS_BALANCE_MEAN_MEAN',
  'BUREAU_MONTHS_BALANCE_SUM_MEAN'],
 'BUREAU_MONTHS_BALANCE_MIN_SUM': ['BUREAU_MONTHS_BALANCE_MEAN_SUM',
  'BUREAU_MONTHS_BALANCE_SUM_SUM',
  'BUREAU_STATUS_0_SUM_SUM',
  'BUREAU_STATUS_C_MEAN_SUM',
  'BUREAU_STATUS_C_SUM_SUM'],
 'BUREAU_MONTHS_BALANCE_MAX_MIN': ['BUREAU_MONTHS_BALANCE_MAX_MEAN',
  'BUREAU_MONTHS_BALANCE_MEAN_MIN'],
 'BUREAU_MONTHS_BALANCE_MAX_MAX': [],
 'BUREAU_MONTHS_BALANCE_MAX_MEAN': ['BUREAU_MONTHS_BALANCE_MAX_MIN'],
 'BUREAU_MONTHS_BALANCE_MAX_SUM': ['BUREAU_MONTHS_BALANCE_MEAN_SUM'],
 'BUREAU_MONTHS_BALANCE_MEAN_MIN': ['bureau_DAYS_CREDIT_min',
  'BUREAU_MONTHS_BALANCE_MIN_MIN',
  'BUREAU_MONTHS_BALANCE_MAX_MIN'],
 'BUREAU_MONTHS_BALANCE_MEAN_MAX': ['bureau_DAYS_CREDIT_max',
  'BUREAU_MONTHS_BALANCE_MIN_MAX',
  'BUREAU_MONTHS_BALANCE_SUM_MAX'],
 'BUREAU_MONTHS_BALANCE_MEAN_MEAN': ['bureau_DAYS_CREDIT_mean',
  'BUREAU_MONTHS_BALANCE_MIN_MEAN'],
 'BUREAU_MONTHS_BALANCE_MEAN_SUM': ['BUREAU_MONTHS_BALANCE_MIN_SUM',
  'BUREAU_MONTHS_BALANCE_MAX_SUM',
  'BUREAU_MONTHS_BALANCE_SUM_SUM'],
 'BUREAU_MONTHS_BALANCE_SUM_MIN': ['bureau_DAYS_CREDIT_min',
  'BUREAU_MONTHS_BALANCE_MIN_MIN',
  'BUREAU_MONTHS_BALANCE_SUM_MEAN'],
 'BUREAU_MONTHS_BALANCE_SUM_MAX': ['bureau_DAYS_CREDIT_max',
  'BUREAU_MONTHS_BALANCE_MIN_MAX',
  'BUREAU_MONTHS_BALANCE_MEAN_MAX'],
 'BUREAU_MONTHS_BALANCE_SUM_MEAN': ['bureau_DAYS_CREDIT_mean',
  'BUREAU_MONTHS_BALANCE_MIN_MEAN',
  'BUREAU_MONTHS_BALANCE_SUM_MIN'],
 'BUREAU_MONTHS_BALANCE_SUM_SUM': ['BUREAU_MONTHS_BALANCE_MIN_SUM',
  'BUREAU_MONTHS_BALANCE_MEAN_SUM',
  'BUREAU_STATUS_C_SUM_SUM'],
 'BUREAU_STATUS_0_MEAN_MIN': [],
 'BUREAU_STATUS_0_MEAN_MAX': [],
 'BUREAU_STATUS_0_MEAN_MEAN': [],
 'BUREAU_STATUS_0_MEAN_SUM': ['BUREAU_STATUS_0_SUM_SUM'],
 'BUREAU_STATUS_0_SUM_MIN': [],
 'BUREAU_STATUS_0_SUM_MAX': [],
 'BUREAU_STATUS_0_SUM_MEAN': [],
 'BUREAU_STATUS_0_SUM_SUM': ['BUREAU_MONTHS_BALANCE_MIN_SUM',
  'BUREAU_STATUS_0_MEAN_SUM'],
 'BUREAU_STATUS_1_MEAN_MIN': [],
 'BUREAU_STATUS_1_MEAN_MAX': ['BUREAU_STATUS_1_MEAN_SUM'],
 'BUREAU_STATUS_1_MEAN_MEAN': [],
 'BUREAU_STATUS_1_MEAN_SUM': ['BUREAU_STATUS_1_MEAN_MAX',
  'BUREAU_STATUS_1_SUM_SUM'],
 'BUREAU_STATUS_1_SUM_MIN': [],
 'BUREAU_STATUS_1_SUM_MAX': ['BUREAU_STATUS_1_SUM_SUM'],
 'BUREAU_STATUS_1_SUM_MEAN': [],
 'BUREAU_STATUS_1_SUM_SUM': ['BUREAU_STATUS_1_MEAN_SUM',
  'BUREAU_STATUS_1_SUM_MAX'],
 'BUREAU_STATUS_2_MEAN_MIN': [],
 'BUREAU_STATUS_2_MEAN_MAX': ['BUREAU_STATUS_2_MEAN_MEAN',
  'BUREAU_STATUS_2_MEAN_SUM'],
 'BUREAU_STATUS_2_MEAN_MEAN': ['BUREAU_STATUS_2_MEAN_MAX'],
 'BUREAU_STATUS_2_MEAN_SUM': ['BUREAU_STATUS_2_MEAN_MAX',
  'BUREAU_STATUS_2_SUM_SUM'],
 'BUREAU_STATUS_2_SUM_MIN': [],
 'BUREAU_STATUS_2_SUM_MAX': ['BUREAU_STATUS_2_SUM_SUM'],
 'BUREAU_STATUS_2_SUM_MEAN': [],
 'BUREAU_STATUS_2_SUM_SUM': ['BUREAU_STATUS_2_MEAN_SUM',
  'BUREAU_STATUS_2_SUM_MAX'],
 'BUREAU_STATUS_3_MEAN_MIN': [],
 'BUREAU_STATUS_3_MEAN_MAX': ['BUREAU_STATUS_3_MEAN_MEAN',
  'BUREAU_STATUS_3_MEAN_SUM'],
 'BUREAU_STATUS_3_MEAN_MEAN': ['BUREAU_STATUS_3_MEAN_MAX'],
 'BUREAU_STATUS_3_MEAN_SUM': ['BUREAU_STATUS_3_MEAN_MAX',
  'BUREAU_STATUS_3_SUM_SUM'],
 'BUREAU_STATUS_3_SUM_MIN': [],
 'BUREAU_STATUS_3_SUM_MAX': ['BUREAU_STATUS_3_SUM_SUM'],
 'BUREAU_STATUS_3_SUM_MEAN': [],
 'BUREAU_STATUS_3_SUM_SUM': ['BUREAU_STATUS_3_MEAN_SUM',
  'BUREAU_STATUS_3_SUM_MAX'],
 'BUREAU_STATUS_4_MEAN_MIN': [],
 'BUREAU_STATUS_4_MEAN_MAX': ['BUREAU_STATUS_4_MEAN_MEAN',
  'BUREAU_STATUS_4_MEAN_SUM'],
 'BUREAU_STATUS_4_MEAN_MEAN': ['BUREAU_STATUS_4_MEAN_MAX',
  'BUREAU_STATUS_4_SUM_MEAN'],
 'BUREAU_STATUS_4_MEAN_SUM': ['BUREAU_STATUS_4_MEAN_MAX',
  'BUREAU_STATUS_4_SUM_SUM'],
 'BUREAU_STATUS_4_SUM_MIN': [],
 'BUREAU_STATUS_4_SUM_MAX': [],
 'BUREAU_STATUS_4_SUM_MEAN': ['BUREAU_STATUS_4_MEAN_MEAN'],
 'BUREAU_STATUS_4_SUM_SUM': ['BUREAU_STATUS_4_MEAN_SUM'],
 'BUREAU_STATUS_5_MEAN_MIN': ['BUREAU_STATUS_5_SUM_MIN'],
 'BUREAU_STATUS_5_MEAN_MAX': ['BUREAU_STATUS_5_SUM_MAX'],
 'BUREAU_STATUS_5_MEAN_MEAN': ['BUREAU_STATUS_5_SUM_MEAN'],
 'BUREAU_STATUS_5_MEAN_SUM': ['BUREAU_STATUS_5_SUM_SUM'],
 'BUREAU_STATUS_5_SUM_MIN': ['BUREAU_STATUS_5_MEAN_MIN'],
 'BUREAU_STATUS_5_SUM_MAX': ['BUREAU_STATUS_5_MEAN_MAX',
  'BUREAU_STATUS_5_SUM_SUM'],
 'BUREAU_STATUS_5_SUM_MEAN': ['BUREAU_STATUS_5_MEAN_MEAN'],
 'BUREAU_STATUS_5_SUM_SUM': ['BUREAU_STATUS_5_MEAN_SUM',
  'BUREAU_STATUS_5_SUM_MAX'],
 'BUREAU_STATUS_C_MEAN_MIN': ['BUREAU_STATUS_C_SUM_MIN'],
 'BUREAU_STATUS_C_MEAN_MAX': [],
 'BUREAU_STATUS_C_MEAN_MEAN': ['BUREAU_STATUS_C_SUM_MEAN'],
 'BUREAU_STATUS_C_MEAN_SUM': ['BUREAU_MONTHS_BALANCE_MIN_SUM',
  'BUREAU_STATUS_C_SUM_SUM'],
 'BUREAU_STATUS_C_SUM_MIN': ['BUREAU_STATUS_C_MEAN_MIN'],
 'BUREAU_STATUS_C_SUM_MAX': [],
 'BUREAU_STATUS_C_SUM_MEAN': ['BUREAU_STATUS_C_MEAN_MEAN'],
 'BUREAU_STATUS_C_SUM_SUM': ['BUREAU_MONTHS_BALANCE_MIN_SUM',
  'BUREAU_MONTHS_BALANCE_SUM_SUM',
  'BUREAU_STATUS_C_MEAN_SUM'],
 'BUREAU_STATUS_X_MEAN_MIN': [],
 'BUREAU_STATUS_X_MEAN_MAX': [],
 'BUREAU_STATUS_X_MEAN_MEAN': [],
 'BUREAU_STATUS_X_MEAN_SUM': ['BUREAU_STATUS_X_SUM_SUM'],
 'BUREAU_STATUS_X_SUM_MIN': [],
 'BUREAU_STATUS_X_SUM_MAX': ['BUREAU_STATUS_X_SUM_SUM'],
 'BUREAU_STATUS_X_SUM_MEAN': [],
 'BUREAU_STATUS_X_SUM_SUM': ['BUREAU_STATUS_X_MEAN_SUM',
  'BUREAU_STATUS_X_SUM_MAX'],
 'TARGET': []}
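
The self-exclusion step in the loop above can also be written by column name instead of by value. The following is only a minimal sketch on made-up toy data (the frame `toy` and the dictionary names are hypothetical, not part of the notebook). It shows that comparing against the column name keeps a partner whose correlation is (numerically) perfect, whereas a `!= 1` filter may drop such a partner when the value computes to exactly 1.

```python
import pandas as pd

# Toy data (hypothetical): `b` is the exact complement of `a`, `c` is unrelated.
toy = pd.DataFrame({
    "a": [0, 1, 0, 1, 1, 0],
    "b": [1, 0, 1, 0, 0, 1],
    "c": [3, 1, 4, 1, 5, 9],
})
toy_corr_abs = toy.corr().abs()
toy_threshold = 0.8

partners_by_value = {}  # excludes "self" via `!= 1`, as in the cell above
partners_by_name = {}   # excludes "self" by comparing the column name
for col in toy_corr_abs:
    strong = toy_corr_abs[col] > toy_threshold
    partners_by_value[col] = list(toy_corr_abs.index[strong & (toy_corr_abs[col] != 1)])
    partners_by_name[col] = list(toy_corr_abs.index[strong & (toy_corr_abs.index != col)])

print(partners_by_value)
print(partners_by_name)
```

With the name-based filter, `a` and `b` list each other regardless of floating-point rounding. The upper-triangle approach used later in this section avoids the issue as well, because it removes the diagonal by position.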

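For each highly correlated feature pair, we keep only one of the two features.

  • If `feature_1` has more missing values in the training set than `feature_2`, drop `feature_1`; otherwise drop `feature_2`. Pairs in which either feature is already marked for removal are skipped.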

# Keep only the strict upper triangle so that each pair appears once and the diagonal is excluded.
upper = corr_abs.where(
    np.triu(np.ones(corr_abs.shape), k=1).astype(bool)
)
upper
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION ... BUREAU_STATUS_C_SUM_SUM BUREAU_STATUS_X_MEAN_MIN BUREAU_STATUS_X_MEAN_MAX BUREAU_STATUS_X_MEAN_MEAN BUREAU_STATUS_X_MEAN_SUM BUREAU_STATUS_X_SUM_MIN BUREAU_STATUS_X_SUM_MAX BUREAU_STATUS_X_SUM_MEAN BUREAU_STATUS_X_SUM_SUM TARGET
SK_ID_CURR NaN 0.001129 0.001820 0.000343 0.000433 0.000232 0.000849 0.001500 0.001366 0.000973 ... 0.000252 0.003102 0.003164 0.000689 0.001937 0.003786 0.002939 0.003459 0.000683 0.002108
CNT_CHILDREN NaN NaN 0.012882 0.002145 0.021374 0.001827 0.025573 0.330938 0.239818 0.183395 ... 0.005527 0.001161 0.005020 0.002323 0.001398 0.003052 0.003205 0.004988 0.003957 0.019187
AMT_INCOME_TOTAL NaN NaN NaN 0.156870 0.191657 0.159610 0.074796 0.027261 0.064223 0.027805 ... 0.024610 0.031615 0.074149 0.020127 0.026904 0.021876 0.060925 0.022379 0.024900 0.003982
AMT_CREDIT NaN NaN NaN NaN 0.770138 0.986968 0.099738 0.055436 0.066838 0.009621 ... 0.023609 0.003360 0.023895 0.019604 0.015181 0.014507 0.038554 0.036641 0.024139 0.030369
AMT_ANNUITY NaN NaN NaN NaN NaN 0.775109 0.118429 0.009445 0.104332 0.038514 ... 0.101982 0.006178 0.019453 0.007627 0.076965 0.009223 0.036626 0.030085 0.077655 0.012817
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
BUREAU_STATUS_X_SUM_MIN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.238225 0.655103 0.137826 0.014557
BUREAU_STATUS_X_SUM_MAX NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 0.784659 0.806159 0.030919
BUREAU_STATUS_X_SUM_MEAN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 0.691279 0.033292
BUREAU_STATUS_X_SUM_SUM NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.008445
TARGET NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

418 rows × 418 columns

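# Flatten the matrix into (feature_1, feature_2, corr) rows; NaN entries fail the comparison and drop out.
high_corr_pairs = upper.unstack().reset_index()
high_corr_pairs.columns = ['feature_1', 'feature_2', 'corr']
high_corr_pairs = high_corr_pairs[high_corr_pairs['corr'] > threshold]
high_corr_pairs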
feature_1 feature_2 corr
2093 AMT_GOODS_PRICE AMT_CREDIT 0.986968
5442 FLAG_EMP_PHONE DAYS_EMPLOYED 0.999755
7525 CNT_FAM_MEMBERS CNT_CHILDREN 0.879161
8379 REGION_RATING_CLIENT_W_CITY REGION_RATING_CLIENT 0.950842
10055 LIVE_REGION_NOT_WORK_REGION REG_REGION_NOT_WORK_REGION 0.860627
... ... ... ...
170884 BUREAU_STATUS_C_SUM_SUM BUREAU_MONTHS_BALANCE_MIN_SUM 0.847094
170896 BUREAU_STATUS_C_SUM_SUM BUREAU_MONTHS_BALANCE_SUM_SUM 0.892464
170948 BUREAU_STATUS_C_SUM_SUM BUREAU_STATUS_C_MEAN_SUM 0.911103
174300 BUREAU_STATUS_X_SUM_SUM BUREAU_STATUS_X_MEAN_SUM 0.857403
174302 BUREAU_STATUS_X_SUM_SUM BUREAU_STATUS_X_SUM_MAX 0.806159

220 rows × 3 columns
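
To make the two steps above easier to follow, here is a minimal sketch on a hand-written 3×3 correlation matrix (the matrix `toy_corr` and its values are made up for illustration). It masks everything except the strict upper triangle and then flattens the result into a pair table, in the same way as the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical absolute correlation matrix for three features.
toy_corr = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]],
    index=["x", "y", "z"], columns=["x", "y", "z"],
)

# True strictly above the diagonal, False elsewhere.
mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
toy_upper = toy_corr.where(mask)  # masked-out cells become NaN

# unstack() yields a Series indexed by (column label, row label).
toy_pairs = toy_upper.unstack().dropna().reset_index()
toy_pairs.columns = ["feature_1", "feature_2", "corr"]
toy_pairs = toy_pairs[toy_pairs["corr"] > 0.8]
print(toy_pairs)
```

Because `unstack()` puts the column label first, `feature_1` is the column and `feature_2` is the row of the original matrix, which matches the ordering seen in the output above (for example `AMT_GOODS_PRICE` before `AMT_CREDIT`).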

to_drop = set()
missing_counts = train.isnull().sum()
for index, row in high_corr_pairs.iterrows():
    f1, f2 = row['feature_1'], row['feature_2']
    if f1 in to_drop or f2 in to_drop:  # one of the pair is already marked for removal
        continue
    # Drop whichever feature has more missing values; on a tie, drop f2.
    if missing_counts[f1] > missing_counts[f2]:
        to_drop.add(f1)
    else:
        to_drop.add(f2)
to_drop = list(to_drop)
to_drop
['BUREAU_MONTHS_BALANCE_MIN_SUM',
 'CODE_GENDER_F',
 'BUREAU_STATUS_2_MEAN_SUM',
 'BUREAU_MONTHS_BALANCE_MAX_SUM',
 'bureau_CREDIT_ACTIVE_Closed_sum',
 'REG_CITY_NOT_WORK_CITY',
 'BUREAU_STATUS_C_MEAN_MEAN',
 'bureau_CREDIT_CURRENCY_currency 1_mean',
 'APARTMENTS_AVG',
 'LIVINGAPARTMENTS_MEDI',
 'bureau_CREDIT_TYPE_Consumer credit_mean',
 'FLAG_OWN_REALTY_N',
 'APARTMENTS_MODE',
 'BUREAU_STATUS_C_MEAN_SUM',
 'bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_sum',
 'ELEVATORS_MEDI',
 'LIVINGAREA_MODE',
 'ENTRANCES_AVG',
 'bureau_CREDIT_TYPE_Cash loan (non-earmarked)_sum',
 'REGION_RATING_CLIENT',
 'bureau_AMT_ANNUITY_max',
 'FLOORSMIN_MODE',
 'BUREAU_STATUS_5_MEAN_SUM',
 'YEARS_BUILD_AVG',
 'LIVINGAREA_AVG',
 'bureau_AMT_CREDIT_MAX_OVERDUE_mean',
 'FLAG_OWN_CAR_N',
 'BUREAU_STATUS_4_MEAN_MEAN',
 'bureau_previous_loan_counts',
 'BUREAU_STATUS_5_MEAN_MEAN',
 'BASEMENTAREA_AVG',
 'APARTMENTS_MEDI',
 'ELEVATORS_MODE',
 'AMT_GOODS_PRICE',
 'YEARS_BEGINEXPLUATATION_AVG',
 'BUREAU_STATUS_1_SUM_MAX',
 'BUREAU_STATUS_1_MEAN_MAX',
 'LANDAREA_AVG',
 'BUREAU_STATUS_3_MEAN_MAX',
 'FLAG_EMP_PHONE',
 'BUREAU_STATUS_5_SUM_MAX',
 'BUREAU_MONTHS_BALANCE_MIN_MIN',
 'BUREAU_STATUS_0_MEAN_SUM',
 'bureau_AMT_CREDIT_SUM_mean',
 'CNT_FAM_MEMBERS',
 'bureau_CREDIT_CURRENCY_currency 4_sum',
 'BUREAU_STATUS_3_SUM_MAX',
 'bureau_DAYS_ENDDATE_FACT_sum',
 'bureau_CNT_CREDIT_PROLONG_max',
 'BUREAU_MONTHS_BALANCE_SUM_SUM',
 'BUREAU_STATUS_4_MEAN_MAX',
 'BUREAU_STATUS_X_MEAN_SUM',
 'bureau_DAYS_CREDIT_sum',
 'BUREAU_STATUS_X_SUM_MAX',
 'bureau_AMT_CREDIT_SUM_max',
 'BUREAU_MONTHS_BALANCE_MEAN_MEAN',
 'NAME_EDUCATION_TYPE_Higher education',
 'BUREAU_STATUS_2_SUM_MAX',
 'NONLIVINGAREA_MODE',
 'FLOORSMIN_AVG',
 'HOUSETYPE_MODE_block of flats',
 'BUREAU_MONTHS_BALANCE_MIN_MEAN',
 'bureau_CREDIT_CURRENCY_currency 1_sum',
 'BUREAU_STATUS_5_MEAN_MAX',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'BUREAU_MONTHS_BALANCE_MEAN_SUM',
 'BUREAU_STATUS_2_MEAN_MAX',
 'ELEVATORS_AVG',
 'COMMONAREA_AVG',
 'BUREAU_MONTHS_BALANCE_SUM_MEAN',
 'bureau_CREDIT_ACTIVE_Active_mean',
 'bureau_AMT_CREDIT_SUM_DEBT_max',
 'LIVINGAREA_MEDI',
 'NONLIVINGAREA_AVG',
 'NONLIVINGAPARTMENTS_AVG',
 'BUREAU_MONTHS_BALANCE_MEAN_MAX',
 'BUREAU_STATUS_4_MEAN_SUM',
 'BUREAU_MONTHS_BALANCE_MAX_MIN',
 'bureau_DAYS_CREDIT_ENDDATE_max',
 'BUREAU_STATUS_C_MEAN_MIN',
 'BASEMENTAREA_MODE',
 'YEARS_BUILD_MODE',
 'LANDAREA_MODE',
 'bureau_CREDIT_DAY_OVERDUE_max',
 'FLOORSMAX_MODE',
 'bureau_DAYS_ENDDATE_FACT_min',
 'BUREAU_STATUS_3_MEAN_SUM',
 'FLOORSMAX_AVG',
 'REG_REGION_NOT_WORK_REGION',
 'COMMONAREA_MODE',
 'BUREAU_STATUS_1_MEAN_SUM',
 'NAME_INCOME_TYPE_Pensioner',
 'bureau_CREDIT_TYPE_Consumer credit_sum',
 'bureau_CREDIT_ACTIVE_Bad debt_sum',
 'BUREAU_MONTHS_BALANCE_MIN_MAX',
 'DAYS_EMPLOYED',
 'LIVINGAPARTMENTS_AVG',
 'NAME_CONTRACT_TYPE_Cash loans',
 'bureau_CREDIT_TYPE_Interbank credit_sum',
 'bureau_AMT_CREDIT_MAX_OVERDUE_min',
 'LIVINGAPARTMENTS_MODE',
 'BUREAU_STATUS_5_MEAN_MIN',
 'bureau_AMT_CREDIT_MAX_OVERDUE_max',
 'bureau_AMT_CREDIT_SUM_LIMIT_max',
 'YEARS_BEGINEXPLUATATION_MODE',
 'bureau_AMT_CREDIT_SUM_OVERDUE_max',
 'ENTRANCES_MODE',
 'NONLIVINGAPARTMENTS_MODE',
 'bureau_DAYS_CREDIT_min',
 'OBS_30_CNT_SOCIAL_CIRCLE',
 'bureau_CREDIT_TYPE_Mobile operator loan_sum',
 'BUREAU_MONTHS_BALANCE_SUM_MAX']

Remove these columns from the training and test sets.

len(to_drop)
112
train.columns
Index(['SK_ID_CURR', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE',
       'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION',
       ...
       'BUREAU_STATUS_C_SUM_SUM', 'BUREAU_STATUS_X_MEAN_MIN',
       'BUREAU_STATUS_X_MEAN_MAX', 'BUREAU_STATUS_X_MEAN_MEAN',
       'BUREAU_STATUS_X_MEAN_SUM', 'BUREAU_STATUS_X_SUM_MIN',
       'BUREAU_STATUS_X_SUM_MAX', 'BUREAU_STATUS_X_SUM_MEAN',
       'BUREAU_STATUS_X_SUM_SUM', 'TARGET'],
      dtype='str', length=418)
train_corrs_removed = train.drop(columns=to_drop)
test_corrs_removed = test.drop(columns=to_drop)
train_corrs_removed.to_feather('checkpoints/04_train_app_bureau_balance_bureau_cleaned.feather')
test_corrs_removed.to_feather('checkpoints/04_test_app_bureau_balance_bureau_cleaned.feather')
del corrs, corr_abs, train_corrs_removed, test_corrs_removed
gc.collect()
0

modeling#

Loading the data#

  • Reloading the data before running each model is recommended.

train = pd.read_feather('checkpoints/04_train_app_bureau_balance_bureau_cleaned.feather')
test = pd.read_feather('checkpoints/04_test_app_bureau_balance_bureau_cleaned.feather')

train_labels = train['TARGET']
train_ids = train['SK_ID_CURR']
test_ids = test['SK_ID_CURR']
train_features = train.drop(columns=['TARGET', 'SK_ID_CURR'])
test_features = test.drop(columns=['SK_ID_CURR'])
print(train.shape, test.shape)
(307511, 306) (48744, 305)

HistGradientBoostingClassifier#

from sklearn.ensemble import HistGradientBoostingClassifier

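HistGradientBoostingClassifier

  • Missing values do not need to be imputed; the model handles NaN natively.

  • Tree models are insensitive to feature scale, so no scaler is needed.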

%%time
hist_gradient_boost_model= HistGradientBoostingClassifier(
    max_iter = 100, # number of boosting iterations (trees)
    learning_rate = 0.1,
    max_depth = 5,
)
hist_gradient_boost_model.fit(train_features, train_labels)
CPU times: total: 2min 15s
Wall time: 17.6 s
HistGradientBoostingClassifier(max_depth=5)
from sklearn.metrics import roc_auc_score
train_prob = hist_gradient_boost_model.predict_proba(train_features)
train_prob
array([[0.57789429, 0.42210571],
       [0.96788937, 0.03211063],
       [0.95496642, 0.04503358],
       ...,
       [0.91929918, 0.08070082],
       [0.92218583, 0.07781417],
       [0.9186571 , 0.0813429 ]], shape=(307511, 2))
from sklearn.metrics import  roc_curve
fpr, tpr, thresholds = roc_curve(train_labels, train_prob[:, 1])
auc = roc_auc_score(train_labels, train_prob[:, 1])
plt.figure(figsize=(3,3))
plt.plot(fpr, tpr, color='blue', lw=2)
plt.title(f'hist gb Roc curve, auc={auc:.3f}')
Text(0.5, 1.0, 'hist gb Roc curve, auc=0.789')
[Figure: ROC curve of the hist gb model on the training data]
hist_gradient_boost_model_pred = hist_gradient_boost_model.predict_proba(test_features)
submit = pd.DataFrame({
    'SK_ID_CURR': test_ids
})
submit['TARGET'] = hist_gradient_boost_model_pred[:, 1]

submit.to_csv('hist_gradient_boost_model_with_bureau.csv', index = False)
submit.head()
SK_ID_CURR TARGET
0 100001 0.041575
1 100005 0.161436
2 100013 0.020080
3 100028 0.029833
4 100038 0.178684
submit.shape
(48744, 2)

Score: 73

lightgbm#

  • Column names must be cleaned first: LightGBM rejects feature names that contain special JSON characters.

import re
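# Define the cleaning function
def clean_names(df):
    # Replace every run of characters other than letters, digits, and underscores with one underscore
    # The pattern [^A-Za-z0-9_]+ matches spaces, slashes, parentheses, and other special characters
    df.columns = [re.sub(r'[^A-Za-z0-9_]+', '_', col) for col in df.columns]
    # Collapse repeated underscores (e.g. __) and strip leading/trailing ones
    df.columns = [re.sub(r'_+', '_', col).strip('_') for col in df.columns]
    return df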
train_features = clean_names(train_features)
test_features = clean_names(test_features)
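To make the effect of `clean_names` concrete, here is a small self-contained check. The first two names are assumed to be the pre-cleaning forms of columns from this dataset; `A B` and `A/B` are hypothetical names added to show that two different raw names can collapse into the same cleaned name, which is worth checking for before fitting.

```python
import re
import pandas as pd

def clean_names(df):
    # Same logic as the function above, repeated so the example runs on its own.
    df.columns = [re.sub(r'[^A-Za-z0-9_]+', '_', col) for col in df.columns]
    df.columns = [re.sub(r'_+', '_', col).strip('_') for col in df.columns]
    return df

demo = pd.DataFrame(columns=[
    'NAME_EDUCATION_TYPE_Secondary / secondary special',
    'bureau_CREDIT_CURRENCY_currency 1_sum',
    'A B',   # hypothetical: differs from the next name only in punctuation
    'A/B',
])
demo = clean_names(demo)

cleaned = list(demo.columns)
# Names that now occur more than once after cleaning.
duplicated = demo.columns[demo.columns.duplicated()].tolist()

print(cleaned)
print('duplicated after cleaning:', duplicated)
```

If `duplicated` is non-empty on the real data, one option is to append a numeric suffix to the repeated names before training.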
%%time
from lightgbm import LGBMClassifier
lgbm_model = LGBMClassifier(
    n_estimators=100,      # number of trees; corresponds to max_iter above
    learning_rate=0.1,     # learning rate
    max_depth=3,           # maximum tree depth
    random_state=42,       # makes the result reproducible
    n_jobs=-1              # use all CPU cores
)
lgbm_model.fit(train_features, train_labels)
[LightGBM] [Info] Number of positive: 24825, number of negative: 282686
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.109867 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 19607
[LightGBM] [Info] Number of data points in the train set: 307511, number of used features: 298
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080729 -> initscore=-2.432486
[LightGBM] [Info] Start training from score -2.432486
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
CPU times: total: 25.7 s
Wall time: 2.7 s
LGBMClassifier(max_depth=3, n_jobs=-1, random_state=42)
lgbm_model_pred = lgbm_model.predict_proba(test_features)[:, 1]
submit = pd.DataFrame(
    {
        'SK_ID_CURR': test_ids
    }
)
submit['TARGET'] = lgbm_model_pred

submit.to_csv('lgbm_model_pred_with_bureau.csv', index = False)

Score: 73

features_importance = pd.DataFrame(
    {
        'importance': lgbm_model.feature_importances_,
        'feature': lgbm_model.feature_name_
    }
)
features_importance_plot = features_importance.sort_values(by='importance', ascending=False).head(20)
features_importance_plot
importance feature
20 72 EXT_SOURCE_1
21 71 EXT_SOURCE_2
22 67 EXT_SOURCE_3
2 40 AMT_CREDIT
5 35 DAYS_BIRTH
228 21 bureau_DAYS_CREDIT_max
64 20 CODE_GENDER_M
244 18 bureau_AMT_CREDIT_SUM_DEBT_mean
3 18 AMT_ANNUITY
8 16 OWN_CAR_AGE
193 15 bureau_CREDIT_ACTIVE_Active_sum
242 14 bureau_AMT_CREDIT_SUM_sum
237 12 bureau_AMT_CREDIT_MAX_OVERDUE_sum
7 11 DAYS_ID_PUBLISH
83 11 NAME_EDUCATION_TYPE_Secondary_secondary_special
36 11 DAYS_LAST_PHONE_CHANGE
227 10 bureau_DAYS_CREDIT_mean
35 10 DEF_60_CNT_SOCIAL_CIRCLE
14 10 REGION_RATING_CLIENT_W_CITY
63 9 NAME_CONTRACT_TYPE_Revolving_loans
plt.figure(figsize=(10,6))
sns.barplot(
    data = features_importance_plot,
    x= 'importance',
    y = 'feature'
)
plt.tight_layout()
[Figure: bar chart of the top 20 LightGBM feature importances]

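There is still a modest improvement, and several new features, such as bureau_DAYS_CREDIT_max and bureau_AMT_CREDIT_SUM_DEBT_mean, now rank among the most important ones.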
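One way to quantify how much the new features contribute is to count how many of the top 20 come from the bureau tables. The sketch below copies the 20 rows from the importance table above and assumes that bureau-derived features can be recognised by a `bureau_` prefix (case-insensitive). Note that LightGBM's default `importance_type` is `'split'`, so these numbers count how often a feature is used in a split, not how much gain it produces.

```python
import pandas as pd

# Top-20 rows copied from the importance table shown above.
top20 = pd.DataFrame({
    'feature': [
        'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_CREDIT',
        'DAYS_BIRTH', 'bureau_DAYS_CREDIT_max', 'CODE_GENDER_M',
        'bureau_AMT_CREDIT_SUM_DEBT_mean', 'AMT_ANNUITY', 'OWN_CAR_AGE',
        'bureau_CREDIT_ACTIVE_Active_sum', 'bureau_AMT_CREDIT_SUM_sum',
        'bureau_AMT_CREDIT_MAX_OVERDUE_sum', 'DAYS_ID_PUBLISH',
        'NAME_EDUCATION_TYPE_Secondary_secondary_special',
        'DAYS_LAST_PHONE_CHANGE', 'bureau_DAYS_CREDIT_mean',
        'DEF_60_CNT_SOCIAL_CIRCLE', 'REGION_RATING_CLIENT_W_CITY',
        'NAME_CONTRACT_TYPE_Revolving_loans',
    ],
    'importance': [72, 71, 67, 40, 35, 21, 20, 18, 18, 16,
                   15, 14, 12, 11, 11, 11, 10, 10, 10, 9],
})

# Assumption: features built from bureau / bureau_balance start with "bureau_".
is_bureau = top20['feature'].str.lower().str.startswith('bureau_')

n_bureau = int(is_bureau.sum())
bureau_importance = int(top20.loc[is_bureau, 'importance'].sum())
total_importance = int(top20['importance'].sum())

print(f"{n_bureau} of {len(top20)} top features come from the bureau tables")
print(f"their share of top-20 split counts: {bureau_importance / total_importance:.1%}")
```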

Part 2#

  • Following the steps from Part 1, apply the same basic processing to the previous_application, POS_CASH_balance, installments_payments, and credit_card_balance files.

def get_missing_columns(df, rate=90):
    """Return the names of columns whose missing rate (in percent) exceeds `rate`."""
    missing_stats = df.isnull().sum() / len(df) * 100
    to_drop = missing_stats[missing_stats > rate].index.tolist()
    return to_drop
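A quick check of `get_missing_columns` on a toy frame shows how the threshold behaves. The column names and missing rates below are made up for illustration; the function body is repeated so the example runs on its own.

```python
import numpy as np
import pandas as pd

def get_missing_columns(df, rate=90):
    """Return the names of columns whose missing rate (in percent) exceeds `rate`."""
    missing_stats = df.isnull().sum() / len(df) * 100
    return missing_stats[missing_stats > rate].index.tolist()

n = 20
toy = pd.DataFrame({
    'full': np.arange(n, dtype=float),                            # 0% missing
    'half_missing': [np.nan if i % 2 else 1.0 for i in range(n)],  # 50% missing
    'mostly_missing': [1.0] + [np.nan] * (n - 1),                 # 95% missing
})

dropped_default = get_missing_columns(toy)          # default threshold of 90%
dropped_strict = get_missing_columns(toy, rate=40)  # a stricter, hypothetical threshold

print(dropped_default)
print(dropped_strict)
```

The comparison is strict (`>`), so a column whose missing rate equals the threshold is kept.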
def get_high_corr_columns(df, threshold=0.9):
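    """
    Find highly correlated features; within each correlated pair, keep the
    feature with fewer missing values.
    """
    # 1. Compute the absolute correlation matrix
    corr_matrix = df.corr().abs()

    # 2. Keep only the upper triangle (excluding the diagonal)
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # 3. Collect the columns that exceed the threshold
    # to_drop holds the candidate features to remove
    to_drop = set()

    # 4. Count missing values per column
    missing_counts = df.isnull().sum()

    # 5. Check every column for highly correlated partners
    for column in upper.columns:
        # Features whose correlation with `column` exceeds the threshold
        high_corr_features = upper[column][upper[column] > threshold].index.tolist()

        for feature in high_corr_features:
            # Compare the missing counts of `column` and `feature`
            # and drop whichever has more missing values
            if missing_counts[column] > missing_counts[feature]:
                to_drop.add(column)
                break  # `column` will be dropped, so its other pairs can be skipped
            else:
                to_drop.add(feature)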
                
    return list(to_drop)
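The two ideas in `get_high_corr_columns` are the upper-triangle mask, which makes sure each pair of features is examined only once, and the rule that the feature with more missing values is the one dropped. The following is a simplified sketch of those two steps on synthetic data (the column names `x`, `x_dup`, and `z` are hypothetical); it is not a replacement for the function above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'x': x,
    'x_dup': 2 * x,              # perfectly correlated with x
    'z': rng.normal(size=200),   # independent noise
})
toy.loc[:9, 'x_dup'] = np.nan    # x_dup gets 10 missing values, x has none

corr = toy.corr().abs()

# True above the diagonal only, so every pair appears exactly once.
mask = np.triu(np.ones(corr.shape), k=1).astype(bool)
upper = corr.where(mask)

threshold = 0.9
# (row, col) pairs whose absolute correlation exceeds the threshold;
# masked cells are NaN and never satisfy the comparison.
pairs = [(row, col) for col in upper.columns for row in upper.index
         if upper.loc[row, col] > threshold]

missing = toy.isnull().sum()
# Same rule as above: drop the column if it has more missing values,
# otherwise drop the row feature.
to_drop = sorted({col if missing[col] > missing[row] else row
                  for row, col in pairs})

print(pairs)
print(to_drop)
```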
def feature_select(train, test):
    """Remove columns with a high missing rate and highly correlated features.
    """
    train = train.copy()
    test = test.copy()

    train_labels = train['TARGET']
    train_ids = train['SK_ID_CURR']
    test_ids = test['SK_ID_CURR']

    # These two columns are excluded from feature selection
    train = train.drop(columns=['TARGET', 'SK_ID_CURR'])
    test = test.drop(columns=['SK_ID_CURR'])

    train = train.drop(columns=get_missing_columns(train))
    test = test.drop(columns=get_missing_columns(test))
    print('remove high missing cols. ', train.shape, test.shape)

    train, test = train.align(test, join='inner', axis=1)
    print('align train and test.', train.shape, test.shape)
    
    train_sample = train.sample(n=int(len(train) * 0.3))
    to_drop_columns = get_high_corr_columns(train_sample)
    train = train.drop(columns=to_drop_columns)
    test = test.drop(columns=to_drop_columns)
    
    train['TARGET'] = train_labels
    train['SK_ID_CURR'] = train_ids
    test['SK_ID_CURR'] = test_ids

    print('remove high corr cols.', train.shape, test.shape)

    return train, test
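The `align(join='inner', axis=1)` step in `feature_select` matters because the high-missing filter runs separately on train and test and can therefore remove different columns from each. A minimal illustration with made-up frames:

```python
import pandas as pd

# Hypothetical frames: 'b' survived only in train, 'd' only in test.
train_toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
test_toy = pd.DataFrame({'a': [7], 'c': [8], 'd': [9]})

# axis=1 aligns columns only; join='inner' keeps the columns present in both.
train_aligned, test_aligned = train_toy.align(test_toy, join='inner', axis=1)

print(list(train_aligned.columns), train_aligned.shape)
print(list(test_aligned.columns), test_aligned.shape)
```

Rows are left untouched, which is why `TARGET` and `SK_ID_CURR` can be reattached afterwards by index.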
from sklearn.metrics import roc_auc_score
from sklearn.metrics import  roc_curve

def plot_roc(targets, prob, name):
    fpr, tpr, thresholds = roc_curve(targets, prob)
    auc = roc_auc_score(targets, prob)
    plt.figure(figsize=(3,3))
    plt.plot(fpr, tpr, color='blue', lw=2)
    plt.title(f'{name} Roc curve, auc={auc:.3f}')

Adding the previous_application table#

  • Produces previous_agg_by_client

previous = pd.read_csv('data/previous_application.csv')
previous.head()
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START ... NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15 ... Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11 ... XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11 ... XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7 ... XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9 ... XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN

5 rows × 37 columns

previous.shape
(1670214, 37)
previous.columns
Index(['SK_ID_PREV', 'SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'AMT_ANNUITY',
       'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
       'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
       'RATE_INTEREST_PRIVILEGED', 'NAME_CASH_LOAN_PURPOSE',
       'NAME_CONTRACT_STATUS', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
       'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE',
       'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
       'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
       'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
       'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
       'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'],
      dtype='str')
previous.dtypes
SK_ID_PREV                       int64
SK_ID_CURR                       int64
NAME_CONTRACT_TYPE                 str
AMT_ANNUITY                    float64
AMT_APPLICATION                float64
AMT_CREDIT                     float64
AMT_DOWN_PAYMENT               float64
AMT_GOODS_PRICE                float64
WEEKDAY_APPR_PROCESS_START         str
HOUR_APPR_PROCESS_START          int64
FLAG_LAST_APPL_PER_CONTRACT        str
NFLAG_LAST_APPL_IN_DAY           int64
RATE_DOWN_PAYMENT              float64
RATE_INTEREST_PRIMARY          float64
RATE_INTEREST_PRIVILEGED       float64
NAME_CASH_LOAN_PURPOSE             str
NAME_CONTRACT_STATUS               str
DAYS_DECISION                    int64
NAME_PAYMENT_TYPE                  str
CODE_REJECT_REASON                 str
NAME_TYPE_SUITE                    str
NAME_CLIENT_TYPE                   str
NAME_GOODS_CATEGORY                str
NAME_PORTFOLIO                     str
NAME_PRODUCT_TYPE                  str
CHANNEL_TYPE                       str
SELLERPLACE_AREA                 int64
NAME_SELLER_INDUSTRY               str
CNT_PAYMENT                    float64
NAME_YIELD_GROUP                   str
PRODUCT_COMBINATION                str
DAYS_FIRST_DRAWING             float64
DAYS_FIRST_DUE                 float64
DAYS_LAST_DUE_1ST_VERSION      float64
DAYS_LAST_DUE                  float64
DAYS_TERMINATION               float64
NFLAG_INSURED_ON_APPROVAL      float64
dtype: object
previous_categorical_agg = agg_categorical(previous, ['SK_ID_PREV', 'SK_ID_CURR'], 'previous')
previous_categorical_agg.head()
SK_ID_PREV SK_ID_CURR previous_NAME_CONTRACT_TYPE_Cash loans_mean previous_NAME_CONTRACT_TYPE_Cash loans_sum previous_NAME_CONTRACT_TYPE_Consumer loans_mean previous_NAME_CONTRACT_TYPE_Consumer loans_sum previous_NAME_CONTRACT_TYPE_Revolving loans_mean previous_NAME_CONTRACT_TYPE_Revolving loans_sum previous_NAME_CONTRACT_TYPE_XNA_mean previous_NAME_CONTRACT_TYPE_XNA_sum ... previous_PRODUCT_COMBINATION_POS industry without interest_mean previous_PRODUCT_COMBINATION_POS industry without interest_sum previous_PRODUCT_COMBINATION_POS mobile with interest_mean previous_PRODUCT_COMBINATION_POS mobile with interest_sum previous_PRODUCT_COMBINATION_POS mobile without interest_mean previous_PRODUCT_COMBINATION_POS mobile without interest_sum previous_PRODUCT_COMBINATION_POS other with interest_mean previous_PRODUCT_COMBINATION_POS other with interest_sum previous_PRODUCT_COMBINATION_POS others without interest_mean previous_PRODUCT_COMBINATION_POS others without interest_sum
0 1000001 158271 0.0 0 1.0 1 0.0 0 0.0 0 ... 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0
1 1000002 101962 0.0 0 1.0 1 0.0 0 0.0 0 ... 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0
2 1000003 252457 0.0 0 1.0 1 0.0 0 0.0 0 ... 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0
3 1000004 260094 0.0 0 1.0 1 0.0 0 0.0 0 ... 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0
4 1000005 176456 0.0 0 1.0 1 0.0 0 0.0 0 ... 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0

5 rows × 288 columns

previous_numeric_agg = agg_numeric(previous,['SK_ID_PREV', 'SK_ID_CURR'], 'previous')
previous_numeric_agg.head()
SK_ID_PREV SK_ID_CURR PREVIOUS_AMT_ANNUITY_MIN PREVIOUS_AMT_ANNUITY_MAX PREVIOUS_AMT_ANNUITY_MEAN PREVIOUS_AMT_ANNUITY_SUM PREVIOUS_AMT_APPLICATION_MIN PREVIOUS_AMT_APPLICATION_MAX PREVIOUS_AMT_APPLICATION_MEAN PREVIOUS_AMT_APPLICATION_SUM ... PREVIOUS_DAYS_LAST_DUE_MEAN PREVIOUS_DAYS_LAST_DUE_SUM PREVIOUS_DAYS_TERMINATION_MIN PREVIOUS_DAYS_TERMINATION_MAX PREVIOUS_DAYS_TERMINATION_MEAN PREVIOUS_DAYS_TERMINATION_SUM PREVIOUS_NFLAG_INSURED_ON_APPROVAL_MIN PREVIOUS_NFLAG_INSURED_ON_APPROVAL_MAX PREVIOUS_NFLAG_INSURED_ON_APPROVAL_MEAN PREVIOUS_NFLAG_INSURED_ON_APPROVAL_SUM
0 1000001 158271 6404.310 6404.310 6404.310 6404.310 58905.000 58905.000 58905.000 58905.000 ... -238.0 -238.0 -233.0 -233.0 -233.0 -233.0 0.0 0.0 0.0 0.0
1 1000002 101962 6264.000 6264.000 6264.000 6264.000 39145.500 39145.500 39145.500 39145.500 ... -1510.0 -1510.0 -1501.0 -1501.0 -1501.0 -1501.0 0.0 0.0 0.0 0.0
2 1000003 252457 4951.350 4951.350 4951.350 4951.350 47056.275 47056.275 47056.275 47056.275 ... 365243.0 365243.0 365243.0 365243.0 365243.0 365243.0 1.0 1.0 1.0 1.0
3 1000004 260094 3391.110 3391.110 3391.110 3391.110 35144.370 35144.370 35144.370 35144.370 ... -682.0 -682.0 -672.0 -672.0 -672.0 -672.0 0.0 0.0 0.0 0.0
4 1000005 176456 14713.605 14713.605 14713.605 14713.605 123486.075 123486.075 123486.075 123486.075 ... -1418.0 -1418.0 -1415.0 -1415.0 -1415.0 -1415.0 0.0 0.0 0.0 0.0

5 rows × 78 columns

previous_agg = pd.merge(previous_numeric_agg, previous_categorical_agg, on=['SK_ID_PREV', 'SK_ID_CURR'], how='left')
previous_agg.head()
SK_ID_PREV SK_ID_CURR PREVIOUS_AMT_ANNUITY_MIN PREVIOUS_AMT_ANNUITY_MAX PREVIOUS_AMT_ANNUITY_MEAN PREVIOUS_AMT_ANNUITY_SUM PREVIOUS_AMT_APPLICATION_MIN PREVIOUS_AMT_APPLICATION_MAX PREVIOUS_AMT_APPLICATION_MEAN PREVIOUS_AMT_APPLICATION_SUM ... previous_PRODUCT_COMBINATION_POS industry without interest_mean previous_PRODUCT_COMBINATION_POS industry without interest_sum previous_PRODUCT_COMBINATION_POS mobile with interest_mean previous_PRODUCT_COMBINATION_POS mobile with interest_sum previous_PRODUCT_COMBINATION_POS mobile without interest_mean previous_PRODUCT_COMBINATION_POS mobile without interest_sum previous_PRODUCT_COMBINATION_POS other with interest_mean previous_PRODUCT_COMBINATION_POS other with interest_sum previous_PRODUCT_COMBINATION_POS others without interest_mean previous_PRODUCT_COMBINATION_POS others without interest_sum
0 1000001 158271 6404.310 6404.310 6404.310 6404.310 58905.000 58905.000 58905.000 58905.000 ... 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0
1 1000002 101962 6264.000 6264.000 6264.000 6264.000 39145.500 39145.500 39145.500 39145.500 ... 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0
2 1000003 252457 4951.350 4951.350 4951.350 4951.350 47056.275 47056.275 47056.275 47056.275 ... 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0
3 1000004 260094 3391.110 3391.110 3391.110 3391.110 35144.370 35144.370 35144.370 35144.370 ... 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0
4 1000005 176456 14713.605 14713.605 14713.605 14713.605 123486.075 123486.075 123486.075 123486.075 ... 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0

5 rows × 364 columns

Aggregate by client

previous_agg_by_client = agg_numeric(previous_agg, 'SK_ID_CURR', exclude_columns=['SK_ID_PREV'])
previous_agg_by_client.columns
Index(['SK_ID_CURR', 'PREVIOUS_AMT_ANNUITY_MIN_MIN',
       'PREVIOUS_AMT_ANNUITY_MIN_MAX', 'PREVIOUS_AMT_ANNUITY_MIN_MEAN',
       'PREVIOUS_AMT_ANNUITY_MIN_SUM', 'PREVIOUS_AMT_ANNUITY_MAX_MIN',
       'PREVIOUS_AMT_ANNUITY_MAX_MAX', 'PREVIOUS_AMT_ANNUITY_MAX_MEAN',
       'PREVIOUS_AMT_ANNUITY_MAX_SUM', 'PREVIOUS_AMT_ANNUITY_MEAN_MIN',
       ...
       'PREVIOUS_PRODUCT_COMBINATION_POS OTHER WITH INTEREST_SUM_MEAN',
       'PREVIOUS_PRODUCT_COMBINATION_POS OTHER WITH INTEREST_SUM_SUM',
       'PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_MEAN_MIN',
       'PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_MEAN_MAX',
       'PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_MEAN_MEAN',
       'PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_MEAN_SUM',
       'PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_MIN',
       'PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_MAX',
       'PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_MEAN',
       'PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_SUM'],
      dtype='str', length=1449)
previous_agg_by_client.head()
SK_ID_CURR PREVIOUS_AMT_ANNUITY_MIN_MIN PREVIOUS_AMT_ANNUITY_MIN_MAX PREVIOUS_AMT_ANNUITY_MIN_MEAN PREVIOUS_AMT_ANNUITY_MIN_SUM PREVIOUS_AMT_ANNUITY_MAX_MIN PREVIOUS_AMT_ANNUITY_MAX_MAX PREVIOUS_AMT_ANNUITY_MAX_MEAN PREVIOUS_AMT_ANNUITY_MAX_SUM PREVIOUS_AMT_ANNUITY_MEAN_MIN ... PREVIOUS_PRODUCT_COMBINATION_POS OTHER WITH INTEREST_SUM_MEAN PREVIOUS_PRODUCT_COMBINATION_POS OTHER WITH INTEREST_SUM_SUM PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_MEAN_MIN PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_MEAN_MAX PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_MEAN_MEAN PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_MEAN_SUM PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_MIN PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_MAX PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_MEAN PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_SUM
0 100001 3951.000 3951.000 3951.000 3951.000 3951.000 3951.000 3951.000 3951.000 3951.000 ... 0.0 0 0.0 0.0 0.0 0.0 0 0 0.0 0
1 100002 9251.775 9251.775 9251.775 9251.775 9251.775 9251.775 9251.775 9251.775 9251.775 ... 1.0 1 0.0 0.0 0.0 0.0 0 0 0.0 0
2 100003 6737.310 98356.995 56553.990 169661.970 6737.310 98356.995 56553.990 169661.970 6737.310 ... 0.0 0 0.0 0.0 0.0 0.0 0 0 0.0 0
3 100004 5357.250 5357.250 5357.250 5357.250 5357.250 5357.250 5357.250 5357.250 5357.250 ... 0.0 0 0.0 0.0 0.0 0.0 0 0 0.0 0
4 100005 4813.200 4813.200 4813.200 4813.200 4813.200 4813.200 4813.200 4813.200 4813.200 ... 0.0 0 0.0 0.0 0.0 0.0 0 0 0.0 0

5 rows × 1449 columns
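In the first-level output above, the MIN, MAX, MEAN, and SUM columns are identical in every displayed row, which is what one would expect if each SK_ID_PREV occurs only once in this table (an assumption based on the displayed rows, not verified here). In that case the second-level aggregation produces many exact duplicates, such as `..._MIN_MAX` and `..._MAX_MAX`. The sketch below reproduces the effect with plain `groupby` on toy data; it does not use the notebook's `agg_numeric`, and all values are made up.

```python
import pandas as pd

# Toy table with one row per SK_ID_PREV (hypothetical values).
prev = pd.DataFrame({
    'SK_ID_PREV': [1, 2, 3, 4],
    'SK_ID_CURR': [10, 10, 10, 20],
    'AMT_ANNUITY': [100.0, 300.0, 200.0, 50.0],
})
stats = ['min', 'max', 'mean', 'sum']

# Level 1: aggregate per previous application.
level1 = prev.groupby(['SK_ID_PREV', 'SK_ID_CURR'])['AMT_ANNUITY'].agg(stats)
level1.columns = [f'AMT_ANNUITY_{s.upper()}' for s in stats]
level1 = level1.reset_index()

# Each group holds a single row, so all four statistics coincide.
level1_identical = bool(
    (level1['AMT_ANNUITY_MIN'] == level1['AMT_ANNUITY_SUM']).all()
)

# Level 2: aggregate the level-1 columns per client.
level2 = level1.drop(columns='SK_ID_PREV').groupby('SK_ID_CURR').agg(stats)
level2.columns = ['_'.join([col, s.upper()]) for col, s in level2.columns]

# Count how many of the level-2 columns are actually distinct.
n_columns = level2.shape[1]
n_distinct = level2.T.drop_duplicates().shape[0]

print(f"level-1 statistics identical: {level1_identical}")
print(f"level-2 columns: {n_columns}, distinct: {n_distinct}")
```

If the assumption holds for the real table, aggregating `previous` directly by SK_ID_CURR would give the same information with about a quarter of the numeric columns, and it would probably account for a large part of the columns later removed by the correlation filter.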

sk_id_prev_cnts = previous_agg.groupby(by='SK_ID_CURR')['SK_ID_PREV'].count().reset_index().rename(columns = {'SK_ID_PREV' : 'prev_applications_counts'})
sk_id_prev_cnts
SK_ID_CURR prev_applications_counts
0 100001 1
1 100002 1
2 100003 3
3 100004 1
4 100005 2
... ... ...
338852 456251 1
338853 456252 1
338854 456253 2
338855 456254 2
338856 456255 8

338857 rows × 2 columns

previous_agg_by_client = pd.merge(
    previous_agg_by_client,
    sk_id_prev_cnts,
    on = 'SK_ID_CURR',
    how = 'left'
)
previous_agg_by_client
SK_ID_CURR PREVIOUS_AMT_ANNUITY_MIN_MIN PREVIOUS_AMT_ANNUITY_MIN_MAX PREVIOUS_AMT_ANNUITY_MIN_MEAN PREVIOUS_AMT_ANNUITY_MIN_SUM PREVIOUS_AMT_ANNUITY_MAX_MIN PREVIOUS_AMT_ANNUITY_MAX_MAX PREVIOUS_AMT_ANNUITY_MAX_MEAN PREVIOUS_AMT_ANNUITY_MAX_SUM PREVIOUS_AMT_ANNUITY_MEAN_MIN ... PREVIOUS_PRODUCT_COMBINATION_POS OTHER WITH INTEREST_SUM_SUM PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_MEAN_MIN PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_MEAN_MAX PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_MEAN_MEAN PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_MEAN_SUM PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_MIN PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_MAX PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_MEAN PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_SUM prev_applications_counts
0 100001 3951.000 3951.000 3951.000000 3951.000 3951.000 3951.000 3951.000000 3951.000 3951.000 ... 0 0.0 0.0 0.0 0.0 0 0 0.0 0 1
1 100002 9251.775 9251.775 9251.775000 9251.775 9251.775 9251.775 9251.775000 9251.775 9251.775 ... 1 0.0 0.0 0.0 0.0 0 0 0.0 0 1
2 100003 6737.310 98356.995 56553.990000 169661.970 6737.310 98356.995 56553.990000 169661.970 6737.310 ... 0 0.0 0.0 0.0 0.0 0 0 0.0 0 3
3 100004 5357.250 5357.250 5357.250000 5357.250 5357.250 5357.250 5357.250000 5357.250 5357.250 ... 0 0.0 0.0 0.0 0.0 0 0 0.0 0 1
4 100005 4813.200 4813.200 4813.200000 4813.200 4813.200 4813.200 4813.200000 4813.200 4813.200 ... 0 0.0 0.0 0.0 0.0 0 0 0.0 0 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
338852 456251 6605.910 6605.910 6605.910000 6605.910 6605.910 6605.910 6605.910000 6605.910 6605.910 ... 0 0.0 0.0 0.0 0.0 0 0 0.0 0 1
338853 456252 10074.465 10074.465 10074.465000 10074.465 10074.465 10074.465 10074.465000 10074.465 10074.465 ... 0 0.0 0.0 0.0 0.0 0 0 0.0 0 1
338854 456253 3973.095 5567.715 4770.405000 9540.810 3973.095 5567.715 4770.405000 9540.810 3973.095 ... 0 0.0 0.0 0.0 0.0 0 0 0.0 0 2
338855 456254 2296.440 19065.825 10681.132500 21362.265 2296.440 19065.825 10681.132500 21362.265 2296.440 ... 0 0.0 0.0 0.0 0.0 0 0 0.0 0 2
338856 456255 2250.000 54022.140 20775.391875 166203.135 2250.000 54022.140 20775.391875 166203.135 2250.000 ... 0 0.0 0.0 0.0 0.0 0 0 0.0 0 8

338857 rows × 1450 columns

previous_agg_by_client.to_feather('checkpoints/02_previous_agg.feather')
del previous, previous_categorical_agg, previous_numeric_agg, previous_agg,sk_id_prev_cnts
gc.collect()
6381

Feature selection

train = pd.read_feather('checkpoints/01_train_app_base.feather')
test = pd.read_feather('checkpoints/01_test_app_base.feather')
previous_agg = pd.read_feather('checkpoints/02_previous_agg.feather')

train = pd.merge(train, previous_agg, on='SK_ID_CURR', how='left')
test = pd.merge(test, previous_agg, on='SK_ID_CURR', how='left')

print(f'train: {train.shape}, test: {test.shape}')
train, test = feature_select(train, test)
print(f'train: {train.shape}, test: {test.shape}')

train.to_feather('checkpoints/04_train_app_previous_cleaned.feather')
test.to_feather('checkpoints/04_test_app_previous_cleaned.feather')
train: (307511, 1692), test: (48744, 1691)
remove high missing cols.  (307511, 1672) (48744, 1672)
align train and test. (307511, 1672) (48744, 1672)
remove high corr cols. (307511, 764) (48744, 763)
train: (307511, 764), test: (48744, 763)
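
The log above suggests that `feature_select` (defined earlier in this notebook) runs three steps: drop high-missing columns, align train and test, then drop highly correlated columns. The snippet below is only a minimal sketch of that flow under stated assumptions, not the notebook's actual implementation. The name `feature_select_sketch`, the thresholds `0.75` and `0.9`, and the rule "drop the later column of each correlated pair" are all hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd


def high_missing_columns(df, threshold):
    """Columns whose share of missing values is above the threshold."""
    ratio = df.isna().mean()
    return set(ratio[ratio > threshold].index)


def feature_select_sketch(train, test, target="TARGET", id_col="SK_ID_CURR",
                          missing_threshold=0.75, corr_threshold=0.9):
    # Set the label aside so it survives the train/test alignment.
    labels = train[target]
    train = train.drop(columns=[target])

    # Step 1: drop columns that are mostly missing in either frame.
    to_drop = (high_missing_columns(train, missing_threshold)
               | high_missing_columns(test, missing_threshold))
    train = train.drop(columns=[c for c in train.columns if c in to_drop])
    test = test.drop(columns=[c for c in test.columns if c in to_drop])

    # Step 2: keep only the columns that both frames share.
    train, test = train.align(test, join="inner", axis=1)

    # Step 3: look at the upper triangle of the absolute correlation matrix
    # and drop the later column of every pair above the threshold.
    corr = train.drop(columns=[id_col]).select_dtypes("number").corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    high_corr = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    train = train.drop(columns=high_corr)
    test = test.drop(columns=high_corr)

    # Put the label back; this is why train ends up one column wider than test.
    train[target] = labels
    return train, test


# Toy data (made-up values): "a_dup" duplicates "a", "empty" is entirely missing.
toy_train = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "a": [1.0, 2.0, 3.0, 4.0],
    "a_dup": [2.0, 4.0, 6.0, 8.0],
    "b": [4.0, 1.0, 3.0, 2.0],
    "empty": [np.nan, np.nan, np.nan, np.nan],
    "TARGET": [0, 1, 0, 1],
})
toy_test = toy_train.drop(columns=["TARGET"]).copy()
sel_train, sel_test = feature_select_sketch(toy_train, toy_test)
print(sel_train.shape, sel_test.shape)
```

On the toy data, `empty` is removed in step 1 and `a_dup` in step 3, so the two frames differ only by the restored `TARGET` column, mirroring the one-column gap between train and test in the log.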

credit_card_balance#

credit_card_balance = pd.read_csv('data/credit_card_balance.csv')
credit_card_balance.head()
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY ... AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 2562384 378907 -6 56.970 135000 0.0 877.5 0.0 877.5 1700.325 ... 0.000 0.000 0.0 1 0.0 1.0 35.0 Active 0 0
1 2582071 363914 -1 63975.555 45000 2250.0 2250.0 0.0 0.0 2250.000 ... 64875.555 64875.555 1.0 1 0.0 0.0 69.0 Active 0 0
2 1740877 371185 -7 31815.225 450000 0.0 0.0 0.0 0.0 2250.000 ... 31460.085 31460.085 0.0 0 0.0 0.0 30.0 Active 0 0
3 1389973 337855 -4 236572.110 225000 2250.0 2250.0 0.0 0.0 11795.760 ... 233048.970 233048.970 1.0 1 0.0 0.0 10.0 Active 0 0
4 1891521 126868 -1 453919.455 450000 0.0 11547.0 0.0 11547.0 22924.890 ... 453919.455 453919.455 0.0 1 0.0 1.0 101.0 Active 0 0

5 rows × 23 columns

credit_card_balance.dtypes
SK_ID_PREV                      int64
SK_ID_CURR                      int64
MONTHS_BALANCE                  int64
AMT_BALANCE                   float64
AMT_CREDIT_LIMIT_ACTUAL         int64
AMT_DRAWINGS_ATM_CURRENT      float64
AMT_DRAWINGS_CURRENT          float64
AMT_DRAWINGS_OTHER_CURRENT    float64
AMT_DRAWINGS_POS_CURRENT      float64
AMT_INST_MIN_REGULARITY       float64
AMT_PAYMENT_CURRENT           float64
AMT_PAYMENT_TOTAL_CURRENT     float64
AMT_RECEIVABLE_PRINCIPAL      float64
AMT_RECIVABLE                 float64
AMT_TOTAL_RECEIVABLE          float64
CNT_DRAWINGS_ATM_CURRENT      float64
CNT_DRAWINGS_CURRENT            int64
CNT_DRAWINGS_OTHER_CURRENT    float64
CNT_DRAWINGS_POS_CURRENT      float64
CNT_INSTALMENT_MATURE_CUM     float64
NAME_CONTRACT_STATUS              str
SK_DPD                          int64
SK_DPD_DEF                      int64
dtype: object
credit_card_balance_numeric_agg = agg_numeric(credit_card_balance, ['SK_ID_CURR', 'SK_ID_PREV'], 'credit_card_balance')
credit_card_balance_numeric_agg
SK_ID_CURR SK_ID_PREV CREDIT_CARD_BALANCE_MONTHS_BALANCE_MIN CREDIT_CARD_BALANCE_MONTHS_BALANCE_MAX CREDIT_CARD_BALANCE_MONTHS_BALANCE_MEAN CREDIT_CARD_BALANCE_MONTHS_BALANCE_SUM CREDIT_CARD_BALANCE_AMT_BALANCE_MIN CREDIT_CARD_BALANCE_AMT_BALANCE_MAX CREDIT_CARD_BALANCE_AMT_BALANCE_MEAN CREDIT_CARD_BALANCE_AMT_BALANCE_SUM ... CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MEAN CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM CREDIT_CARD_BALANCE_SK_DPD_MIN CREDIT_CARD_BALANCE_SK_DPD_MAX CREDIT_CARD_BALANCE_SK_DPD_MEAN CREDIT_CARD_BALANCE_SK_DPD_SUM CREDIT_CARD_BALANCE_SK_DPD_DEF_MIN CREDIT_CARD_BALANCE_SK_DPD_DEF_MAX CREDIT_CARD_BALANCE_SK_DPD_DEF_MEAN CREDIT_CARD_BALANCE_SK_DPD_DEF_SUM
0 100006 1489396 -6 -1 -3.5 -21 0.000 0.000 0.000000 0.000 ... 0.000000 0.0 0 0 0.000000 0 0 0 0.000000 0
1 100011 1843384 -75 -2 -38.5 -2849 0.000 189000.000 54482.111149 4031676.225 ... 25.767123 1881.0 0 0 0.000000 0 0 0 0.000000 0
2 100013 2038692 -96 -1 -48.5 -4656 0.000 161420.220 18159.919219 1743352.245 ... 18.719101 1666.0 0 1 0.010417 1 0 1 0.010417 1
3 100021 2594025 -18 -2 -10.0 -170 0.000 0.000 0.000000 0.000 ... 0.000000 0.0 0 0 0.000000 0 0 0 0.000000 0
4 100023 1499902 -11 -4 -7.5 -60 0.000 0.000 0.000000 0.000 ... 0.000000 0.0 0 0 0.000000 0 0 0 0.000000 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
104302 456244 2181926 -41 -1 -21.0 -861 0.000 453627.675 131834.730732 5405223.960 ... 13.600000 544.0 0 0 0.000000 0 0 0 0.000000 0
104303 456246 1079732 -9 -2 -5.5 -44 0.000 43490.115 13136.731875 105093.855 ... 3.500000 28.0 0 0 0.000000 0 0 0 0.000000 0
104304 456247 1595171 -96 -2 -49.0 -4655 0.000 190202.130 23216.396211 2205557.640 ... 26.494737 2517.0 0 1 0.031579 3 0 1 0.021053 2
104305 456248 2743495 -24 -2 -13.0 -299 0.000 0.000 0.000000 0.000 ... 0.000000 0.0 0 0 0.000000 0 0 0 0.000000 0
104306 456250 1794451 -12 -1 -6.5 -78 153832.725 200208.915 173589.326250 2083071.915 ... 4.583333 55.0 0 0 0.000000 0 0 0 0.000000 0

104307 rows × 82 columns

credit_card_balance_categorical_agg = agg_categorical(credit_card_balance, ['SK_ID_CURR', 'SK_ID_PREV'], 'credit_card_balance')
credit_card_balance_categorical_agg
SK_ID_CURR SK_ID_PREV credit_card_balance_NAME_CONTRACT_STATUS_Active_mean credit_card_balance_NAME_CONTRACT_STATUS_Active_sum credit_card_balance_NAME_CONTRACT_STATUS_Approved_mean credit_card_balance_NAME_CONTRACT_STATUS_Approved_sum credit_card_balance_NAME_CONTRACT_STATUS_Completed_mean credit_card_balance_NAME_CONTRACT_STATUS_Completed_sum credit_card_balance_NAME_CONTRACT_STATUS_Demand_mean credit_card_balance_NAME_CONTRACT_STATUS_Demand_sum credit_card_balance_NAME_CONTRACT_STATUS_Refused_mean credit_card_balance_NAME_CONTRACT_STATUS_Refused_sum credit_card_balance_NAME_CONTRACT_STATUS_Sent proposal_mean credit_card_balance_NAME_CONTRACT_STATUS_Sent proposal_sum credit_card_balance_NAME_CONTRACT_STATUS_Signed_mean credit_card_balance_NAME_CONTRACT_STATUS_Signed_sum
0 100006 1489396 1.000000 6 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0
1 100011 1843384 1.000000 74 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0
2 100013 2038692 1.000000 96 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0
3 100021 2594025 0.411765 7 0.0 0 0.588235 10 0.0 0 0.0 0 0.0 0 0.0 0
4 100023 1499902 1.000000 8 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
104302 456244 2181926 0.878049 36 0.0 0 0.121951 5 0.0 0 0.0 0 0.0 0 0.0 0
104303 456246 1079732 1.000000 8 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0
104304 456247 1595171 1.000000 95 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0
104305 456248 2743495 1.000000 23 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0
104306 456250 1794451 1.000000 12 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0

104307 rows × 16 columns
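
The column names above (`..._Active_mean`, `..._Active_sum`) suggest that `agg_categorical` one-hot encodes each categorical column and then takes the mean and sum per group, so the mean is the share of months in that status and the sum is the number of months. The following is a rough sketch of that idea, assuming this is indeed how the helper works; `agg_categorical_sketch` and the toy values are hypothetical.

```python
import pandas as pd


def agg_categorical_sketch(df, group_keys, prefix):
    keys = [group_keys] if isinstance(group_keys, str) else list(group_keys)
    # Treat every non-numeric, non-key column as categorical.
    cat_cols = [c for c in df.columns
                if c not in keys and not pd.api.types.is_numeric_dtype(df[c])]
    # One 0/1 indicator column per category value, named COLUMN_value.
    dummies = pd.get_dummies(df[cat_cols], columns=cat_cols, dtype=int)
    encoded = pd.concat([df[keys], dummies], axis=1)
    agg = encoded.groupby(keys).agg(["mean", "sum"])
    # Flatten (column, statistic) pairs into prefix_COLUMN_value_stat names.
    agg.columns = [f"{prefix}_{col}_{stat}" for col, stat in agg.columns]
    return agg.reset_index()


# Toy monthly records (made-up values): loan 10 has three months, loan 20 has one.
toy_monthly = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2],
    "SK_ID_PREV": [10, 10, 10, 20],
    "NAME_CONTRACT_STATUS": ["Active", "Active", "Completed", "Active"],
})
status_agg = agg_categorical_sketch(toy_monthly, ["SK_ID_CURR", "SK_ID_PREV"], "cc")
print(status_agg)
```

For loan 10 the sketch reports an `Active` sum of 2 and mean of 2/3, which matches the pattern in the real output where, for example, a mean of 0.411765 with a sum of 7 corresponds to 7 of 17 months.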

credit_card_balance_agg = pd.merge(credit_card_balance_numeric_agg, credit_card_balance_categorical_agg, 
    on = ['SK_ID_CURR', 'SK_ID_PREV'],
    how = 'left'
)
credit_card_balance_agg_by_client = agg_numeric(credit_card_balance_agg, 'SK_ID_CURR', exclude_columns=['SK_ID_PREV'])
credit_card_balance_agg_by_client.head()
SK_ID_CURR CREDIT_CARD_BALANCE_MONTHS_BALANCE_MIN_MIN CREDIT_CARD_BALANCE_MONTHS_BALANCE_MIN_MAX CREDIT_CARD_BALANCE_MONTHS_BALANCE_MIN_MEAN CREDIT_CARD_BALANCE_MONTHS_BALANCE_MIN_SUM CREDIT_CARD_BALANCE_MONTHS_BALANCE_MAX_MIN CREDIT_CARD_BALANCE_MONTHS_BALANCE_MAX_MAX CREDIT_CARD_BALANCE_MONTHS_BALANCE_MAX_MEAN CREDIT_CARD_BALANCE_MONTHS_BALANCE_MAX_SUM CREDIT_CARD_BALANCE_MONTHS_BALANCE_MEAN_MIN ... CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_SENT PROPOSAL_SUM_MEAN CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_SENT PROPOSAL_SUM_SUM CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_SIGNED_MEAN_MIN CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_SIGNED_MEAN_MAX CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_SIGNED_MEAN_MEAN CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_SIGNED_MEAN_SUM CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_SIGNED_SUM_MIN CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_SIGNED_SUM_MAX CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_SIGNED_SUM_MEAN CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_SIGNED_SUM_SUM
0 100006 -6 -6 -6.0 -6 -1 -1 -1.0 -1 -3.5 ... 0.0 0 0.0 0.0 0.0 0.0 0 0 0.0 0
1 100011 -75 -75 -75.0 -75 -2 -2 -2.0 -2 -38.5 ... 0.0 0 0.0 0.0 0.0 0.0 0 0 0.0 0
2 100013 -96 -96 -96.0 -96 -1 -1 -1.0 -1 -48.5 ... 0.0 0 0.0 0.0 0.0 0.0 0 0 0.0 0
3 100021 -18 -18 -18.0 -18 -2 -2 -2.0 -2 -10.0 ... 0.0 0 0.0 0.0 0.0 0.0 0 0 0.0 0
4 100023 -11 -11 -11.0 -11 -4 -4 -4.0 -4 -7.5 ... 0.0 0 0.0 0.0 0.0 0.0 0 0 0.0 0

5 rows × 377 columns
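
The column counts follow directly from the two aggregation levels. The per-loan numeric table has 20 numeric columns × 4 statistics + 2 keys = 82 columns, and the categorical table has 7 statuses × 2 statistics + 2 keys = 16. After merging on both keys there are 94 feature columns, and aggregating them again per client gives 94 × 4 + 1 = 377. The sketch below reproduces this two-level pattern on toy data. It assumes `agg_numeric` computes min, max, mean and sum and upper-cases the flattened names, as the column names above suggest; `agg_numeric_sketch` itself is a hypothetical stand-in, not the notebook's function.

```python
import pandas as pd


def agg_numeric_sketch(df, group_keys, prefix="", exclude_columns=()):
    keys = [group_keys] if isinstance(group_keys, str) else list(group_keys)
    numeric = df.drop(columns=list(exclude_columns)).select_dtypes("number")
    value_cols = [c for c in numeric.columns if c not in keys]
    agg = df.groupby(keys)[value_cols].agg(["min", "max", "mean", "sum"])
    # Flatten (column, statistic) pairs into PREFIX_COLUMN_STAT names.
    lead = f"{prefix}_" if prefix else ""
    agg.columns = [f"{lead}{col}_{stat}".upper() for col, stat in agg.columns]
    return agg.reset_index()


# Toy monthly balances (made-up values): client 1 has two loans, client 2 has one.
toy_balance = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2],
    "SK_ID_PREV": [10, 10, 11, 20],
    "AMT_BALANCE": [100.0, 300.0, 50.0, 0.0],
})

# Level 1: one row per (client, loan).
per_loan = agg_numeric_sketch(toy_balance, ["SK_ID_CURR", "SK_ID_PREV"], "cc")
# Level 2: one row per client, aggregating the per-loan statistics again.
per_client = agg_numeric_sketch(per_loan, "SK_ID_CURR", exclude_columns=["SK_ID_PREV"])
print(per_loan.shape, per_client.shape)
```

With one numeric column, level 1 yields 1 × 4 + 2 = 6 columns and level 2 yields 4 × 4 + 1 = 17, the same arithmetic as for the real table.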

credit_card_balance_agg_by_client.to_feather('checkpoints/02_credit_balance_agg.feather')
del credit_card_balance, credit_card_balance_numeric_agg, credit_card_balance_categorical_agg, credit_card_balance_agg
gc.collect()
0
print(train.shape, test.shape)
(307511, 764) (48744, 763)
missing_values_table(train)
Missing Values % of total values
COMMONAREA_MEDI 214865 69.872297
NONLIVINGAPARTMENTS_MEDI 213514 69.432963
FLOORSMIN_MEDI 208642 67.848630
YEARS_BUILD_MEDI 204488 66.497784
OWN_CAR_AGE 202929 65.990810
... ... ...
DEF_60_CNT_SOCIAL_CIRCLE 1021 0.332021
EXT_SOURCE_2 660 0.214626
AMT_ANNUITY 12 0.003902
CNT_FAM_MEMBERS 2 0.000650
DAYS_LAST_PHONE_CHANGE 1 0.000325

591 rows × 2 columns

train = pd.read_feather('checkpoints/01_train_app_base.feather')
test = pd.read_feather('checkpoints/01_test_app_base.feather')
credit_card_balance_agg = pd.read_feather('checkpoints/02_credit_balance_agg.feather')

train = pd.merge(train, credit_card_balance_agg, on='SK_ID_CURR', how='left')
test = pd.merge(test, credit_card_balance_agg, on='SK_ID_CURR', how='left')

print(f'train: {train.shape}, test: {test.shape}')
train, test = feature_select(train, test)
print(f'train: {train.shape}, test: {test.shape}')

train.to_feather('checkpoints/04_train_app_credit_cleaned.feather')
test.to_feather('checkpoints/04_test_app_credit_cleaned.feather')
train: (307511, 619), test: (48744, 618)
remove high missing cols.  (307511, 617) (48744, 617)
align train and test. (307511, 617) (48744, 617)
remove high corr cols. (307511, 264) (48744, 263)
train: (307511, 264), test: (48744, 263)
del credit_card_balance_agg
gc.collect()
0

Introducing the pos_cash_balance table#

pos_cash_balance = pd.read_csv('data/pos_cash_balance.csv')
pos_cash_balance_numeric_agg = agg_numeric(pos_cash_balance, ['SK_ID_CURR', 'SK_ID_PREV'], 'pos')
pos_cash_balance_numeric_agg
SK_ID_CURR SK_ID_PREV POS_MONTHS_BALANCE_MIN POS_MONTHS_BALANCE_MAX POS_MONTHS_BALANCE_MEAN POS_MONTHS_BALANCE_SUM POS_CNT_INSTALMENT_MIN POS_CNT_INSTALMENT_MAX POS_CNT_INSTALMENT_MEAN POS_CNT_INSTALMENT_SUM ... POS_CNT_INSTALMENT_FUTURE_MEAN POS_CNT_INSTALMENT_FUTURE_SUM POS_SK_DPD_MIN POS_SK_DPD_MAX POS_SK_DPD_MEAN POS_SK_DPD_SUM POS_SK_DPD_DEF_MIN POS_SK_DPD_DEF_MAX POS_SK_DPD_DEF_MEAN POS_SK_DPD_DEF_SUM
0 100001 1369693 -57 -53 -55.0 -275 4.0 4.0 4.000000 20.0 ... 2.000000 10.0 0 0 0.000000 0 0 0 0.000000 0
1 100001 1851984 -96 -93 -94.5 -378 4.0 4.0 4.000000 16.0 ... 0.750000 3.0 0 7 1.750000 7 0 7 1.750000 7
2 100002 1038818 -19 -1 -10.0 -190 24.0 24.0 24.000000 456.0 ... 15.000000 285.0 0 0 0.000000 0 0 0 0.000000 0
3 100003 1810518 -25 -18 -21.5 -172 7.0 12.0 11.375000 91.0 ... 7.875000 63.0 0 0 0.000000 0 0 0 0.000000 0
4 100003 2396755 -77 -66 -71.5 -858 12.0 12.0 12.000000 144.0 ... 6.500000 78.0 0 0 0.000000 0 0 0 0.000000 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
936320 456255 1359084 -15 -7 -11.0 -99 8.0 12.0 11.555556 104.0 ... 7.555556 68.0 0 0 0.000000 0 0 0 0.000000 0
936321 456255 1743609 -33 -23 -28.0 -308 10.0 12.0 11.818182 130.0 ... 6.818182 75.0 0 5 0.454545 5 0 5 0.454545 5
936322 456255 2073384 -21 -17 -19.0 -95 3.0 24.0 15.600000 78.0 ... 13.800000 69.0 0 0 0.000000 0 0 0 0.000000 0
936323 456255 2631384 -26 -2 -14.0 -350 24.0 36.0 35.520000 888.0 ... 23.520000 588.0 0 0 0.000000 0 0 0 0.000000 0
936324 456255 2729207 -16 -13 -14.5 -58 3.0 6.0 4.500000 18.0 ... 2.750000 11.0 0 0 0.000000 0 0 0 0.000000 0

936325 rows × 22 columns

pos_cash_balance_categorical_agg = agg_categorical(pos_cash_balance, ['SK_ID_CURR', 'SK_ID_PREV'], 'pos')
pos_cash_balance_categorical_agg
SK_ID_CURR SK_ID_PREV pos_NAME_CONTRACT_STATUS_Active_mean pos_NAME_CONTRACT_STATUS_Active_sum pos_NAME_CONTRACT_STATUS_Amortized debt_mean pos_NAME_CONTRACT_STATUS_Amortized debt_sum pos_NAME_CONTRACT_STATUS_Approved_mean pos_NAME_CONTRACT_STATUS_Approved_sum pos_NAME_CONTRACT_STATUS_Canceled_mean pos_NAME_CONTRACT_STATUS_Canceled_sum pos_NAME_CONTRACT_STATUS_Completed_mean pos_NAME_CONTRACT_STATUS_Completed_sum pos_NAME_CONTRACT_STATUS_Demand_mean pos_NAME_CONTRACT_STATUS_Demand_sum pos_NAME_CONTRACT_STATUS_Returned to the store_mean pos_NAME_CONTRACT_STATUS_Returned to the store_sum pos_NAME_CONTRACT_STATUS_Signed_mean pos_NAME_CONTRACT_STATUS_Signed_sum pos_NAME_CONTRACT_STATUS_XNA_mean pos_NAME_CONTRACT_STATUS_XNA_sum
0 100001 1369693 0.800000 4 0.0 0 0.0 0 0.0 0 0.200000 1 0.0 0 0.0 0 0.0 0 0.0 0
1 100001 1851984 0.750000 3 0.0 0 0.0 0 0.0 0 0.250000 1 0.0 0 0.0 0 0.0 0 0.0 0
2 100002 1038818 1.000000 19 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0
3 100003 1810518 0.875000 7 0.0 0 0.0 0 0.0 0 0.125000 1 0.0 0 0.0 0 0.0 0 0.0 0
4 100003 2396755 1.000000 12 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
936320 456255 1359084 0.888889 8 0.0 0 0.0 0 0.0 0 0.111111 1 0.0 0 0.0 0 0.0 0 0.0 0
936321 456255 1743609 0.909091 10 0.0 0 0.0 0 0.0 0 0.090909 1 0.0 0 0.0 0 0.0 0 0.0 0
936322 456255 2073384 0.800000 4 0.0 0 0.0 0 0.0 0 0.200000 1 0.0 0 0.0 0 0.0 0 0.0 0
936323 456255 2631384 0.960000 24 0.0 0 0.0 0 0.0 0 0.040000 1 0.0 0 0.0 0 0.0 0 0.0 0
936324 456255 2729207 0.750000 3 0.0 0 0.0 0 0.0 0 0.250000 1 0.0 0 0.0 0 0.0 0 0.0 0

936325 rows × 20 columns

pos_cash_balance_agg = pd.merge(pos_cash_balance_categorical_agg, pos_cash_balance_numeric_agg, 
    on = ['SK_ID_CURR', 'SK_ID_PREV'],
    how = 'left'
)
pos_cash_balance_agg_by_client = agg_numeric(pos_cash_balance_agg, 'SK_ID_CURR', exclude_columns=['SK_ID_PREV'])
pos_cash_balance_agg_by_client.to_feather('checkpoints/02_pos_agg.feather')
del pos_cash_balance, pos_cash_balance_numeric_agg, pos_cash_balance_categorical_agg, pos_cash_balance_agg
gc.collect()
0
train = pd.read_feather('checkpoints/01_train_app_base.feather')
test = pd.read_feather('checkpoints/01_test_app_base.feather')
pos_cash_balance_agg = pd.read_feather('checkpoints/02_pos_agg.feather')

train = pd.merge(train, pos_cash_balance_agg, on='SK_ID_CURR', how='left')
test = pd.merge(test, pos_cash_balance_agg, on='SK_ID_CURR', how='left')

print(f'train: {train.shape}, test: {test.shape}')
train, test = feature_select(train, test)
print(f'train: {train.shape}, test: {test.shape}')

train.to_feather('checkpoints/04_train_app_pos_cleaned.feather')
test.to_feather('checkpoints/04_test_app_pos_cleaned.feather')
train: (307511, 395), test: (48744, 394)
remove high missing cols.  (307511, 393) (48744, 393)
align train and test. (307511, 393) (48744, 393)
remove high corr cols. (307511, 288) (48744, 287)
train: (307511, 288), test: (48744, 287)

The installments_payments table#

installments_payments = pd.read_csv('data/installments_payments.csv')
installments_payments.dtypes
SK_ID_PREV                  int64
SK_ID_CURR                  int64
NUM_INSTALMENT_VERSION    float64
NUM_INSTALMENT_NUMBER       int64
DAYS_INSTALMENT           float64
DAYS_ENTRY_PAYMENT        float64
AMT_INSTALMENT            float64
AMT_PAYMENT               float64
dtype: object

This table has no categorical features; every column is numeric, so only agg_numeric is needed.
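
Rather than reading the `dtypes` listing by eye, the same conclusion can be checked in code. This is a small sketch; `split_columns_by_kind` is a hypothetical helper and the toy frame only borrows a few column names from the listing above.

```python
import pandas as pd


def split_columns_by_kind(df, keys):
    """Return (numeric_columns, categorical_columns), leaving out the ID keys."""
    value_cols = [c for c in df.columns if c not in keys]
    numeric = [c for c in value_cols if pd.api.types.is_numeric_dtype(df[c])]
    categorical = [c for c in value_cols if c not in numeric]
    return numeric, categorical


# Toy frame with made-up values; the real table is read from the CSV above.
installments_toy = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 20],
    "SK_ID_CURR": [1, 1, 2],
    "AMT_INSTALMENT": [100.0, 100.0, 250.0],
    "AMT_PAYMENT": [100.0, 90.0, 250.0],
})
numeric_cols, categorical_cols = split_columns_by_kind(
    installments_toy, ["SK_ID_CURR", "SK_ID_PREV"]
)
print(numeric_cols, categorical_cols)
```

An empty second list means the categorical aggregation step can be skipped for this table.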

installments_payments_numeric_agg = agg_numeric(installments_payments, ['SK_ID_CURR', 'SK_ID_PREV'], 'installments')
installments_payments_agg = installments_payments_numeric_agg
installments_payments_agg_by_client = agg_numeric(installments_payments_agg, 'SK_ID_CURR', exclude_columns=['SK_ID_PREV'])
installments_payments_agg_by_client.head()
SK_ID_CURR INSTALLMENTS_NUM_INSTALMENT_VERSION_MIN_MIN INSTALLMENTS_NUM_INSTALMENT_VERSION_MIN_MAX INSTALLMENTS_NUM_INSTALMENT_VERSION_MIN_MEAN INSTALLMENTS_NUM_INSTALMENT_VERSION_MIN_SUM INSTALLMENTS_NUM_INSTALMENT_VERSION_MAX_MIN INSTALLMENTS_NUM_INSTALMENT_VERSION_MAX_MAX INSTALLMENTS_NUM_INSTALMENT_VERSION_MAX_MEAN INSTALLMENTS_NUM_INSTALMENT_VERSION_MAX_SUM INSTALLMENTS_NUM_INSTALMENT_VERSION_MEAN_MIN ... INSTALLMENTS_AMT_PAYMENT_MAX_MEAN INSTALLMENTS_AMT_PAYMENT_MAX_SUM INSTALLMENTS_AMT_PAYMENT_MEAN_MIN INSTALLMENTS_AMT_PAYMENT_MEAN_MAX INSTALLMENTS_AMT_PAYMENT_MEAN_MEAN INSTALLMENTS_AMT_PAYMENT_MEAN_SUM INSTALLMENTS_AMT_PAYMENT_SUM_MIN INSTALLMENTS_AMT_PAYMENT_SUM_MAX INSTALLMENTS_AMT_PAYMENT_SUM_MEAN INSTALLMENTS_AMT_PAYMENT_SUM_SUM
0 100001 1.0 1.0 1.0 2.0 1.0 2.0 1.500000 3.0 1.000000 ... 10689.975 21379.950 3981.675000 7312.725000 5647.200000 11294.400000 11945.025 29250.900 20597.9625 41195.925
1 100002 1.0 1.0 1.0 1.0 2.0 2.0 2.000000 2.0 1.052632 ... 53093.745 53093.745 11559.247105 11559.247105 11559.247105 11559.247105 219625.695 219625.695 219625.6950 219625.695
2 100003 1.0 1.0 1.0 3.0 1.0 2.0 1.333333 4.0 1.000000 ... 210713.445 632140.335 6731.115000 164425.332857 78558.479286 235675.437857 80773.380 1150977.330 539621.5500 1618864.650
3 100004 1.0 1.0 1.0 1.0 2.0 2.0 2.000000 2.0 1.333333 ... 10573.965 10573.965 7096.155000 7096.155000 7096.155000 7096.155000 21288.465 21288.465 21288.4650 21288.465
4 100005 1.0 1.0 1.0 1.0 2.0 2.0 2.000000 2.0 1.111111 ... 17656.245 17656.245 6240.205000 6240.205000 6240.205000 6240.205000 56161.845 56161.845 56161.8450 56161.845

5 rows × 97 columns

del installments_payments, installments_payments_numeric_agg, installments_payments_agg
gc.collect()
0
installments_payments_agg_by_client.to_feather('checkpoints/02_installments_agg.feather')
train = pd.read_feather('checkpoints/01_train_app_base.feather')
test = pd.read_feather('checkpoints/01_test_app_base.feather')
installments_payments_agg = pd.read_feather('checkpoints/02_installments_agg.feather')

train = pd.merge(train, installments_payments_agg, on='SK_ID_CURR', how='left')
test = pd.merge(test, installments_payments_agg, on='SK_ID_CURR', how='left')

print(f'train: {train.shape}, test: {test.shape}')
train, test = feature_select(train, test)
print(f'train: {train.shape}, test: {test.shape}')

train.to_feather('checkpoints/04_train_app_installments_cleaned.feather')
test.to_feather('checkpoints/04_test_app_installments_cleaned.feather')
train: (307511, 339), test: (48744, 338)
remove high missing cols.  (307511, 337) (48744, 337)
align train and test. (307511, 337) (48744, 337)
remove high corr cols. (307511, 242) (48744, 241)
train: (307511, 242), test: (48744, 241)

Merging all app + child tables#

# Only inspect the shape of the app base tables
train = pd.read_feather('checkpoints/01_train_app_base.feather')
test = pd.read_feather('checkpoints/01_test_app_base.feather')
print('app', train.shape, test.shape)
app (307511, 243) (48744, 242)
train_app_previous = pd.read_feather('checkpoints/04_train_app_previous_cleaned.feather')
test_app_previous = pd.read_feather('checkpoints/04_test_app_previous_cleaned.feather')
print('app_previous', train_app_previous.shape, test_app_previous.shape)

train_app_credit = pd.read_feather('checkpoints/04_train_app_credit_cleaned.feather')
test_app_credit = pd.read_feather('checkpoints/04_test_app_credit_cleaned.feather')
print('app_credit', train_app_credit.shape, test_app_credit.shape)

train_app_pos = pd.read_feather('checkpoints/04_train_app_pos_cleaned.feather')
test_app_pos = pd.read_feather('checkpoints/04_test_app_pos_cleaned.feather')
print('app_pos', train_app_pos.shape, test_app_pos.shape)

train_app_install = pd.read_feather('checkpoints/04_train_app_installments_cleaned.feather')
test_app_install = pd.read_feather('checkpoints/04_test_app_installments_cleaned.feather')
print('app_installments', train_app_install.shape, test_app_install.shape)

train_app_bureau = pd.read_feather('checkpoints/04_train_app_bureau_balance_bureau_cleaned.feather')
test_app_bureau = pd.read_feather('checkpoints/04_test_app_bureau_balance_bureau_cleaned.feather')
print('app_bureau', train_app_bureau.shape, test_app_bureau.shape)
app_previous (307511, 764) (48744, 763)
app_credit (307511, 264) (48744, 263)
app_pos (307511, 288) (48744, 287)
app_installments (307511, 242) (48744, 241)
app_bureau (307511, 306) (48744, 305)
from functools import reduce
def merge_dataframes(dfs, key):
    res = dfs[0].copy()
    # Do not merge columns that are already present in the result
    for df in dfs[1:]:
        unique_cols = [col for col in df.columns if col not in res.columns] + [key]
        res = pd.merge(res, df[unique_cols], on=key, how='left')
    return res
train_dfs = [train_app_previous, train_app_credit, train_app_pos, train_app_install, train_app_bureau]
test_dfs = [test_app_previous, test_app_credit, test_app_pos, test_app_install, test_app_bureau]
train = merge_dataframes(train_dfs, key='SK_ID_CURR')
test = merge_dataframes(test_dfs, key='SK_ID_CURR')
print(train.shape, test.shape)
(307511, 1066) (48744, 1065)
train.to_feather('checkpoints/05_train_merged_v1.feather')
test.to_feather('checkpoints/05_test_merged_v1.feather')
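
To see what the merge step does with columns shared across the per-table files, here is a toy run of the same logic. The function is copied under the hypothetical name `merge_dataframes_sketch` so the snippet is self-contained, and all values are made up.

```python
import pandas as pd


def merge_dataframes_sketch(dfs, key):
    res = dfs[0].copy()
    for df in dfs[1:]:
        # Only bring in columns the result does not have yet, plus the join key.
        unique_cols = [col for col in df.columns if col not in res.columns] + [key]
        res = pd.merge(res, df[unique_cols], on=key, how="left")
    return res


# Both frames carry the app column AMT_CREDIT; only the second kept AMT_ANNUITY.
app_prev_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "AMT_CREDIT": [10, 20, 30],
    "PREV_X": [1.0, 2.0, 3.0],
})
app_pos_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "AMT_CREDIT": [10, 20, 30],
    "AMT_ANNUITY": [5, 6, 7],
    "POS_Y": [0.1, 0.2, 0.3],
})
merged_toy = merge_dataframes_sketch([app_prev_toy, app_pos_toy], key="SK_ID_CURR")
print(list(merged_toy.columns))
```

Shared columns such as `AMT_CREDIT` appear once and the row count is unchanged. One consequence worth keeping in mind: since feature selection ran separately for each app + child table, a column dropped in one file but kept in another (like `AMT_ANNUITY` here) comes back in the merged frame, and correlations between features from different child tables have not been checked.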

Modeling#

train = pd.read_feather('checkpoints/05_train_merged_v1.feather')
test = pd.read_feather('checkpoints/05_test_merged_v1.feather')

train_labels = train['TARGET']
train_ids = train['SK_ID_CURR']
test_ids = test['SK_ID_CURR']
train_features = train.drop(columns=['TARGET', 'SK_ID_CURR'])
test_features = test.drop(columns=['SK_ID_CURR'])
print(train.shape, test.shape)
(307511, 1066) (48744, 1065)
list(train.columns)
['CNT_CHILDREN',
 'AMT_INCOME_TOTAL',
 'AMT_CREDIT',
 'AMT_ANNUITY',
 'REGION_POPULATION_RELATIVE',
 'DAYS_BIRTH',
 'DAYS_REGISTRATION',
 'DAYS_ID_PUBLISH',
 'OWN_CAR_AGE',
 'FLAG_MOBIL',
 'FLAG_WORK_PHONE',
 'FLAG_CONT_MOBILE',
 'FLAG_PHONE',
 'FLAG_EMAIL',
 'CNT_FAM_MEMBERS',
 'REGION_RATING_CLIENT_W_CITY',
 'HOUR_APPR_PROCESS_START',
 'REG_REGION_NOT_LIVE_REGION',
 'REG_REGION_NOT_WORK_REGION',
 'LIVE_REGION_NOT_WORK_REGION',
 'REG_CITY_NOT_LIVE_CITY',
 'REG_CITY_NOT_WORK_CITY',
 'LIVE_CITY_NOT_WORK_CITY',
 'EXT_SOURCE_1',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'BASEMENTAREA_MEDI',
 'YEARS_BEGINEXPLUATATION_MEDI',
 'YEARS_BUILD_MEDI',
 'COMMONAREA_MEDI',
 'ELEVATORS_MEDI',
 'ENTRANCES_MEDI',
 'FLOORSMAX_MEDI',
 'FLOORSMIN_MEDI',
 'LANDAREA_MEDI',
 'NONLIVINGAPARTMENTS_MEDI',
 'NONLIVINGAREA_MEDI',
 'TOTALAREA_MODE',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'OBS_60_CNT_SOCIAL_CIRCLE',
 'DEF_60_CNT_SOCIAL_CIRCLE',
 'DAYS_LAST_PHONE_CHANGE',
 'FLAG_DOCUMENT_2',
 'FLAG_DOCUMENT_3',
 'FLAG_DOCUMENT_4',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_6',
 'FLAG_DOCUMENT_7',
 'FLAG_DOCUMENT_8',
 'FLAG_DOCUMENT_9',
 'FLAG_DOCUMENT_10',
 'FLAG_DOCUMENT_11',
 'FLAG_DOCUMENT_12',
 'FLAG_DOCUMENT_13',
 'FLAG_DOCUMENT_14',
 'FLAG_DOCUMENT_15',
 'FLAG_DOCUMENT_16',
 'FLAG_DOCUMENT_17',
 'FLAG_DOCUMENT_18',
 'FLAG_DOCUMENT_19',
 'FLAG_DOCUMENT_20',
 'FLAG_DOCUMENT_21',
 'AMT_REQ_CREDIT_BUREAU_HOUR',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_YEAR',
 'NAME_CONTRACT_TYPE_Revolving loans',
 'CODE_GENDER_M',
 'FLAG_OWN_CAR_Y',
 'FLAG_OWN_REALTY_Y',
 'NAME_TYPE_SUITE_Children',
 'NAME_TYPE_SUITE_Family',
 'NAME_TYPE_SUITE_Group of people',
 'NAME_TYPE_SUITE_Other_A',
 'NAME_TYPE_SUITE_Other_B',
 'NAME_TYPE_SUITE_Spouse, partner',
 'NAME_TYPE_SUITE_Unaccompanied',
 'NAME_INCOME_TYPE_Businessman',
 'NAME_INCOME_TYPE_Commercial associate',
 'NAME_INCOME_TYPE_State servant',
 'NAME_INCOME_TYPE_Student',
 'NAME_INCOME_TYPE_Unemployed',
 'NAME_INCOME_TYPE_Working',
 'NAME_EDUCATION_TYPE_Academic degree',
 'NAME_EDUCATION_TYPE_Higher education',
 'NAME_EDUCATION_TYPE_Incomplete higher',
 'NAME_EDUCATION_TYPE_Lower secondary',
 'NAME_EDUCATION_TYPE_Secondary / secondary special',
 'NAME_FAMILY_STATUS_Civil marriage',
 'NAME_FAMILY_STATUS_Married',
 'NAME_FAMILY_STATUS_Separated',
 'NAME_FAMILY_STATUS_Single / not married',
 'NAME_FAMILY_STATUS_Widow',
 'NAME_HOUSING_TYPE_Co-op apartment',
 'NAME_HOUSING_TYPE_House / apartment',
 'NAME_HOUSING_TYPE_Municipal apartment',
 'NAME_HOUSING_TYPE_Office apartment',
 'NAME_HOUSING_TYPE_Rented apartment',
 'NAME_HOUSING_TYPE_With parents',
 'OCCUPATION_TYPE_Accountants',
 'OCCUPATION_TYPE_Cleaning staff',
 'OCCUPATION_TYPE_Cooking staff',
 'OCCUPATION_TYPE_Core staff',
 'OCCUPATION_TYPE_Drivers',
 'OCCUPATION_TYPE_HR staff',
 'OCCUPATION_TYPE_High skill tech staff',
 'OCCUPATION_TYPE_IT staff',
 'OCCUPATION_TYPE_Laborers',
 'OCCUPATION_TYPE_Low-skill Laborers',
 'OCCUPATION_TYPE_Managers',
 'OCCUPATION_TYPE_Medicine staff',
 'OCCUPATION_TYPE_Private service staff',
 'OCCUPATION_TYPE_Realty agents',
 'OCCUPATION_TYPE_Sales staff',
 'OCCUPATION_TYPE_Secretaries',
 'OCCUPATION_TYPE_Security staff',
 'OCCUPATION_TYPE_Waiters/barmen staff',
 'WEEKDAY_APPR_PROCESS_START_FRIDAY',
 'WEEKDAY_APPR_PROCESS_START_MONDAY',
 'WEEKDAY_APPR_PROCESS_START_SATURDAY',
 'WEEKDAY_APPR_PROCESS_START_SUNDAY',
 'WEEKDAY_APPR_PROCESS_START_THURSDAY',
 'WEEKDAY_APPR_PROCESS_START_TUESDAY',
 'WEEKDAY_APPR_PROCESS_START_WEDNESDAY',
 'ORGANIZATION_TYPE_Advertising',
 'ORGANIZATION_TYPE_Agriculture',
 'ORGANIZATION_TYPE_Bank',
 'ORGANIZATION_TYPE_Business Entity Type 1',
 'ORGANIZATION_TYPE_Business Entity Type 2',
 'ORGANIZATION_TYPE_Business Entity Type 3',
 'ORGANIZATION_TYPE_Cleaning',
 'ORGANIZATION_TYPE_Construction',
 'ORGANIZATION_TYPE_Culture',
 'ORGANIZATION_TYPE_Electricity',
 'ORGANIZATION_TYPE_Emergency',
 'ORGANIZATION_TYPE_Government',
 'ORGANIZATION_TYPE_Hotel',
 'ORGANIZATION_TYPE_Housing',
 'ORGANIZATION_TYPE_Industry: type 1',
 'ORGANIZATION_TYPE_Industry: type 10',
 'ORGANIZATION_TYPE_Industry: type 11',
 'ORGANIZATION_TYPE_Industry: type 12',
 'ORGANIZATION_TYPE_Industry: type 13',
 'ORGANIZATION_TYPE_Industry: type 2',
 'ORGANIZATION_TYPE_Industry: type 3',
 'ORGANIZATION_TYPE_Industry: type 4',
 'ORGANIZATION_TYPE_Industry: type 5',
 'ORGANIZATION_TYPE_Industry: type 6',
 'ORGANIZATION_TYPE_Industry: type 7',
 'ORGANIZATION_TYPE_Industry: type 8',
 'ORGANIZATION_TYPE_Industry: type 9',
 'ORGANIZATION_TYPE_Insurance',
 'ORGANIZATION_TYPE_Kindergarten',
 'ORGANIZATION_TYPE_Legal Services',
 'ORGANIZATION_TYPE_Medicine',
 'ORGANIZATION_TYPE_Military',
 'ORGANIZATION_TYPE_Mobile',
 'ORGANIZATION_TYPE_Other',
 'ORGANIZATION_TYPE_Police',
 'ORGANIZATION_TYPE_Postal',
 'ORGANIZATION_TYPE_Realtor',
 'ORGANIZATION_TYPE_Religion',
 'ORGANIZATION_TYPE_Restaurant',
 'ORGANIZATION_TYPE_School',
 'ORGANIZATION_TYPE_Security',
 'ORGANIZATION_TYPE_Security Ministries',
 'ORGANIZATION_TYPE_Self-employed',
 'ORGANIZATION_TYPE_Services',
 'ORGANIZATION_TYPE_Telecom',
 'ORGANIZATION_TYPE_Trade: type 1',
 'ORGANIZATION_TYPE_Trade: type 2',
 'ORGANIZATION_TYPE_Trade: type 3',
 'ORGANIZATION_TYPE_Trade: type 4',
 'ORGANIZATION_TYPE_Trade: type 5',
 'ORGANIZATION_TYPE_Trade: type 6',
 'ORGANIZATION_TYPE_Trade: type 7',
 'ORGANIZATION_TYPE_Transport: type 1',
 'ORGANIZATION_TYPE_Transport: type 2',
 'ORGANIZATION_TYPE_Transport: type 3',
 'ORGANIZATION_TYPE_Transport: type 4',
 'ORGANIZATION_TYPE_University',
 'ORGANIZATION_TYPE_XNA',
 'FONDKAPREMONT_MODE_not specified',
 'FONDKAPREMONT_MODE_org spec account',
 'FONDKAPREMONT_MODE_reg oper account',
 'FONDKAPREMONT_MODE_reg oper spec account',
 'HOUSETYPE_MODE_specific housing',
 'HOUSETYPE_MODE_terraced house',
 'WALLSMATERIAL_MODE_Block',
 'WALLSMATERIAL_MODE_Mixed',
 'WALLSMATERIAL_MODE_Monolithic',
 'WALLSMATERIAL_MODE_Others',
 'WALLSMATERIAL_MODE_Panel',
 'WALLSMATERIAL_MODE_Stone, brick',
 'WALLSMATERIAL_MODE_Wooden',
 'EMERGENCYSTATE_MODE_No',
 'EMERGENCYSTATE_MODE_Yes',
 'PREVIOUS_AMT_ANNUITY_MEAN_MIN',
 'PREVIOUS_AMT_ANNUITY_SUM_MIN',
 'PREVIOUS_AMT_ANNUITY_SUM_MAX',
 'PREVIOUS_AMT_ANNUITY_SUM_MEAN',
 'PREVIOUS_AMT_DOWN_PAYMENT_MEAN_MIN',
 'PREVIOUS_AMT_DOWN_PAYMENT_MEAN_MEAN',
 'PREVIOUS_AMT_DOWN_PAYMENT_SUM_MIN',
 'PREVIOUS_AMT_DOWN_PAYMENT_SUM_MEAN',
 'PREVIOUS_AMT_DOWN_PAYMENT_SUM_SUM',
 'PREVIOUS_AMT_GOODS_PRICE_MEAN_MIN',
 'PREVIOUS_AMT_GOODS_PRICE_SUM_MIN',
 'PREVIOUS_AMT_GOODS_PRICE_SUM_MAX',
 'PREVIOUS_AMT_GOODS_PRICE_SUM_MEAN',
 'PREVIOUS_AMT_GOODS_PRICE_SUM_SUM',
 'PREVIOUS_HOUR_APPR_PROCESS_START_SUM_MIN',
 'PREVIOUS_HOUR_APPR_PROCESS_START_SUM_MAX',
 'PREVIOUS_HOUR_APPR_PROCESS_START_SUM_MEAN',
 'PREVIOUS_NFLAG_LAST_APPL_IN_DAY_SUM_MIN',
 'PREVIOUS_NFLAG_LAST_APPL_IN_DAY_SUM_MAX',
 'PREVIOUS_NFLAG_LAST_APPL_IN_DAY_SUM_MEAN',
 'PREVIOUS_RATE_DOWN_PAYMENT_MEAN_MIN',
 'PREVIOUS_RATE_DOWN_PAYMENT_MEAN_MEAN',
 'PREVIOUS_RATE_DOWN_PAYMENT_SUM_MIN',
 'PREVIOUS_RATE_DOWN_PAYMENT_SUM_MAX',
 'PREVIOUS_RATE_DOWN_PAYMENT_SUM_MEAN',
 'PREVIOUS_RATE_DOWN_PAYMENT_SUM_SUM',
 'PREVIOUS_RATE_INTEREST_PRIMARY_SUM_MIN',
 'PREVIOUS_RATE_INTEREST_PRIMARY_SUM_MEAN',
 'PREVIOUS_RATE_INTEREST_PRIMARY_SUM_SUM',
 'PREVIOUS_RATE_INTEREST_PRIVILEGED_SUM_MIN',
 'PREVIOUS_RATE_INTEREST_PRIVILEGED_SUM_MEAN',
 'PREVIOUS_RATE_INTEREST_PRIVILEGED_SUM_SUM',
 'PREVIOUS_DAYS_DECISION_SUM_MIN',
 'PREVIOUS_DAYS_DECISION_SUM_MAX',
 'PREVIOUS_DAYS_DECISION_SUM_MEAN',
 'PREVIOUS_DAYS_DECISION_SUM_SUM',
 'PREVIOUS_SELLERPLACE_AREA_SUM_MEAN',
 'PREVIOUS_SELLERPLACE_AREA_SUM_SUM',
 'PREVIOUS_CNT_PAYMENT_MEAN_MIN',
 'PREVIOUS_CNT_PAYMENT_MEAN_MEAN',
 'PREVIOUS_CNT_PAYMENT_SUM_MIN',
 'PREVIOUS_CNT_PAYMENT_SUM_MAX',
 'PREVIOUS_CNT_PAYMENT_SUM_MEAN',
 'PREVIOUS_CNT_PAYMENT_SUM_SUM',
 'PREVIOUS_DAYS_FIRST_DRAWING_MEAN_MIN',
 'PREVIOUS_DAYS_FIRST_DRAWING_MEAN_MEAN',
 'PREVIOUS_DAYS_FIRST_DRAWING_SUM_MIN',
 'PREVIOUS_DAYS_FIRST_DRAWING_SUM_MAX',
 'PREVIOUS_DAYS_FIRST_DRAWING_SUM_MEAN',
 'PREVIOUS_DAYS_FIRST_DUE_MEAN_MIN',
 'PREVIOUS_DAYS_FIRST_DUE_SUM_MIN',
 'PREVIOUS_DAYS_FIRST_DUE_SUM_MEAN',
 'PREVIOUS_DAYS_FIRST_DUE_SUM_SUM',
 'PREVIOUS_DAYS_LAST_DUE_1ST_VERSION_SUM_MEAN',
 'PREVIOUS_DAYS_LAST_DUE_1ST_VERSION_SUM_SUM',
 'PREVIOUS_DAYS_TERMINATION_MEAN_MIN',
 'PREVIOUS_DAYS_TERMINATION_MEAN_MEAN',
 'PREVIOUS_DAYS_TERMINATION_SUM_MIN',
 'PREVIOUS_DAYS_TERMINATION_SUM_MAX',
 'PREVIOUS_DAYS_TERMINATION_SUM_MEAN',
 'PREVIOUS_DAYS_TERMINATION_SUM_SUM',
 'PREVIOUS_NFLAG_INSURED_ON_APPROVAL_MEAN_MIN',
 'PREVIOUS_NFLAG_INSURED_ON_APPROVAL_MEAN_MEAN',
 'PREVIOUS_NFLAG_INSURED_ON_APPROVAL_SUM_MIN',
 'PREVIOUS_NFLAG_INSURED_ON_APPROVAL_SUM_MAX',
 'PREVIOUS_NFLAG_INSURED_ON_APPROVAL_SUM_MEAN',
 'PREVIOUS_NFLAG_INSURED_ON_APPROVAL_SUM_SUM',
 'PREVIOUS_NAME_CONTRACT_TYPE_REVOLVING LOANS_SUM_SUM',
 'PREVIOUS_NAME_CONTRACT_TYPE_XNA_MEAN_MIN',
 'PREVIOUS_NAME_CONTRACT_TYPE_XNA_SUM_MIN',
 'PREVIOUS_NAME_CONTRACT_TYPE_XNA_SUM_MEAN',
 'PREVIOUS_NAME_CONTRACT_TYPE_XNA_SUM_SUM',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_FRIDAY_SUM_MIN',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_FRIDAY_SUM_MAX',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_FRIDAY_SUM_MEAN',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_FRIDAY_SUM_SUM',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_MONDAY_SUM_MIN',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_MONDAY_SUM_MAX',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_MONDAY_SUM_MEAN',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_MONDAY_SUM_SUM',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_SATURDAY_SUM_MIN',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_SATURDAY_SUM_MAX',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_SATURDAY_SUM_MEAN',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_SATURDAY_SUM_SUM',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_SUNDAY_SUM_MIN',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_SUNDAY_SUM_MAX',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_SUNDAY_SUM_MEAN',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_SUNDAY_SUM_SUM',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_THURSDAY_SUM_MIN',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_THURSDAY_SUM_MAX',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_THURSDAY_SUM_MEAN',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_THURSDAY_SUM_SUM',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_TUESDAY_SUM_MIN',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_TUESDAY_SUM_MAX',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_TUESDAY_SUM_MEAN',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_TUESDAY_SUM_SUM',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_WEDNESDAY_SUM_MIN',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_WEDNESDAY_SUM_MAX',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_WEDNESDAY_SUM_MEAN',
 'PREVIOUS_WEEKDAY_APPR_PROCESS_START_WEDNESDAY_SUM_SUM',
 'PREVIOUS_FLAG_LAST_APPL_PER_CONTRACT_N_MEAN_MIN',
 'PREVIOUS_FLAG_LAST_APPL_PER_CONTRACT_N_SUM_MIN',
 'PREVIOUS_FLAG_LAST_APPL_PER_CONTRACT_N_SUM_SUM',
 'PREVIOUS_FLAG_LAST_APPL_PER_CONTRACT_Y_MEAN_MAX',
 'PREVIOUS_FLAG_LAST_APPL_PER_CONTRACT_Y_SUM_MIN',
 'PREVIOUS_FLAG_LAST_APPL_PER_CONTRACT_Y_SUM_MAX',
 'PREVIOUS_FLAG_LAST_APPL_PER_CONTRACT_Y_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUILDING A HOUSE OR AN ANNEX_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUILDING A HOUSE OR AN ANNEX_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUILDING A HOUSE OR AN ANNEX_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUILDING A HOUSE OR AN ANNEX_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUSINESS DEVELOPMENT_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUSINESS DEVELOPMENT_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUSINESS DEVELOPMENT_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A GARAGE_MEAN_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A GARAGE_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A GARAGE_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A GARAGE_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A HOLIDAY HOME / LAND_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A HOLIDAY HOME / LAND_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A HOLIDAY HOME / LAND_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A HOLIDAY HOME / LAND_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A HOME_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A HOME_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A HOME_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A HOME_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A NEW CAR_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A NEW CAR_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A NEW CAR_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A NEW CAR_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A USED CAR_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A USED CAR_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A USED CAR_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_BUYING A USED CAR_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_CAR REPAIRS_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_CAR REPAIRS_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_CAR REPAIRS_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_EDUCATION_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_EDUCATION_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_EDUCATION_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_EDUCATION_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_EVERYDAY EXPENSES_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_EVERYDAY EXPENSES_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_EVERYDAY EXPENSES_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_FURNITURE_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_FURNITURE_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_FURNITURE_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_FURNITURE_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_GASIFICATION / WATER SUPPLY_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_GASIFICATION / WATER SUPPLY_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_GASIFICATION / WATER SUPPLY_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_GASIFICATION / WATER SUPPLY_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_HOBBY_MEAN_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_HOBBY_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_HOBBY_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_HOBBY_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_JOURNEY_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_JOURNEY_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_JOURNEY_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_MEDICINE_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_MEDICINE_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_MEDICINE_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_MEDICINE_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_MONEY FOR A THIRD PERSON_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_MONEY FOR A THIRD PERSON_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_OTHER_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_OTHER_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_OTHER_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_OTHER_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_PAYMENTS ON OTHER LOANS_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_PAYMENTS ON OTHER LOANS_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_PAYMENTS ON OTHER LOANS_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_PAYMENTS ON OTHER LOANS_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_PURCHASE OF ELECTRONIC EQUIPMENT_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_PURCHASE OF ELECTRONIC EQUIPMENT_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_PURCHASE OF ELECTRONIC EQUIPMENT_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_PURCHASE OF ELECTRONIC EQUIPMENT_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_REFUSAL TO NAME THE GOAL_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_REFUSAL TO NAME THE GOAL_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_REPAIRS_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_REPAIRS_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_REPAIRS_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_REPAIRS_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_URGENT NEEDS_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_URGENT NEEDS_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_URGENT NEEDS_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_URGENT NEEDS_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_WEDDING / GIFT / HOLIDAY_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_WEDDING / GIFT / HOLIDAY_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_WEDDING / GIFT / HOLIDAY_SUM_MEAN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_WEDDING / GIFT / HOLIDAY_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_XAP_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_XAP_SUM_SUM',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_XNA_SUM_MIN',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_XNA_SUM_MAX',
 'PREVIOUS_NAME_CASH_LOAN_PURPOSE_XNA_SUM_MEAN',
 'PREVIOUS_NAME_CONTRACT_STATUS_APPROVED_SUM_MIN',
 'PREVIOUS_NAME_CONTRACT_STATUS_APPROVED_SUM_MAX',
 'PREVIOUS_NAME_CONTRACT_STATUS_APPROVED_SUM_MEAN',
 'PREVIOUS_NAME_CONTRACT_STATUS_APPROVED_SUM_SUM',
 'PREVIOUS_NAME_CONTRACT_STATUS_CANCELED_SUM_MIN',
 'PREVIOUS_NAME_CONTRACT_STATUS_CANCELED_SUM_MAX',
 'PREVIOUS_NAME_CONTRACT_STATUS_CANCELED_SUM_MEAN',
 'PREVIOUS_NAME_CONTRACT_STATUS_REFUSED_SUM_MIN',
 'PREVIOUS_NAME_CONTRACT_STATUS_REFUSED_SUM_SUM',
 'PREVIOUS_NAME_PAYMENT_TYPE_CASH THROUGH THE BANK_SUM_SUM',
 'PREVIOUS_NAME_PAYMENT_TYPE_CASHLESS FROM THE ACCOUNT OF THE EMPLOYER_SUM_MIN',
 'PREVIOUS_NAME_PAYMENT_TYPE_CASHLESS FROM THE ACCOUNT OF THE EMPLOYER_SUM_MEAN',
 'PREVIOUS_NAME_PAYMENT_TYPE_CASHLESS FROM THE ACCOUNT OF THE EMPLOYER_SUM_SUM',
 'PREVIOUS_NAME_PAYMENT_TYPE_NON-CASH FROM YOUR ACCOUNT_SUM_MIN',
 'PREVIOUS_NAME_PAYMENT_TYPE_NON-CASH FROM YOUR ACCOUNT_SUM_MAX',
 'PREVIOUS_NAME_PAYMENT_TYPE_NON-CASH FROM YOUR ACCOUNT_SUM_MEAN',
 'PREVIOUS_NAME_PAYMENT_TYPE_NON-CASH FROM YOUR ACCOUNT_SUM_SUM',
 'PREVIOUS_NAME_PAYMENT_TYPE_XNA_SUM_MIN',
 'PREVIOUS_NAME_PAYMENT_TYPE_XNA_SUM_MAX',
 'PREVIOUS_NAME_PAYMENT_TYPE_XNA_SUM_MEAN',
 'PREVIOUS_CODE_REJECT_REASON_CLIENT_SUM_MIN',
 'PREVIOUS_CODE_REJECT_REASON_CLIENT_SUM_MEAN',
 'PREVIOUS_CODE_REJECT_REASON_CLIENT_SUM_SUM',
 'PREVIOUS_CODE_REJECT_REASON_HC_SUM_MIN',
 'PREVIOUS_CODE_REJECT_REASON_HC_SUM_MAX',
 'PREVIOUS_CODE_REJECT_REASON_HC_SUM_MEAN',
 'PREVIOUS_CODE_REJECT_REASON_HC_SUM_SUM',
 'PREVIOUS_CODE_REJECT_REASON_LIMIT_SUM_MIN',
 'PREVIOUS_CODE_REJECT_REASON_LIMIT_SUM_MAX',
 'PREVIOUS_CODE_REJECT_REASON_LIMIT_SUM_MEAN',
 'PREVIOUS_CODE_REJECT_REASON_LIMIT_SUM_SUM',
 'PREVIOUS_CODE_REJECT_REASON_SCO_SUM_MIN',
 'PREVIOUS_CODE_REJECT_REASON_SCO_SUM_MAX',
 'PREVIOUS_CODE_REJECT_REASON_SCO_SUM_MEAN',
 'PREVIOUS_CODE_REJECT_REASON_SCO_SUM_SUM',
 'PREVIOUS_CODE_REJECT_REASON_SCOFR_MEAN_MIN',
 'PREVIOUS_CODE_REJECT_REASON_SCOFR_SUM_MIN',
 'PREVIOUS_CODE_REJECT_REASON_SCOFR_SUM_MAX',
 'PREVIOUS_CODE_REJECT_REASON_SCOFR_SUM_MEAN',
 'PREVIOUS_CODE_REJECT_REASON_SCOFR_SUM_SUM',
 'PREVIOUS_CODE_REJECT_REASON_SYSTEM_SUM_MIN',
 'PREVIOUS_CODE_REJECT_REASON_SYSTEM_SUM_MAX',
 'PREVIOUS_CODE_REJECT_REASON_SYSTEM_SUM_MEAN',
 'PREVIOUS_CODE_REJECT_REASON_SYSTEM_SUM_SUM',
 'PREVIOUS_CODE_REJECT_REASON_VERIF_SUM_MIN',
 'PREVIOUS_CODE_REJECT_REASON_VERIF_SUM_MAX',
 'PREVIOUS_CODE_REJECT_REASON_VERIF_SUM_MEAN',
 'PREVIOUS_CODE_REJECT_REASON_VERIF_SUM_SUM',
 'PREVIOUS_CODE_REJECT_REASON_XAP_SUM_MIN',
 'PREVIOUS_CODE_REJECT_REASON_XAP_SUM_MAX',
 'PREVIOUS_CODE_REJECT_REASON_XAP_SUM_MEAN',
 'PREVIOUS_CODE_REJECT_REASON_XNA_SUM_MIN',
 'PREVIOUS_CODE_REJECT_REASON_XNA_SUM_MAX',
 'PREVIOUS_CODE_REJECT_REASON_XNA_SUM_MEAN',
 'PREVIOUS_CODE_REJECT_REASON_XNA_SUM_SUM',
 'PREVIOUS_NAME_TYPE_SUITE_CHILDREN_SUM_MIN',
 'PREVIOUS_NAME_TYPE_SUITE_CHILDREN_SUM_MAX',
 'PREVIOUS_NAME_TYPE_SUITE_CHILDREN_SUM_MEAN',
 'PREVIOUS_NAME_TYPE_SUITE_CHILDREN_SUM_SUM',
 'PREVIOUS_NAME_TYPE_SUITE_FAMILY_SUM_MIN',
 'PREVIOUS_NAME_TYPE_SUITE_FAMILY_SUM_MAX',
 'PREVIOUS_NAME_TYPE_SUITE_FAMILY_SUM_MEAN',
 'PREVIOUS_NAME_TYPE_SUITE_FAMILY_SUM_SUM',
 'PREVIOUS_NAME_TYPE_SUITE_GROUP OF PEOPLE_SUM_MIN',
 'PREVIOUS_NAME_TYPE_SUITE_GROUP OF PEOPLE_SUM_MEAN',
 'PREVIOUS_NAME_TYPE_SUITE_GROUP OF PEOPLE_SUM_SUM',
 'PREVIOUS_NAME_TYPE_SUITE_OTHER_A_SUM_MIN',
 'PREVIOUS_NAME_TYPE_SUITE_OTHER_A_SUM_MEAN',
 'PREVIOUS_NAME_TYPE_SUITE_OTHER_A_SUM_SUM',
 'PREVIOUS_NAME_TYPE_SUITE_OTHER_B_SUM_MIN',
 'PREVIOUS_NAME_TYPE_SUITE_OTHER_B_SUM_MEAN',
 'PREVIOUS_NAME_TYPE_SUITE_OTHER_B_SUM_SUM',
 'PREVIOUS_NAME_TYPE_SUITE_SPOUSE, PARTNER_SUM_MIN',
 'PREVIOUS_NAME_TYPE_SUITE_SPOUSE, PARTNER_SUM_MAX',
 'PREVIOUS_NAME_TYPE_SUITE_SPOUSE, PARTNER_SUM_MEAN',
 'PREVIOUS_NAME_TYPE_SUITE_SPOUSE, PARTNER_SUM_SUM',
 'PREVIOUS_NAME_TYPE_SUITE_UNACCOMPANIED_SUM_MIN',
 'PREVIOUS_NAME_TYPE_SUITE_UNACCOMPANIED_SUM_MAX',
 'PREVIOUS_NAME_TYPE_SUITE_UNACCOMPANIED_SUM_MEAN',
 'PREVIOUS_NAME_TYPE_SUITE_UNACCOMPANIED_SUM_SUM',
 'PREVIOUS_NAME_CLIENT_TYPE_NEW_SUM_MIN',
 'PREVIOUS_NAME_CLIENT_TYPE_NEW_SUM_MAX',
 'PREVIOUS_NAME_CLIENT_TYPE_NEW_SUM_MEAN',
 'PREVIOUS_NAME_CLIENT_TYPE_NEW_SUM_SUM',
 'PREVIOUS_NAME_CLIENT_TYPE_REFRESHED_SUM_MIN',
 'PREVIOUS_NAME_CLIENT_TYPE_REFRESHED_SUM_MAX',
 'PREVIOUS_NAME_CLIENT_TYPE_REFRESHED_SUM_MEAN',
 'PREVIOUS_NAME_CLIENT_TYPE_REFRESHED_SUM_SUM',
 'PREVIOUS_NAME_CLIENT_TYPE_REPEATER_SUM_MIN',
 'PREVIOUS_NAME_CLIENT_TYPE_REPEATER_SUM_MAX',
 'PREVIOUS_NAME_CLIENT_TYPE_REPEATER_SUM_MEAN',
 'PREVIOUS_NAME_CLIENT_TYPE_XNA_SUM_MIN',
 'PREVIOUS_NAME_CLIENT_TYPE_XNA_SUM_MAX',
 'PREVIOUS_NAME_CLIENT_TYPE_XNA_SUM_MEAN',
 'PREVIOUS_NAME_CLIENT_TYPE_XNA_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_ADDITIONAL SERVICE_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_ADDITIONAL SERVICE_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_ADDITIONAL SERVICE_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_ANIMALS_MEAN_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_ANIMALS_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_ANIMALS_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_AUDIO/VIDEO_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_AUDIO/VIDEO_SUM_MAX',
 'PREVIOUS_NAME_GOODS_CATEGORY_AUDIO/VIDEO_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_AUDIO/VIDEO_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_AUTO ACCESSORIES_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_AUTO ACCESSORIES_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_AUTO ACCESSORIES_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_COMPUTERS_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_COMPUTERS_SUM_MAX',
 'PREVIOUS_NAME_GOODS_CATEGORY_COMPUTERS_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_COMPUTERS_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_CONSTRUCTION MATERIALS_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_CONSTRUCTION MATERIALS_SUM_MAX',
 'PREVIOUS_NAME_GOODS_CATEGORY_CONSTRUCTION MATERIALS_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_CONSTRUCTION MATERIALS_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_CONSUMER ELECTRONICS_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_CONSUMER ELECTRONICS_SUM_MAX',
 'PREVIOUS_NAME_GOODS_CATEGORY_CONSUMER ELECTRONICS_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_CONSUMER ELECTRONICS_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_DIRECT SALES_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_DIRECT SALES_SUM_MAX',
 'PREVIOUS_NAME_GOODS_CATEGORY_DIRECT SALES_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_DIRECT SALES_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_EDUCATION_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_EDUCATION_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_EDUCATION_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_FITNESS_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_FITNESS_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_FURNITURE_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_FURNITURE_SUM_MAX',
 'PREVIOUS_NAME_GOODS_CATEGORY_FURNITURE_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_GARDENING_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_GARDENING_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_GARDENING_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_HOMEWARES_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_HOMEWARES_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_HOMEWARES_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_HOUSE CONSTRUCTION_MEAN_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_HOUSE CONSTRUCTION_MEAN_MAX',
 'PREVIOUS_NAME_GOODS_CATEGORY_HOUSE CONSTRUCTION_MEAN_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_HOUSE CONSTRUCTION_MEAN_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_HOUSE CONSTRUCTION_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_HOUSE CONSTRUCTION_SUM_MAX',
 'PREVIOUS_NAME_GOODS_CATEGORY_HOUSE CONSTRUCTION_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_HOUSE CONSTRUCTION_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_INSURANCE_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_INSURANCE_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_INSURANCE_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_JEWELRY_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_JEWELRY_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_JEWELRY_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_MEDICAL SUPPLIES_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_MEDICAL SUPPLIES_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_MEDICAL SUPPLIES_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_MEDICINE_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_MEDICINE_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_MEDICINE_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_MOBILE_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_MOBILE_SUM_MAX',
 'PREVIOUS_NAME_GOODS_CATEGORY_MOBILE_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_MOBILE_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_OFFICE APPLIANCES_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_OFFICE APPLIANCES_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_OFFICE APPLIANCES_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_OTHER_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_OTHER_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_OTHER_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_PHOTO / CINEMA EQUIPMENT_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_PHOTO / CINEMA EQUIPMENT_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_PHOTO / CINEMA EQUIPMENT_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_SPORT AND LEISURE_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_SPORT AND LEISURE_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_SPORT AND LEISURE_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_TOURISM_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_TOURISM_SUM_MAX',
 'PREVIOUS_NAME_GOODS_CATEGORY_TOURISM_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_TOURISM_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_VEHICLES_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_VEHICLES_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_VEHICLES_SUM_SUM',
 'PREVIOUS_NAME_GOODS_CATEGORY_WEAPON_MEAN_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_WEAPON_SUM_MIN',
 'PREVIOUS_NAME_GOODS_CATEGORY_WEAPON_SUM_MEAN',
 'PREVIOUS_NAME_GOODS_CATEGORY_WEAPON_SUM_SUM',
 'PREVIOUS_NAME_PORTFOLIO_CARDS_SUM_MAX',
 'PREVIOUS_NAME_PORTFOLIO_CARDS_SUM_MEAN',
 'PREVIOUS_NAME_PORTFOLIO_CARDS_SUM_SUM',
 'PREVIOUS_NAME_PORTFOLIO_CASH_SUM_MIN',
 'PREVIOUS_NAME_PORTFOLIO_CASH_SUM_MAX',
 'PREVIOUS_NAME_PORTFOLIO_CASH_SUM_MEAN',
 'PREVIOUS_NAME_PORTFOLIO_CASH_SUM_SUM',
 'PREVIOUS_NAME_PORTFOLIO_POS_SUM_MIN',
 'PREVIOUS_NAME_PORTFOLIO_POS_SUM_MAX',
 'PREVIOUS_NAME_PORTFOLIO_POS_SUM_MEAN',
 'PREVIOUS_NAME_PORTFOLIO_POS_SUM_SUM',
 'PREVIOUS_NAME_PORTFOLIO_XNA_SUM_MIN',
 'PREVIOUS_NAME_PORTFOLIO_XNA_SUM_MAX',
 'PREVIOUS_NAME_PORTFOLIO_XNA_SUM_MEAN',
 'PREVIOUS_NAME_PRODUCT_TYPE_XNA_SUM_MIN',
 'PREVIOUS_NAME_PRODUCT_TYPE_XNA_SUM_MAX',
 'PREVIOUS_NAME_PRODUCT_TYPE_XNA_SUM_MEAN',
 'PREVIOUS_NAME_PRODUCT_TYPE_XNA_SUM_SUM',
 'PREVIOUS_NAME_PRODUCT_TYPE_WALK-IN_SUM_MIN',
 'PREVIOUS_NAME_PRODUCT_TYPE_WALK-IN_SUM_MAX',
 'PREVIOUS_NAME_PRODUCT_TYPE_WALK-IN_SUM_MEAN',
 'PREVIOUS_NAME_PRODUCT_TYPE_WALK-IN_SUM_SUM',
 'PREVIOUS_NAME_PRODUCT_TYPE_X-SELL_SUM_MIN',
 'PREVIOUS_NAME_PRODUCT_TYPE_X-SELL_SUM_MAX',
 'PREVIOUS_NAME_PRODUCT_TYPE_X-SELL_SUM_MEAN',
 'PREVIOUS_NAME_PRODUCT_TYPE_X-SELL_SUM_SUM',
 'PREVIOUS_CHANNEL_TYPE_AP+ (CASH LOAN)_SUM_MIN',
 'PREVIOUS_CHANNEL_TYPE_AP+ (CASH LOAN)_SUM_MAX',
 'PREVIOUS_CHANNEL_TYPE_AP+ (CASH LOAN)_SUM_MEAN',
 'PREVIOUS_CHANNEL_TYPE_AP+ (CASH LOAN)_SUM_SUM',
 'PREVIOUS_CHANNEL_TYPE_CAR DEALER_SUM_MIN',
 'PREVIOUS_CHANNEL_TYPE_CAR DEALER_SUM_MEAN',
 'PREVIOUS_CHANNEL_TYPE_CAR DEALER_SUM_SUM',
 'PREVIOUS_CHANNEL_TYPE_CHANNEL OF CORPORATE SALES_SUM_MIN',
 'PREVIOUS_CHANNEL_TYPE_CHANNEL OF CORPORATE SALES_SUM_MAX',
 'PREVIOUS_CHANNEL_TYPE_CHANNEL OF CORPORATE SALES_SUM_MEAN',
 'PREVIOUS_CHANNEL_TYPE_CHANNEL OF CORPORATE SALES_SUM_SUM',
 'PREVIOUS_CHANNEL_TYPE_CONTACT CENTER_SUM_MIN',
 'PREVIOUS_CHANNEL_TYPE_CONTACT CENTER_SUM_MAX',
 'PREVIOUS_CHANNEL_TYPE_CONTACT CENTER_SUM_MEAN',
 'PREVIOUS_CHANNEL_TYPE_CONTACT CENTER_SUM_SUM',
 'PREVIOUS_CHANNEL_TYPE_COUNTRY-WIDE_SUM_MIN',
 'PREVIOUS_CHANNEL_TYPE_COUNTRY-WIDE_SUM_MAX',
 'PREVIOUS_CHANNEL_TYPE_COUNTRY-WIDE_SUM_MEAN',
 'PREVIOUS_CHANNEL_TYPE_COUNTRY-WIDE_SUM_SUM',
 'PREVIOUS_CHANNEL_TYPE_CREDIT AND CASH OFFICES_SUM_MIN',
 'PREVIOUS_CHANNEL_TYPE_CREDIT AND CASH OFFICES_SUM_MAX',
 'PREVIOUS_CHANNEL_TYPE_CREDIT AND CASH OFFICES_SUM_MEAN',
 'PREVIOUS_CHANNEL_TYPE_REGIONAL / LOCAL_SUM_MIN',
 'PREVIOUS_CHANNEL_TYPE_REGIONAL / LOCAL_SUM_MAX',
 'PREVIOUS_CHANNEL_TYPE_REGIONAL / LOCAL_SUM_MEAN',
 'PREVIOUS_CHANNEL_TYPE_REGIONAL / LOCAL_SUM_SUM',
 'PREVIOUS_CHANNEL_TYPE_STONE_SUM_MIN',
 'PREVIOUS_CHANNEL_TYPE_STONE_SUM_MAX',
 'PREVIOUS_CHANNEL_TYPE_STONE_SUM_MEAN',
 'PREVIOUS_CHANNEL_TYPE_STONE_SUM_SUM',
 'PREVIOUS_NAME_SELLER_INDUSTRY_AUTO TECHNOLOGY_SUM_MIN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_AUTO TECHNOLOGY_SUM_MEAN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_AUTO TECHNOLOGY_SUM_SUM',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CLOTHING_SUM_MIN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CLOTHING_SUM_MAX',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CLOTHING_SUM_MEAN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CLOTHING_SUM_SUM',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CONNECTIVITY_SUM_MIN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CONNECTIVITY_SUM_MAX',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CONNECTIVITY_SUM_MEAN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CONNECTIVITY_SUM_SUM',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CONSTRUCTION_SUM_MIN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CONSTRUCTION_SUM_MAX',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CONSTRUCTION_SUM_MEAN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CONSTRUCTION_SUM_SUM',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CONSUMER ELECTRONICS_SUM_MIN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CONSUMER ELECTRONICS_SUM_MAX',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CONSUMER ELECTRONICS_SUM_MEAN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_CONSUMER ELECTRONICS_SUM_SUM',
 'PREVIOUS_NAME_SELLER_INDUSTRY_FURNITURE_SUM_MIN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_FURNITURE_SUM_MAX',
 'PREVIOUS_NAME_SELLER_INDUSTRY_FURNITURE_SUM_MEAN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_FURNITURE_SUM_SUM',
 'PREVIOUS_NAME_SELLER_INDUSTRY_INDUSTRY_SUM_MIN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_INDUSTRY_SUM_MEAN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_INDUSTRY_SUM_SUM',
 'PREVIOUS_NAME_SELLER_INDUSTRY_JEWELRY_SUM_MIN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_JEWELRY_SUM_MEAN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_JEWELRY_SUM_SUM',
 'PREVIOUS_NAME_SELLER_INDUSTRY_MLM PARTNERS_SUM_MIN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_MLM PARTNERS_SUM_MAX',
 'PREVIOUS_NAME_SELLER_INDUSTRY_MLM PARTNERS_SUM_MEAN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_MLM PARTNERS_SUM_SUM',
 'PREVIOUS_NAME_SELLER_INDUSTRY_TOURISM_SUM_MIN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_TOURISM_SUM_MEAN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_TOURISM_SUM_SUM',
 'PREVIOUS_NAME_SELLER_INDUSTRY_XNA_SUM_MIN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_XNA_SUM_MAX',
 'PREVIOUS_NAME_SELLER_INDUSTRY_XNA_SUM_MEAN',
 'PREVIOUS_NAME_SELLER_INDUSTRY_XNA_SUM_SUM',
 'PREVIOUS_NAME_YIELD_GROUP_XNA_SUM_MIN',
 'PREVIOUS_NAME_YIELD_GROUP_XNA_SUM_MAX',
 'PREVIOUS_NAME_YIELD_GROUP_XNA_SUM_MEAN',
 'PREVIOUS_NAME_YIELD_GROUP_XNA_SUM_SUM',
 'PREVIOUS_NAME_YIELD_GROUP_HIGH_SUM_MIN',
 'PREVIOUS_NAME_YIELD_GROUP_HIGH_SUM_MAX',
 'PREVIOUS_NAME_YIELD_GROUP_HIGH_SUM_MEAN',
 'PREVIOUS_NAME_YIELD_GROUP_HIGH_SUM_SUM',
 'PREVIOUS_NAME_YIELD_GROUP_LOW_ACTION_SUM_MIN',
 'PREVIOUS_NAME_YIELD_GROUP_LOW_ACTION_SUM_MAX',
 'PREVIOUS_NAME_YIELD_GROUP_LOW_ACTION_SUM_MEAN',
 'PREVIOUS_NAME_YIELD_GROUP_LOW_ACTION_SUM_SUM',
 'PREVIOUS_NAME_YIELD_GROUP_LOW_NORMAL_SUM_MIN',
 'PREVIOUS_NAME_YIELD_GROUP_LOW_NORMAL_SUM_MAX',
 'PREVIOUS_NAME_YIELD_GROUP_LOW_NORMAL_SUM_MEAN',
 'PREVIOUS_NAME_YIELD_GROUP_LOW_NORMAL_SUM_SUM',
 'PREVIOUS_NAME_YIELD_GROUP_MIDDLE_SUM_MIN',
 'PREVIOUS_NAME_YIELD_GROUP_MIDDLE_SUM_MAX',
 'PREVIOUS_NAME_YIELD_GROUP_MIDDLE_SUM_MEAN',
 'PREVIOUS_NAME_YIELD_GROUP_MIDDLE_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_CARD STREET_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_CARD STREET_SUM_MAX',
 'PREVIOUS_PRODUCT_COMBINATION_CARD STREET_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_CARD STREET_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_CARD X-SELL_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_CARD X-SELL_SUM_MAX',
 'PREVIOUS_PRODUCT_COMBINATION_CARD X-SELL_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_CARD X-SELL_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_CASH_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_CASH_SUM_MAX',
 'PREVIOUS_PRODUCT_COMBINATION_CASH_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_CASH_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_CASH STREET: HIGH_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_CASH STREET: HIGH_SUM_MAX',
 'PREVIOUS_PRODUCT_COMBINATION_CASH STREET: HIGH_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_CASH STREET: HIGH_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_CASH STREET: LOW_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_CASH STREET: LOW_SUM_MAX',
 'PREVIOUS_PRODUCT_COMBINATION_CASH STREET: LOW_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_CASH STREET: LOW_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_CASH STREET: MIDDLE_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_CASH STREET: MIDDLE_SUM_MAX',
 'PREVIOUS_PRODUCT_COMBINATION_CASH STREET: MIDDLE_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_CASH STREET: MIDDLE_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_CASH X-SELL: HIGH_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_CASH X-SELL: HIGH_SUM_MAX',
 'PREVIOUS_PRODUCT_COMBINATION_CASH X-SELL: HIGH_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_CASH X-SELL: HIGH_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_CASH X-SELL: LOW_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_CASH X-SELL: LOW_SUM_MAX',
 'PREVIOUS_PRODUCT_COMBINATION_CASH X-SELL: LOW_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_CASH X-SELL: LOW_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_CASH X-SELL: MIDDLE_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_CASH X-SELL: MIDDLE_SUM_MAX',
 'PREVIOUS_PRODUCT_COMBINATION_CASH X-SELL: MIDDLE_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_CASH X-SELL: MIDDLE_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_POS HOUSEHOLD WITH INTEREST_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_POS HOUSEHOLD WITH INTEREST_SUM_MAX',
 'PREVIOUS_PRODUCT_COMBINATION_POS HOUSEHOLD WITH INTEREST_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_POS HOUSEHOLD WITH INTEREST_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_POS HOUSEHOLD WITHOUT INTEREST_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_POS HOUSEHOLD WITHOUT INTEREST_SUM_MAX',
 'PREVIOUS_PRODUCT_COMBINATION_POS HOUSEHOLD WITHOUT INTEREST_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_POS HOUSEHOLD WITHOUT INTEREST_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_POS INDUSTRY WITH INTEREST_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_POS INDUSTRY WITH INTEREST_SUM_MAX',
 'PREVIOUS_PRODUCT_COMBINATION_POS INDUSTRY WITH INTEREST_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_POS INDUSTRY WITH INTEREST_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_POS INDUSTRY WITHOUT INTEREST_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_POS INDUSTRY WITHOUT INTEREST_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_POS INDUSTRY WITHOUT INTEREST_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_POS MOBILE WITH INTEREST_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_POS MOBILE WITH INTEREST_SUM_MAX',
 'PREVIOUS_PRODUCT_COMBINATION_POS MOBILE WITH INTEREST_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_POS MOBILE WITH INTEREST_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_POS MOBILE WITHOUT INTEREST_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_POS MOBILE WITHOUT INTEREST_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_POS MOBILE WITHOUT INTEREST_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_POS OTHER WITH INTEREST_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_POS OTHER WITH INTEREST_SUM_MAX',
 'PREVIOUS_PRODUCT_COMBINATION_POS OTHER WITH INTEREST_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_POS OTHER WITH INTEREST_SUM_SUM',
 'PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_MIN',
 'PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_MEAN',
 'PREVIOUS_PRODUCT_COMBINATION_POS OTHERS WITHOUT INTEREST_SUM_SUM',
 'prev_applications_counts',
 'TARGET',
 'SK_ID_CURR',
 'CREDIT_CARD_BALANCE_MONTHS_BALANCE_MAX_SUM',
 'CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MIN_SUM',
 'CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MEAN_SUM',
 'CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_SUM_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_ATM_CURRENT_MIN_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_ATM_CURRENT_MAX_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_ATM_CURRENT_MEAN_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_ATM_CURRENT_SUM_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MIN_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MAX_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MEAN_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_OTHER_CURRENT_MIN_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_OTHER_CURRENT_MAX_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_OTHER_CURRENT_MEAN_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_OTHER_CURRENT_SUM_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_POS_CURRENT_MIN_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_POS_CURRENT_MAX_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_POS_CURRENT_MEAN_SUM',
 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_POS_CURRENT_SUM_SUM',
 'CREDIT_CARD_BALANCE_AMT_INST_MIN_REGULARITY_MIN_SUM',
 'CREDIT_CARD_BALANCE_AMT_PAYMENT_CURRENT_MIN_SUM',
 'CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MIN_SUM',
 'CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MAX_SUM',
 'CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MEAN_SUM',
 'CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_SUM_SUM',
 'CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MIN_SUM',
 'CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MAX_SUM',
 'CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MEAN_SUM',
 'CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_SUM_SUM',
 'CREDIT_CARD_BALANCE_CNT_DRAWINGS_ATM_CURRENT_MIN_SUM',
 'CREDIT_CARD_BALANCE_CNT_DRAWINGS_ATM_CURRENT_MAX_SUM',
 'CREDIT_CARD_BALANCE_CNT_DRAWINGS_ATM_CURRENT_MEAN_SUM',
 'CREDIT_CARD_BALANCE_CNT_DRAWINGS_ATM_CURRENT_SUM_SUM',
 'CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MIN_SUM',
 'CREDIT_CARD_BALANCE_CNT_DRAWINGS_OTHER_CURRENT_MIN_SUM',
 'CREDIT_CARD_BALANCE_CNT_DRAWINGS_OTHER_CURRENT_MAX_SUM',
 'CREDIT_CARD_BALANCE_CNT_DRAWINGS_OTHER_CURRENT_MEAN_SUM',
 'CREDIT_CARD_BALANCE_CNT_DRAWINGS_OTHER_CURRENT_SUM_SUM',
 'CREDIT_CARD_BALANCE_CNT_DRAWINGS_POS_CURRENT_MIN_SUM',
 'CREDIT_CARD_BALANCE_CNT_DRAWINGS_POS_CURRENT_MAX_SUM',
 'CREDIT_CARD_BALANCE_CNT_DRAWINGS_POS_CURRENT_MEAN_SUM',
 'CREDIT_CARD_BALANCE_CNT_DRAWINGS_POS_CURRENT_SUM_SUM',
 'CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MIN_SUM',
 'CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM_SUM',
 'CREDIT_CARD_BALANCE_SK_DPD_MIN_MIN',
 'CREDIT_CARD_BALANCE_SK_DPD_MIN_MAX',
 'CREDIT_CARD_BALANCE_SK_DPD_MIN_MEAN',
 'CREDIT_CARD_BALANCE_SK_DPD_MIN_SUM',
 'CREDIT_CARD_BALANCE_SK_DPD_SUM_SUM',
 'CREDIT_CARD_BALANCE_SK_DPD_DEF_MIN_MIN',
 'CREDIT_CARD_BALANCE_SK_DPD_DEF_MIN_MAX',
 'CREDIT_CARD_BALANCE_SK_DPD_DEF_MIN_MEAN',
 'CREDIT_CARD_BALANCE_SK_DPD_DEF_MIN_SUM',
 'CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_ACTIVE_MEAN_SUM',
 'CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_ACTIVE_SUM_SUM',
 'CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_APPROVED_SUM_SUM',
 'CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_COMPLETED_MEAN_SUM',
 'CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_COMPLETED_SUM_SUM',
 'CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_DEMAND_SUM_SUM',
 'CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_REFUSED_SUM_SUM',
 'CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_SENT PROPOSAL_SUM_SUM',
 'CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_SIGNED_MEAN_SUM',
 'CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS_SIGNED_SUM_SUM',
 'POS_NAME_CONTRACT_STATUS_ACTIVE_MEAN_MIN',
 'POS_NAME_CONTRACT_STATUS_ACTIVE_MEAN_MAX',
 'POS_NAME_CONTRACT_STATUS_ACTIVE_MEAN_MEAN',
 'POS_NAME_CONTRACT_STATUS_ACTIVE_MEAN_SUM',
 'POS_NAME_CONTRACT_STATUS_ACTIVE_SUM_MIN',
 'POS_NAME_CONTRACT_STATUS_ACTIVE_SUM_MAX',
 'POS_NAME_CONTRACT_STATUS_ACTIVE_SUM_MEAN',
 'POS_NAME_CONTRACT_STATUS_ACTIVE_SUM_SUM',
 'POS_NAME_CONTRACT_STATUS_AMORTIZED DEBT_MEAN_MIN',
 'POS_NAME_CONTRACT_STATUS_AMORTIZED DEBT_SUM_MIN',
 'POS_NAME_CONTRACT_STATUS_AMORTIZED DEBT_SUM_SUM',
 'POS_NAME_CONTRACT_STATUS_APPROVED_MEAN_MIN',
 'POS_NAME_CONTRACT_STATUS_APPROVED_MEAN_MEAN',
 'POS_NAME_CONTRACT_STATUS_APPROVED_MEAN_SUM',
 'POS_NAME_CONTRACT_STATUS_APPROVED_SUM_MIN',
 'POS_NAME_CONTRACT_STATUS_APPROVED_SUM_MEAN',
 'POS_NAME_CONTRACT_STATUS_APPROVED_SUM_SUM',
 'POS_NAME_CONTRACT_STATUS_CANCELED_SUM_MEAN',
 'POS_NAME_CONTRACT_STATUS_CANCELED_SUM_SUM',
 'POS_NAME_CONTRACT_STATUS_COMPLETED_MEAN_MIN',
 'POS_NAME_CONTRACT_STATUS_COMPLETED_MEAN_MAX',
 'POS_NAME_CONTRACT_STATUS_COMPLETED_MEAN_MEAN',
 'POS_NAME_CONTRACT_STATUS_COMPLETED_MEAN_SUM',
 'POS_NAME_CONTRACT_STATUS_COMPLETED_SUM_MIN',
 'POS_NAME_CONTRACT_STATUS_COMPLETED_SUM_MAX',
 'POS_NAME_CONTRACT_STATUS_COMPLETED_SUM_MEAN',
 'POS_NAME_CONTRACT_STATUS_COMPLETED_SUM_SUM',
 'POS_NAME_CONTRACT_STATUS_DEMAND_MEAN_MEAN',
 'POS_NAME_CONTRACT_STATUS_DEMAND_MEAN_SUM',
 'POS_NAME_CONTRACT_STATUS_DEMAND_SUM_MIN',
 'POS_NAME_CONTRACT_STATUS_DEMAND_SUM_MEAN',
 'POS_NAME_CONTRACT_STATUS_DEMAND_SUM_SUM',
 'POS_NAME_CONTRACT_STATUS_RETURNED TO THE STORE_MEAN_MIN',
 'POS_NAME_CONTRACT_STATUS_RETURNED TO THE STORE_MEAN_MEAN',
 'POS_NAME_CONTRACT_STATUS_RETURNED TO THE STORE_MEAN_SUM',
 'POS_NAME_CONTRACT_STATUS_RETURNED TO THE STORE_SUM_MIN',
 'POS_NAME_CONTRACT_STATUS_RETURNED TO THE STORE_SUM_MEAN',
 'POS_NAME_CONTRACT_STATUS_RETURNED TO THE STORE_SUM_SUM',
 'POS_NAME_CONTRACT_STATUS_SIGNED_MEAN_MIN',
 'POS_NAME_CONTRACT_STATUS_SIGNED_MEAN_MEAN',
 'POS_NAME_CONTRACT_STATUS_SIGNED_MEAN_SUM',
 'POS_NAME_CONTRACT_STATUS_SIGNED_SUM_MIN',
 'POS_NAME_CONTRACT_STATUS_SIGNED_SUM_MEAN',
 'POS_NAME_CONTRACT_STATUS_SIGNED_SUM_SUM',
 'POS_NAME_CONTRACT_STATUS_XNA_MEAN_MIN',
 'POS_NAME_CONTRACT_STATUS_XNA_MEAN_MAX',
 'POS_NAME_CONTRACT_STATUS_XNA_MEAN_MEAN',
 'POS_NAME_CONTRACT_STATUS_XNA_MEAN_SUM',
 'POS_NAME_CONTRACT_STATUS_XNA_SUM_MIN',
 'POS_NAME_CONTRACT_STATUS_XNA_SUM_MAX',
 'POS_NAME_CONTRACT_STATUS_XNA_SUM_MEAN',
 'POS_NAME_CONTRACT_STATUS_XNA_SUM_SUM',
 'POS_MONTHS_BALANCE_MEAN_MIN',
 'POS_MONTHS_BALANCE_MEAN_MAX',
 'POS_MONTHS_BALANCE_MEAN_MEAN',
 'POS_MONTHS_BALANCE_MEAN_SUM',
 'POS_MONTHS_BALANCE_SUM_MIN',
 'POS_MONTHS_BALANCE_SUM_MAX',
 'POS_MONTHS_BALANCE_SUM_MEAN',
 'POS_MONTHS_BALANCE_SUM_SUM',
 'POS_CNT_INSTALMENT_MIN_MIN',
 'POS_CNT_INSTALMENT_MIN_MAX',
 'POS_CNT_INSTALMENT_MIN_MEAN',
 'POS_CNT_INSTALMENT_FUTURE_MIN_MIN',
 'POS_CNT_INSTALMENT_FUTURE_MIN_MEAN',
 'POS_CNT_INSTALMENT_FUTURE_MIN_SUM',
 'POS_CNT_INSTALMENT_FUTURE_MEAN_MIN',
 'POS_CNT_INSTALMENT_FUTURE_MEAN_MAX',
 'POS_CNT_INSTALMENT_FUTURE_MEAN_MEAN',
 'POS_CNT_INSTALMENT_FUTURE_MEAN_SUM',
 'POS_CNT_INSTALMENT_FUTURE_SUM_MIN',
 'POS_CNT_INSTALMENT_FUTURE_SUM_MAX',
 'POS_CNT_INSTALMENT_FUTURE_SUM_MEAN',
 'POS_CNT_INSTALMENT_FUTURE_SUM_SUM',
 'POS_SK_DPD_MIN_MIN',
 'POS_SK_DPD_MIN_MEAN',
 'POS_SK_DPD_MIN_SUM',
 'POS_SK_DPD_MAX_MIN',
 'POS_SK_DPD_MAX_MEAN',
 'POS_SK_DPD_SUM_MIN',
 'POS_SK_DPD_SUM_MEAN',
 'POS_SK_DPD_SUM_SUM',
 'POS_SK_DPD_DEF_MIN_MIN',
 'POS_SK_DPD_DEF_MIN_SUM',
 'POS_SK_DPD_DEF_MAX_MIN',
 'POS_SK_DPD_DEF_SUM_MIN',
 'POS_SK_DPD_DEF_SUM_SUM',
 'INSTALLMENTS_NUM_INSTALMENT_VERSION_MIN_MIN',
 'INSTALLMENTS_NUM_INSTALMENT_VERSION_MIN_MAX',
 'INSTALLMENTS_NUM_INSTALMENT_VERSION_MIN_MEAN',
 'INSTALLMENTS_NUM_INSTALMENT_VERSION_MAX_MIN',
 'INSTALLMENTS_NUM_INSTALMENT_VERSION_MEAN_MIN',
 'INSTALLMENTS_NUM_INSTALMENT_VERSION_MEAN_MAX',
 'INSTALLMENTS_NUM_INSTALMENT_VERSION_MEAN_MEAN',
 'INSTALLMENTS_NUM_INSTALMENT_VERSION_MEAN_SUM',
 'INSTALLMENTS_NUM_INSTALMENT_VERSION_SUM_MIN',
 'INSTALLMENTS_NUM_INSTALMENT_VERSION_SUM_MAX',
 'INSTALLMENTS_NUM_INSTALMENT_VERSION_SUM_MEAN',
 'INSTALLMENTS_NUM_INSTALMENT_VERSION_SUM_SUM',
 'INSTALLMENTS_NUM_INSTALMENT_NUMBER_MIN_MIN',
 'INSTALLMENTS_NUM_INSTALMENT_NUMBER_MIN_MAX',
 'INSTALLMENTS_NUM_INSTALMENT_NUMBER_MIN_MEAN',
 'INSTALLMENTS_NUM_INSTALMENT_NUMBER_MIN_SUM',
 'INSTALLMENTS_NUM_INSTALMENT_NUMBER_MAX_MIN',
 'INSTALLMENTS_NUM_INSTALMENT_NUMBER_MAX_MEAN',
 'INSTALLMENTS_NUM_INSTALMENT_NUMBER_MEAN_MIN',
 'INSTALLMENTS_NUM_INSTALMENT_NUMBER_SUM_MIN',
 'INSTALLMENTS_NUM_INSTALMENT_NUMBER_SUM_MEAN',
 'INSTALLMENTS_DAYS_INSTALMENT_MEAN_MIN',
 'INSTALLMENTS_DAYS_INSTALMENT_MEAN_MAX',
 'INSTALLMENTS_DAYS_INSTALMENT_MEAN_MEAN',
 'INSTALLMENTS_DAYS_ENTRY_PAYMENT_MEAN_SUM',
 'INSTALLMENTS_DAYS_ENTRY_PAYMENT_SUM_MAX',
 'INSTALLMENTS_DAYS_ENTRY_PAYMENT_SUM_MEAN',
 'INSTALLMENTS_DAYS_ENTRY_PAYMENT_SUM_SUM',
 'INSTALLMENTS_AMT_INSTALMENT_MIN_MIN',
 'INSTALLMENTS_AMT_INSTALMENT_MIN_MEAN',
 'INSTALLMENTS_AMT_INSTALMENT_MAX_MIN',
 'INSTALLMENTS_AMT_INSTALMENT_MAX_MEAN',
 'INSTALLMENTS_AMT_INSTALMENT_MEAN_MIN',
 'INSTALLMENTS_AMT_INSTALMENT_MEAN_MEAN',
 'INSTALLMENTS_AMT_PAYMENT_MIN_SUM',
 'INSTALLMENTS_AMT_PAYMENT_MAX_SUM',
 'INSTALLMENTS_AMT_PAYMENT_MEAN_SUM',
 'INSTALLMENTS_AMT_PAYMENT_SUM_MIN',
 'INSTALLMENTS_AMT_PAYMENT_SUM_MAX',
 'INSTALLMENTS_AMT_PAYMENT_SUM_MEAN',
 'INSTALLMENTS_AMT_PAYMENT_SUM_SUM',
 'bureau_CREDIT_ACTIVE_Active_sum',
 'bureau_CREDIT_ACTIVE_Bad debt_mean',
 'bureau_CREDIT_ACTIVE_Closed_mean',
 'bureau_CREDIT_ACTIVE_Sold_sum',
 'bureau_CREDIT_ACTIVE_Sold_mean',
 'bureau_CREDIT_CURRENCY_currency 2_sum',
 'bureau_CREDIT_CURRENCY_currency 2_mean',
 'bureau_CREDIT_CURRENCY_currency 3_sum',
 'bureau_CREDIT_CURRENCY_currency 3_mean',
 'bureau_CREDIT_CURRENCY_currency 4_mean',
 'bureau_CREDIT_TYPE_Another type of loan_sum',
 'bureau_CREDIT_TYPE_Another type of loan_mean',
 'bureau_CREDIT_TYPE_Car loan_sum',
 'bureau_CREDIT_TYPE_Car loan_mean',
 'bureau_CREDIT_TYPE_Cash loan (non-earmarked)_mean',
 'bureau_CREDIT_TYPE_Credit card_sum',
 'bureau_CREDIT_TYPE_Credit card_mean',
 'bureau_CREDIT_TYPE_Interbank credit_mean',
 'bureau_CREDIT_TYPE_Loan for business development_sum',
 'bureau_CREDIT_TYPE_Loan for business development_mean',
 'bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_mean',
 'bureau_CREDIT_TYPE_Loan for the purchase of equipment_sum',
 'bureau_CREDIT_TYPE_Loan for the purchase of equipment_mean',
 'bureau_CREDIT_TYPE_Loan for working capital replenishment_sum',
 'bureau_CREDIT_TYPE_Loan for working capital replenishment_mean',
 'bureau_CREDIT_TYPE_Microloan_sum',
 'bureau_CREDIT_TYPE_Microloan_mean',
 'bureau_CREDIT_TYPE_Mobile operator loan_mean',
 'bureau_CREDIT_TYPE_Mortgage_sum',
 'bureau_CREDIT_TYPE_Mortgage_mean',
 'bureau_CREDIT_TYPE_Real estate loan_sum',
 'bureau_CREDIT_TYPE_Real estate loan_mean',
 'bureau_CREDIT_TYPE_Unknown type of loan_sum',
 'bureau_CREDIT_TYPE_Unknown type of loan_mean',
 'bureau_DAYS_CREDIT_mean',
 'bureau_DAYS_CREDIT_max',
 'bureau_CREDIT_DAY_OVERDUE_min',
 'bureau_CREDIT_DAY_OVERDUE_mean',
 'bureau_CREDIT_DAY_OVERDUE_sum',
 'bureau_DAYS_CREDIT_ENDDATE_min',
 'bureau_DAYS_CREDIT_ENDDATE_mean',
 'bureau_DAYS_CREDIT_ENDDATE_sum',
 'bureau_DAYS_ENDDATE_FACT_mean',
 'bureau_DAYS_ENDDATE_FACT_max',
 'bureau_AMT_CREDIT_MAX_OVERDUE_sum',
 ...]

hgbt#

%%time
from sklearn.ensemble import HistGradientBoostingClassifier

hist_gradient_boost_model = HistGradientBoostingClassifier(
    max_iter = 100, # number of trees (boosting iterations)
    learning_rate = 0.1,
    max_depth = 5,
)
hist_gradient_boost_model.fit(train_features, train_labels)
CPU times: total: 7min 19s
Wall time: 53.1 s
HistGradientBoostingClassifier(max_depth=5)
train_prob = hist_gradient_boost_model.predict_proba(train_features)
plot_roc(train_labels, train_prob[:,1], 'hist gb')
../../_images/8b6059464b30c7a82c4bcce4c5f90f0a20ff2d52f3ca733a4b1b51ff6cb45f42.png
import time
import os

def submit(ids, pred, name, feature_count=None):
    """
    Build a submission DataFrame and save it as a timestamped CSV.

    ids: SK_ID_CURR values of the test set
    pred: predicted probabilities from the model
    name: experiment tag (e.g. 'lgb_v1', 'baseline')
    feature_count: optional, number of features the model used
    """
    # 1. Build the submission DataFrame
    submit_df = pd.DataFrame({
        'SK_ID_CURR': ids,
        'TARGET': pred
    })

    # 2. Generate a timestamp (format: 0213_1530)
    timestamp = time.strftime("%m%d_%H%M")

    # 3. Build the file name
    # Format: 0213_1530_lgb_v1_f542.csv
    f_str = f"_f{feature_count}" if feature_count is not None else ""
    filename = f"{timestamp}_{name}{f_str}.csv"

    # 4. Make sure the output directory exists
    os.makedirs('submissions', exist_ok=True)

    save_path = os.path.join('submissions', filename)

    # 5. Save the CSV without the index column
    submit_df.to_csv(save_path, index=False)

    return submit_df
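# Class probabilities for the test set; column 1 is P(TARGET=1).
# Assumed: test_features has the same columns as train_features.
hist_gradient_boost_model_pred = hist_gradient_boost_model.predict_proba(test_features)
submit_df = submit(test['SK_ID_CURR'], hist_gradient_boost_model_pred[:, 1],
    name='hgbm_baseline',
    feature_count=train_features.shape[1]
    )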
submit_df
SK_ID_CURR TARGET
0 100001 0.041575
1 100005 0.161436
2 100013 0.020080
3 100028 0.029833
4 100038 0.178684
... ... ...
48739 456221 0.067602
48740 456222 0.085417
48741 456223 0.022706
48742 456224 0.053502
48743 456250 0.184635

48744 rows × 2 columns

Score: 74, which does not look quite right.

lightgbm#

train_features_cleaned = clean_names(train_features)
test_features_cleaned = clean_names(test_features)
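
The definition of `clean_names` is not shown in this part of the notebook. A plausible reason for the call is that LightGBM raises an error for feature names containing certain special characters (such as commas or colons), while one-hot and aggregated columns can carry spaces, parentheses, and punctuation. The sketch below shows one way such a helper could look; `clean_names_sketch` is a hypothetical name and is not claimed to match the author's implementation.

```python
import re
import pandas as pd

def clean_names_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for clean_names: make column names model-safe."""
    seen = {}
    new_cols = []
    for col in df.columns:
        # Replace every run of characters outside [A-Za-z0-9_] with "_".
        name = re.sub(r"[^A-Za-z0-9_]+", "_", str(col)).strip("_")
        # Append a counter if two different originals collapse to the same name.
        count = seen.get(name, 0)
        seen[name] = count + 1
        new_cols.append(name if count == 0 else f"{name}_{count}")
    # set_axis returns a new DataFrame; the input keeps its original labels.
    return df.set_axis(new_cols, axis=1)

# Column names taken from the feature list above; the values are made up.
demo = pd.DataFrame({
    "bureau_CREDIT_TYPE_Car loan_mean": [0.5],
    "bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_mean": [0.0],
    "bureau_CREDIT_CURRENCY_currency 2_sum": [1],
})
print(list(clean_names_sketch(demo).columns))
```

Applying the same function to both the train and test frames keeps their column names aligned, which is what the two calls above rely on.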
%%time
from lightgbm import LGBMClassifier
lgbm_model = LGBMClassifier(
    n_estimators=100,      # number of trees (the counterpart of max_iter above)
    learning_rate=0.1,     # learning rate
    max_depth=3,           # maximum tree depth
    random_state=42,       # makes results reproducible
    n_jobs=-1              # use all CPU cores
)
lgbm_model.fit(train_features_cleaned, train_labels)
[LightGBM] [Info] Number of positive: 24825, number of negative: 282686
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 1.102541 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 73666
[LightGBM] [Info] Number of data points in the train set: 307511, number of used features: 1058
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080729 -> initscore=-2.432486
[LightGBM] [Info] Start training from score -2.432486
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
CPU times: total: 2min 13s
Wall time: 14.7 s
LGBMClassifier(max_depth=3, n_jobs=-1, random_state=42)
lgbm_model_pred = lgbm_model.predict_proba(test_features_cleaned)[:, 1]
submit_df = submit(test['SK_ID_CURR'], lgbm_model_pred, 
    name='lgbm_baseline',
    feature_count=train_features.shape[1]
    )
submit_df
SK_ID_CURR TARGET
0 100001 0.055758
1 100005 0.142900
2 100013 0.029373
3 100028 0.034231
4 100038 0.148776
... ... ...
48739 456221 0.042909
48740 456222 0.063323
48741 456223 0.028442
48742 456224 0.048157
48743 456250 0.175329

48744 rows × 2 columns

features_importance = pd.DataFrame(
    {
        'importance': lgbm_model.feature_importances_,
        'feature': lgbm_model.feature_name_
    }
)
def plot_features_importance(df):
    df = df.sort_values(by='importance', ascending=False).head(20)
    plt.figure(figsize=(10,6))
    sns.barplot(
        data = df,
        x= 'importance',
        y = 'feature'
    )
    plt.tight_layout()
plot_features_importance(features_importance)
../../_images/16941c248f75395b82db466eecdc27059dae5ada1b47ffde3576affcc2653a2c.png

As the plot shows, many of our engineered features turned out to be useful.

Score: 74
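
To put a number on how many of the top-ranked features come from the aggregated tables, one option is to count them by name prefix. This is a minimal sketch: `engineered_share` is a hypothetical helper, the prefixes `bureau_`, `POS_`, and `INSTALLMENTS_` are taken from the feature list above, and the toy importances are made up for illustration.

```python
import pandas as pd

def engineered_share(importance_df, prefixes, top_n=20):
    """Return (number of engineered features, number of features inspected)."""
    top = importance_df.sort_values(by="importance", ascending=False).head(top_n)
    # Python's str.startswith accepts a tuple of candidate prefixes.
    is_engineered = top["feature"].map(lambda name: name.startswith(tuple(prefixes)))
    return int(is_engineered.sum()), len(top)

# Toy frame with the same two columns as features_importance (values are made up).
toy = pd.DataFrame({
    "feature": [
        "EXT_SOURCE_1",
        "bureau_DAYS_CREDIT_mean",
        "POS_SK_DPD_MAX_MEAN",
        "AMT_CREDIT",
        "INSTALLMENTS_AMT_PAYMENT_SUM_SUM",
    ],
    "importance": [90, 70, 40, 60, 10],
})
prefixes = ("bureau_", "POS_", "INSTALLMENTS_")

n_engineered, n_top = engineered_share(toy, prefixes, top_n=3)
print(f"{n_engineered} of the top {n_top} features are engineered")
```

On the real data the call would be `engineered_share(features_importance, prefixes)`. Note that `feature_importances_` on `LGBMClassifier` counts splits by default, so a feature that is used often but contributes little gain can still rank high.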