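Preliminaries#

Goals#

This notebook walks through a basic machine learning workflow:

  1. Understand the data, the task, and the submission metric

  2. Run a simple EDA: explore how features relate to the target and form hypotheses

    • Encode categorical features where needed

    • Understand how many missing values and outliers the features contain, and how they relate to the target

    • Check the correlations among the important features

  3. Feature engineering: optional, done to improve the model

    • Derive new, more informative features in various ways, and select among features

  4. Baseline models:

    • Simple models that need almost no preprocessing, such as logistic regression and random forest

  5. Improve the model:

    • Use feature engineering and similar techniques

  6. Interpret the model:

    • This is often hard to do; we will do what we can

This notebook uses the two tables application_train and application_test.

The submission consists of predicted probabilities, not class labels.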
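Because the submission expects probabilities, the value to write out is the model's `predict_proba` output for the positive class rather than the hard labels from `predict`. A minimal sketch on synthetic data: the toy features, the `LogisticRegression` model and the ID values are assumptions for illustration; only the `SK_ID_CURR` and `TARGET` column names come from the competition files.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real tables (assumption: two numeric features).
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) > 1.0).astype(int)
X_test = rng.normal(size=(5, 2))
test_ids = np.arange(100001, 100006)  # hypothetical SK_ID_CURR values

model = LogisticRegression().fit(X_train, y_train)

# predict() would return hard 0/1 labels;
# predict_proba()[:, 1] returns the estimated P(TARGET = 1).
proba = model.predict_proba(X_test)[:, 1]

submission = pd.DataFrame({"SK_ID_CURR": test_ids, "TARGET": proba})
print(submission)
```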

Imports#

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.preprocessing import LabelEncoder
plt.rcParams['figure.figsize'] = (5, 3) # default figure width and height
# Use a font with Chinese glyphs (SimHei is common on Windows, Arial Unicode MS on macOS)
plt.rcParams['font.sans-serif'] = ['SimHei'] 
# Keep the minus sign '-' from rendering as a box
plt.rcParams['axes.unicode_minus'] = False 
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

Loading the data#

os.listdir('data')
['application_test.csv',
 'application_train.csv',
 'bureau.csv',
 'bureau_balance.csv',
 'credit_card_balance.csv',
 'HomeCredit_columns_description.csv',
 'installments_payments.csv',
 'log_regress_model_baseline.csv',
 'POS_CASH_balance.csv',
 'previous_application.csv',
 'random_forest_baseline.csv',
 'sample_submission.csv']
application_train = pd.read_csv('data/application_train.csv')
application_test = pd.read_csv('data/application_test.csv')
bureau = pd.read_csv('data/bureau.csv')
bureau_balance = pd.read_csv('data/bureau_balance.csv')
credit_card_balance = pd.read_csv('data/credit_card_balance.csv')
installments_payments = pd.read_csv('data/installments_payments.csv')
previous_application = pd.read_csv('data/previous_application.csv')
pos_cash_balance = pd.read_csv('data/POS_CASH_balance.csv')
application_train.shape
(307511, 122)
application_train.columns.tolist()
['SK_ID_CURR',
 'TARGET',
 'NAME_CONTRACT_TYPE',
 'CODE_GENDER',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'CNT_CHILDREN',
 'AMT_INCOME_TOTAL',
 'AMT_CREDIT',
 'AMT_ANNUITY',
 'AMT_GOODS_PRICE',
 'NAME_TYPE_SUITE',
 'NAME_INCOME_TYPE',
 'NAME_EDUCATION_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_HOUSING_TYPE',
 'REGION_POPULATION_RELATIVE',
 'DAYS_BIRTH',
 'DAYS_EMPLOYED',
 'DAYS_REGISTRATION',
 'DAYS_ID_PUBLISH',
 'OWN_CAR_AGE',
 'FLAG_MOBIL',
 'FLAG_EMP_PHONE',
 'FLAG_WORK_PHONE',
 'FLAG_CONT_MOBILE',
 'FLAG_PHONE',
 'FLAG_EMAIL',
 'OCCUPATION_TYPE',
 'CNT_FAM_MEMBERS',
 'REGION_RATING_CLIENT',
 'REGION_RATING_CLIENT_W_CITY',
 'WEEKDAY_APPR_PROCESS_START',
 'HOUR_APPR_PROCESS_START',
 'REG_REGION_NOT_LIVE_REGION',
 'REG_REGION_NOT_WORK_REGION',
 'LIVE_REGION_NOT_WORK_REGION',
 'REG_CITY_NOT_LIVE_CITY',
 'REG_CITY_NOT_WORK_CITY',
 'LIVE_CITY_NOT_WORK_CITY',
 'ORGANIZATION_TYPE',
 'EXT_SOURCE_1',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'APARTMENTS_AVG',
 'BASEMENTAREA_AVG',
 'YEARS_BEGINEXPLUATATION_AVG',
 'YEARS_BUILD_AVG',
 'COMMONAREA_AVG',
 'ELEVATORS_AVG',
 'ENTRANCES_AVG',
 'FLOORSMAX_AVG',
 'FLOORSMIN_AVG',
 'LANDAREA_AVG',
 'LIVINGAPARTMENTS_AVG',
 'LIVINGAREA_AVG',
 'NONLIVINGAPARTMENTS_AVG',
 'NONLIVINGAREA_AVG',
 'APARTMENTS_MODE',
 'BASEMENTAREA_MODE',
 'YEARS_BEGINEXPLUATATION_MODE',
 'YEARS_BUILD_MODE',
 'COMMONAREA_MODE',
 'ELEVATORS_MODE',
 'ENTRANCES_MODE',
 'FLOORSMAX_MODE',
 'FLOORSMIN_MODE',
 'LANDAREA_MODE',
 'LIVINGAPARTMENTS_MODE',
 'LIVINGAREA_MODE',
 'NONLIVINGAPARTMENTS_MODE',
 'NONLIVINGAREA_MODE',
 'APARTMENTS_MEDI',
 'BASEMENTAREA_MEDI',
 'YEARS_BEGINEXPLUATATION_MEDI',
 'YEARS_BUILD_MEDI',
 'COMMONAREA_MEDI',
 'ELEVATORS_MEDI',
 'ENTRANCES_MEDI',
 'FLOORSMAX_MEDI',
 'FLOORSMIN_MEDI',
 'LANDAREA_MEDI',
 'LIVINGAPARTMENTS_MEDI',
 'LIVINGAREA_MEDI',
 'NONLIVINGAPARTMENTS_MEDI',
 'NONLIVINGAREA_MEDI',
 'FONDKAPREMONT_MODE',
 'HOUSETYPE_MODE',
 'TOTALAREA_MODE',
 'WALLSMATERIAL_MODE',
 'EMERGENCYSTATE_MODE',
 'OBS_30_CNT_SOCIAL_CIRCLE',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'OBS_60_CNT_SOCIAL_CIRCLE',
 'DEF_60_CNT_SOCIAL_CIRCLE',
 'DAYS_LAST_PHONE_CHANGE',
 'FLAG_DOCUMENT_2',
 'FLAG_DOCUMENT_3',
 'FLAG_DOCUMENT_4',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_6',
 'FLAG_DOCUMENT_7',
 'FLAG_DOCUMENT_8',
 'FLAG_DOCUMENT_9',
 'FLAG_DOCUMENT_10',
 'FLAG_DOCUMENT_11',
 'FLAG_DOCUMENT_12',
 'FLAG_DOCUMENT_13',
 'FLAG_DOCUMENT_14',
 'FLAG_DOCUMENT_15',
 'FLAG_DOCUMENT_16',
 'FLAG_DOCUMENT_17',
 'FLAG_DOCUMENT_18',
 'FLAG_DOCUMENT_19',
 'FLAG_DOCUMENT_20',
 'FLAG_DOCUMENT_21',
 'AMT_REQ_CREDIT_BUREAU_HOUR',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_YEAR']

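Field descriptions#

application_train.csv#

application_train has 122 fields. They are grouped by theme below; most can be read as signals of either the ability to repay or the willingness to repay.

  • SK_ID_CURR: loan application ID

  • TARGET: whether the loan defaulted; 1 = default, 0 = repaid normally

Personal information

  • CODE_GENDER: gender

  • DAYS_BIRTH: age, stored as a negative number of days relative to the application date

  • DAYS_REGISTRATION: how many days before the application the client changed their registration (registered address)

  • DAYS_ID_PUBLISH: how many days before the application the client changed their identity document. Hypothesis: the more often it changes, the less stable the personal details and the higher the risk

  • FLAG_MOBIL: provided a mobile phone; FLAG_EMP_PHONE: provided a work phone; FLAG_WORK_PHONE: provided a workplace phone; FLAG_CONT_MOBILE: whether the mobile phone was reachable; FLAG_PHONE: provided a phone; FLAG_EMAIL: provided an email address

    • All are 0/1 flags

  • DAYS_LAST_PHONE_CHANGE: when the client last changed their contact phone, in days

    • Hypothesis: the more often the phone changes, the less stable the personal details and the higher the risk

  • NAME_EDUCATION_TYPE: education level

    • Lower secondary: junior secondary school (lowest); Secondary / secondary special: senior secondary or vocational school (the bulk of the data, about 70%); Incomplete higher: attended university without graduating; Higher education: bachelor's or master's level; Academic degree: doctorate or professor level (highest, but very few people)

  • FLAG_OWN_CAR: whether the client owns a car

    • Probably an important feature

  • OWN_CAR_AGE: age of the car. Hypothesis: an older car suggests a weaker financial situation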

Housing

  • FLAG_OWN_REALTY: whether the client owns a house or flat

  • FONDKAPREMONT_MODE: building management mode

  • HOUSETYPE_MODE: house type

  • TOTALAREA_MODE: total floor area

  • WALLSMATERIAL_MODE: wall material

  • EMERGENCYSTATE_MODE: whether the building is in an emergency (unsafe) state

  • Fields describing the building the client lives in:

    • Averages (*_AVG)

      • APARTMENTS_AVG: apartment area

      • YEARS_BUILD_AVG: age of the building (year of construction). Hypothesis: an older building suggests a weaker financial situation

      • BASEMENTAREA_AVG: basement area

      • YEARS_BEGINEXPLUATATION_AVG: how long the building has been in use

      • COMMONAREA_AVG: common area

      • ELEVATORS_AVG: number of elevators

      • ENTRANCES_AVG: number of entrances

      • FLOORSMAX_AVG: maximum number of floors

      • FLOORSMIN_AVG: minimum number of floors

      • LANDAREA_AVG: land area

      • LIVINGAPARTMENTS_AVG: living apartment area

      • LIVINGAREA_AVG: living area

      • NONLIVINGAPARTMENTS_AVG: non-living apartment area

      • NONLIVINGAREA_AVG: non-living area

    • Modes (*_MODE)

      • Same fields as above, with the _MODE suffix

    • Medians (*_MEDI)

      • Same fields as above, with the _MEDI suffix

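Region and address

  • REGION_POPULATION_RELATIVE: normalized population of the region where the client lives. Larger values indicate bigger cities

  • REGION_RATING_CLIENT: rating of the region where the client lives; REGION_RATING_CLIENT_W_CITY: the same rating taking the city into account

    • Hypothesis: big cities carry lower risk

  • REG_REGION_NOT_LIVE_REGION: flag for whether the registered region differs from the region of residence (1 = different)

  • REG_REGION_NOT_WORK_REGION: flag for whether the registered region differs from the region of work (1 = different)

  • LIVE_REGION_NOT_WORK_REGION: flag for whether the region of residence differs from the region of work (1 = different)

  • REG_CITY_NOT_LIVE_CITY: flag for whether the registered city differs from the city of residence (1 = different)

  • REG_CITY_NOT_WORK_CITY: flag for whether the registered city differs from the city of work (1 = different)

  • LIVE_CITY_NOT_WORK_CITY: flag for whether the city of residence differs from the city of work (1 = different)

    • These capture moves across regions or cities, working away from home, and so on

    • Hypothesis: not living or working where one is registered indicates high mobility and therefore higher risk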

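Social circle

  • OBS_30_CNT_SOCIAL_CIRCLE: number of people in the client's social surroundings with an observable 30 days-past-due status; OBS_60_CNT_SOCIAL_CIRCLE: the same for 60 days past due

  • DEF_30_CNT_SOCIAL_CIRCLE: number of people in the client's social surroundings who defaulted at 30 days past due; DEF_60_CNT_SOCIAL_CIRCLE: the same for 60 days past due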

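Family

  • NAME_FAMILY_STATUS: family status (married, single, divorced, etc.)

  • CNT_CHILDREN: number of children; CNT_FAM_MEMBERS: number of family members

    • Hypothesis: the heavier the family burden, the weaker the ability to absorb shocks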

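Financial capacity

  • AMT_INCOME_TOTAL: annual income

  • AMT_CREDIT: loan amount

  • AMT_ANNUITY: loan annuity, i.e. how much is repaid each month

  • AMT_GOODS_PRICE: price of the goods the loan is taken for. For example, the 1000 price tag of a computer, as opposed to how much the bank actually lends (AMT_CREDIT)

  • NAME_INCOME_TYPE: income type

    • Working, pensioner, commercial associate, etc.; repayment ability differs by income type

  • NAME_HOUSING_TYPE: housing type

    • Own home, rented flat, etc.; repayment ability differs by housing type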

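Credit history and external scores

  • EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3: scores for the client from external data sources (other credit agencies)

    • These are probably very important

  • Number of credit bureau enquiries:

    • AMT_REQ_CREDIT_BUREAU_HOUR: enquiries to the credit bureau in the past hour

    • AMT_REQ_CREDIT_BUREAU_DAY: enquiries in the past day

    • AMT_REQ_CREDIT_BUREAU_WEEK: enquiries in the past week

    • AMT_REQ_CREDIT_BUREAU_MON: enquiries in the past month

    • AMT_REQ_CREDIT_BUREAU_QRT: enquiries in the past quarter

    • AMT_REQ_CREDIT_BUREAU_YEAR: enquiries in the past year

      • Hypothesis: more enquiries suggest a more urgent need for money and higher risk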

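Employment

  • DAYS_EMPLOYED: length of the current employment, in days

    • Negative values count days back from the application date

  • OCCUPATION_TYPE: occupation type

  • NAME_INCOME_TYPE: income type (already listed above)

  • ORGANIZATION_TYPE: type of organization the client works for

    • State-owned, private, public institution, etc.; repayment ability differs by organization type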

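Application behaviour

  • NAME_CONTRACT_TYPE: contract type, cash loan or revolving loan. A cash loan pays out the full amount up front; a revolving loan is a credit line that can be drawn and repaid repeatedly.

  • NAME_TYPE_SUITE: who accompanied the client when applying

  • WEEKDAY_APPR_PROCESS_START: day of the week of the application

  • HOUR_APPR_PROCESS_START: hour of the application, 0-23

    • Hypothesis: applications made at night or at the weekend may carry more risk

  • FLAG_DOCUMENT_2 to FLAG_DOCUMENT_21: whether the client provided each of documents 2 to 21. All are 0/1 flags

    • Hypothesis: the more documents provided, the more complete the file and the lower the risk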

bureau.csv#

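bureau has 17 fields. Each row is one loan in the borrower's history with other lenders.

  • SK_ID_CURR: loan application ID. Matches SK_ID_CURR in application_train

  • SK_ID_BUREAU: ID of the loan at the other lender. Links to SK_ID_BUREAU in bureau_balance

  • CREDIT_ACTIVE: loan status (active, closed, etc.)

  • CREDIT_CURRENCY: loan currency

  • CREDIT_TYPE: loan type (credit card, car loan, etc.)

  • Dates

    • DAYS_CREDIT: how many days before the current application this loan was applied for

    • DAYS_CREDIT_ENDDATE: days until the loan's scheduled end date, relative to the current application

    • DAYS_ENDDATE_FACT: days since the loan actually ended, relative to the current application. Only meaningful for closed loans

    • DAYS_CREDIT_UPDATE: days since the last update of this loan's record, relative to the current application

  • Limits, amounts, and overdue

    • CREDIT_DAY_OVERDUE: number of days the loan is overdue

    • AMT_CREDIT_MAX_OVERDUE: largest overdue amount in the loan's history

    • CNT_CREDIT_PROLONG: number of times the loan was prolonged, i.e. the repayment period extended

    • AMT_CREDIT_SUM: total credit amount of the loan

    • AMT_CREDIT_SUM_DEBT: outstanding debt on the loan

    • AMT_CREDIT_SUM_LIMIT: credit limit of the loan

    • AMT_CREDIT_SUM_OVERDUE: overdue amount on the loan

    • AMT_ANNUITY: annuity of the loan, i.e. how much is repaid each month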

bureau_balance.csv#

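bureau_balance has 3 fields. It records the monthly status of each loan in bureau.

  • SK_ID_BUREAU: ID of the loan at the other lender. Links to SK_ID_BUREAU in bureau

  • MONTHS_BALANCE: month of the record relative to the application date (0 is the most recent month, negative values go further back)

  • STATUS: loan status in that month

    • 0: no days past due

    • 1: 1-30 days past due

    • 2: 31-60 days past due

    • 3: 61-90 days past due

    • 4: 91-120 days past due

    • 5: more than 120 days past due, or sold or written off

    • C: closed

    • X: status unknown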
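Since bureau_balance has one row per loan per month, it usually has to be aggregated to one row per loan before it can be joined to bureau. A minimal sketch on a made-up miniature of the table; the values and the output column names (`worst_dpd_bucket`, `months_past_due`) are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical miniature of bureau_balance (values invented for illustration).
bb = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2, 3],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1, 0],
    "STATUS": ["C", "0", "1", "X", "3", "0"],
})

# Digit codes become an integer days-past-due bucket; C and X become NaN.
bb["DPD_BUCKET"] = pd.to_numeric(bb["STATUS"], errors="coerce")

per_loan = bb.groupby("SK_ID_BUREAU").agg(
    months_observed=("MONTHS_BALANCE", "size"),
    worst_dpd_bucket=("DPD_BUCKET", "max"),
    months_past_due=("DPD_BUCKET", lambda s: int((s > 0).sum())),
)
print(per_loan)
```

The result is indexed by SK_ID_BUREAU, so it can be merged onto bureau and then aggregated again per SK_ID_CURR.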

bureau_balance.head()
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C

credit_card_balance.csv#

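credit_card_balance has 23 fields. It records the monthly credit card statements.

  • SK_ID_CURR: loan application ID. Matches SK_ID_CURR in application_train

  • SK_ID_PREV: loan ID. Links to SK_ID_PREV in previous_application

  • MONTHS_BALANCE: month of the record relative to the application date

  • NAME_CONTRACT_STATUS: status of the credit card contract

    • Active

    • Closed

    • Sold

    • Demand

  • AMT_BALANCE: statement balance

  • AMT_CREDIT_LIMIT_ACTUAL: actual credit limit

  • AMT_DRAWINGS_ATM_CURRENT: amount drawn at ATMs during the month

  • AMT_DRAWINGS_CURRENT: amount drawn during the month

  • AMT_DRAWINGS_OTHER_CURRENT: amount of other drawings during the month

  • AMT_DRAWINGS_POS_CURRENT: amount drawn via POS during the month

  • AMT_INST_MIN_REGULARITY: minimum repayment amount

  • AMT_PAYMENT_CURRENT: amount the client paid during the month

  • AMT_PAYMENT_TOTAL_CURRENT: total amount the client paid during the month

  • AMT_RECEIVABLE_PRINCIPAL: principal receivable

  • AMT_RECIVABLE: amount receivable

  • AMT_TOTAL_RECEIVABLE: total amount receivable

  • CNT_DRAWINGS_ATM_CURRENT: number of ATM drawings during the month

  • CNT_DRAWINGS_CURRENT: number of drawings during the month

  • CNT_DRAWINGS_OTHER_CURRENT: number of other drawings during the month

  • CNT_DRAWINGS_POS_CURRENT: number of POS drawings during the month

  • CNT_INSTALMENT_MATURE_CUM: cumulative number of paid installments

  • SK_DPD: days past due

  • SK_DPD_DEF: days past due with tolerance (small overdue amounts are ignored)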

credit_card_balance.columns
Index(['SK_ID_PREV', 'SK_ID_CURR', 'MONTHS_BALANCE', 'AMT_BALANCE',
       'AMT_CREDIT_LIMIT_ACTUAL', 'AMT_DRAWINGS_ATM_CURRENT',
       'AMT_DRAWINGS_CURRENT', 'AMT_DRAWINGS_OTHER_CURRENT',
       'AMT_DRAWINGS_POS_CURRENT', 'AMT_INST_MIN_REGULARITY',
       'AMT_PAYMENT_CURRENT', 'AMT_PAYMENT_TOTAL_CURRENT',
       'AMT_RECEIVABLE_PRINCIPAL', 'AMT_RECIVABLE', 'AMT_TOTAL_RECEIVABLE',
       'CNT_DRAWINGS_ATM_CURRENT', 'CNT_DRAWINGS_CURRENT',
       'CNT_DRAWINGS_OTHER_CURRENT', 'CNT_DRAWINGS_POS_CURRENT',
       'CNT_INSTALMENT_MATURE_CUM', 'NAME_CONTRACT_STATUS', 'SK_DPD',
       'SK_DPD_DEF'],
      dtype='str')

previous_application.csv#

Records the borrower's earlier loan applications at Home Credit; 37 fields in total.

  • SK_ID_CURR: loan application ID. Matches SK_ID_CURR in application_train

  • SK_ID_PREV: ID of the previous loan. Links to SK_ID_PREV in credit_card_balance

previous_application.columns
Index(['SK_ID_PREV', 'SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'AMT_ANNUITY',
       'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
       'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
       'RATE_INTEREST_PRIVILEGED', 'NAME_CASH_LOAN_PURPOSE',
       'NAME_CONTRACT_STATUS', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
       'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE',
       'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
       'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
       'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
       'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
       'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'],
      dtype='str')

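Core info and amounts

  • NAME_CONTRACT_TYPE: loan type (cash loan, consumer loan, etc.).

  • AMT_APPLICATION: amount the client applied for.

  • AMT_CREDIT: amount the lender finally approved.

  • AMT_ANNUITY: annuity of the loan (the monthly repayment).

  • AMT_GOODS_PRICE: for a consumer loan, the actual price of the goods.

  • AMT_DOWN_PAYMENT: down payment amount.

Status and interest

  • NAME_CONTRACT_STATUS: approval status (very important: Approved, Refused, Canceled, Unused offer)

  • DAYS_DECISION: how many days before the current application the decision on this earlier application was made (negative, e.g. -100 means 100 days earlier).

  • CODE_REJECT_REASON: reason code if the application was refused.

  • RATE_DOWN_PAYMENT: down payment rate (normalized).

  • RATE_INTEREST_PRIMARY / RATE_INTEREST_PRIVILEGED: interest rate information (these two columns usually have a very large share of missing values).

Client and purpose

  • NAME_CASH_LOAN_PURPOSE: purpose of the cash loan.

  • NAME_PAYMENT_TYPE: repayment method chosen by the client.

  • NAME_CLIENT_TYPE: client type (new, repeat, refreshed, etc.).

  • NAME_GOODS_CATEGORY: category of goods purchased (computers, mobile phones, construction materials, etc.).

  • NAME_PORTFOLIO: loan portfolio (POS, cash, cards, etc.).

  • NAME_PRODUCT_TYPE: product type (x-sell for cross-sell, walk-in, etc.).

Sales and channel

  • CHANNEL_TYPE: channel through which the client was acquired (e.g. Stone, Contact center).

  • SELLERPLACE_AREA: selling area of the seller's place (read here as a proxy for shop size).

  • NAME_SELLER_INDUSTRY: industry of the seller (e.g. Connectivity, Furniture, Auto technology).

  • PRODUCT_COMBINATION: detailed name of the product combination.

Timeline and insurance

  • CNT_PAYMENT: number of installments at application time (e.g. repaid over 12 months).

  • NAME_YIELD_GROUP: yield group (how profitable the loan is for the lender, e.g. high, low, normal).

  • DAYS_FIRST_DRAWING: when the first disbursement was made.

  • DAYS_FIRST_DUE: when the first installment was due.

  • DAYS_LAST_DUE_1ST_VERSION: when the last installment was due under the original schedule.

  • DAYS_LAST_DUE: when the last installment was actually due or paid off.

  • DAYS_TERMINATION: expected termination date of the loan contract.

  • NFLAG_INSURED_ON_APPROVAL: whether the client requested insurance with the application.

  • NFLAG_LAST_APPL_IN_DAY: whether this was the client's last application of that day.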

installments_payments.csv#

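The full repayment history, covering both credit cards and POS loans.

  • NUM_INSTALMENT_VERSION: version number of the loan's repayment schedule; it changes when the schedule is modified.

  • NUM_INSTALMENT_NUMBER: which installment this row refers to.

  • DAYS_INSTALMENT: scheduled payment date.

  • DAYS_ENTRY_PAYMENT: actual payment date (key field).

  • AMT_INSTALMENT: scheduled payment amount.

  • AMT_PAYMENT: actual payment amount (key field).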
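The two "key" fields matter because comparing them with their scheduled counterparts gives direct measures of repayment behaviour. A minimal sketch on made-up rows; the derived names `DAYS_LATE` and `AMT_SHORTFALL` are hypothetical, and it assumes both day columns are measured on the same scale so that a larger value means a later date.

```python
import pandas as pd

# Hypothetical miniature of installments_payments (values invented).
inst = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 11],
    "DAYS_INSTALMENT": [-60, -30, -45],
    "DAYS_ENTRY_PAYMENT": [-62, -25, -45],
    "AMT_INSTALMENT": [1000.0, 1000.0, 500.0],
    "AMT_PAYMENT": [1000.0, 800.0, 500.0],
})

# Days paid after the due date; early or on-time payments count as 0.
inst["DAYS_LATE"] = (
    inst["DAYS_ENTRY_PAYMENT"] - inst["DAYS_INSTALMENT"]
).clip(lower=0)

# Amount still missing after the payment; overpayments count as 0.
inst["AMT_SHORTFALL"] = (
    inst["AMT_INSTALMENT"] - inst["AMT_PAYMENT"]
).clip(lower=0)

# Worst lateness and worst shortfall per previous loan.
per_loan = inst.groupby("SK_ID_PREV")[["DAYS_LATE", "AMT_SHORTFALL"]].max()
print(per_loan)
```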

installments_payments.columns
Index(['SK_ID_PREV', 'SK_ID_CURR', 'NUM_INSTALMENT_VERSION',
       'NUM_INSTALMENT_NUMBER', 'DAYS_INSTALMENT', 'DAYS_ENTRY_PAYMENT',
       'AMT_INSTALMENT', 'AMT_PAYMENT'],
      dtype='str')

pos_cash_balance.csv#

  • Linked to previous_application (through SK_ID_PREV); describes the status of POS loans

pos_cash_balance.head()
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 1803195 182943 -31 48.0 45.0 Active 0 0
1 1715348 367990 -33 36.0 35.0 Active 0 0
2 1784872 397406 -32 12.0 9.0 Active 0 0
3 1903291 269225 -35 48.0 42.0 Active 0 0
4 2341044 334279 -35 36.0 35.0 Active 0 0
pos_cash_balance.dtypes
SK_ID_PREV                 int64
SK_ID_CURR                 int64
MONTHS_BALANCE             int64
CNT_INSTALMENT           float64
CNT_INSTALMENT_FUTURE    float64
NAME_CONTRACT_STATUS         str
SK_DPD                     int64
SK_DPD_DEF                 int64
dtype: object

Metric: ROC AUC#
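The area under the ROC curve only depends on how the predicted scores rank the samples: it equals the fraction of (positive, negative) pairs in which the positive sample gets the higher score. A small sketch with four made-up samples, using scikit-learn's `roc_auc_score` and a manual pairwise count:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Four made-up samples: true labels and predicted scores.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)

# Pairwise view: share of (positive, negative) pairs ranked correctly.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(auc, pairwise)  # both 0.75: three of the four pairs are ordered correctly
```

This is also why probabilities rather than class labels are submitted: hard 0/1 labels throw away most of the ranking information.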

Exploratory Data Analysis#

  • Look for interesting patterns and features

Distribution of TARGET#

application_train['TARGET'].value_counts()[0] /application_train['TARGET'].value_counts().sum()
np.float64(0.9192711805431351)
application_train['TARGET'].value_counts().plot(kind='bar')
<Axes: xlabel='TARGET'>
../../_images/551055736298fad14bf5964281ca66773b3361f2c8939101a57822c6581ac3e3.png

This is an imbalanced classification problem: defaults make up only about 8% of the samples.
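With this much imbalance, accuracy says very little. The sketch below uses synthetic labels with a 92/8 split (an assumption that mirrors the ratio above) and a "model" that predicts no default for everyone: accuracy looks high while ROC AUC stays at chance level.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with a 92/8 split (assumption for illustration).
y_true = np.array([0] * 92 + [1] * 8)

# A trivial "model": label everyone 0, give everyone the same score.
always_zero = np.zeros_like(y_true)
constant_score = np.full(len(y_true), 0.08)

acc = accuracy_score(y_true, always_zero)
auc = roc_auc_score(y_true, constant_score)

print(acc, auc)  # 0.92 accuracy, 0.5 AUC
```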

Missing Values#

  • See which fields have many missing values

  • Most of the models used later need missing values to be handled first

def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'}
        )
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns['Missing Values'] > 0
    ]
    mis_val_table_ren_columns = mis_val_table_ren_columns.sort_values(
        '% of Total Values', ascending=False
    ).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns[mis_val_table_ren_columns['Missing Values'] > 0].shape[0]) +
          " columns that have missing values.")
    return mis_val_table_ren_columns
missing_values_table(application_train).head(10)
Your selected dataframe has 122 columns.
There are 67 columns that have missing values.
Missing Values % of Total Values
COMMONAREA_MEDI 214865 69.9
COMMONAREA_MODE 214865 69.9
COMMONAREA_AVG 214865 69.9
NONLIVINGAPARTMENTS_MODE 213514 69.4
NONLIVINGAPARTMENTS_MEDI 213514 69.4
NONLIVINGAPARTMENTS_AVG 213514 69.4
FONDKAPREMONT_MODE 210295 68.4
LIVINGAPARTMENTS_AVG 210199 68.4
LIVINGAPARTMENTS_MEDI 210199 68.4
LIVINGAPARTMENTS_MODE 210199 68.4

column types#

  • int64, float64: numeric features

  • str (object in pandas versions before 3): categorical features

application_train.dtypes.value_counts()
float64    65
int64      41
str        16
Name: count, dtype: int64
# pandas 3 stores text columns as 'str'; selecting 'object' still works there but is deprecated
application_train.select_dtypes('str').apply(pd.Series.nunique, axis=0)
NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64

Most categorical variables have only a few distinct values.

Encoding categorical variable#

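  • Most models need these categorical features preprocessed, i.e. encoded as numbers

    • label encoding or one-hot encoding

  • Use label encoding for binary features

  • Use one-hot encoding for features with more than two categories

    • pandas get_dummies(df) is a convenient way to do this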

le = LabelEncoder()
le_cnt = 0

for col in application_train:
    if application_train[col].dtype == 'object':
        if len(application_train[col].unique()) <= 2:
            # binary feature
            le.fit(application_train[col])
            application_train[col] = le.transform(application_train[col])
            application_test[col] = le.transform(application_test[col])
            le_cnt += 1
print('%d columns are label encoded.' % le_cnt)
0 columns are label encoded.
application_train = pd.get_dummies(application_train)
application_test = pd.get_dummies(application_test)
print(f"Training Features shape with one-hot : {application_train.shape}")
print(f"Testing Features shape with one-hot : {application_test.shape}")
Training Features shape with one-hot : (307511, 246)
Testing Features shape with one-hot : (48744, 242)
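  • The number of features has doubled (122 to 246 columns in train)

  • Train and test also no longer match: 246 vs 242 columns, although TARGET alone would explain a difference of only one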
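The recorded run reports 0 label-encoded columns. A likely cause is the `dtype == 'object'` test: the dtypes output earlier shows the text columns as `str`, so no column passes the check and every categorical column ends up one-hot encoded instead. A minimal sketch of a check that does not depend on the dtype name, on made-up frames (the tiny `train`/`test` frames are assumptions; the column names are borrowed from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up miniature frames for illustration.
train = pd.DataFrame({
    "FLAG_OWN_CAR": ["Y", "N", "Y", "N"],
    "NAME_HOUSING_TYPE": ["Rented", "Own", "Own", "With parents"],
    "AMT_CREDIT": [1.0, 2.0, 3.0, 4.0],
})
test = pd.DataFrame({
    "FLAG_OWN_CAR": ["N", "Y"],
    "NAME_HOUSING_TYPE": ["Own", "Rented"],
    "AMT_CREDIT": [5.0, 6.0],
})

encoded = []
for col in train.columns:
    # Treat every non-numeric column as categorical,
    # whatever name pandas gives its dtype.
    if pd.api.types.is_numeric_dtype(train[col]):
        continue
    # Label-encode only columns with at most two distinct values (NaN included).
    if train[col].nunique(dropna=False) <= 2:
        le = LabelEncoder().fit(train[col])
        train[col] = le.transform(train[col])
        test[col] = le.transform(test[col])
        encoded.append(col)

print(encoded)  # ['FLAG_OWN_CAR']
```

Columns with more categories are left untouched here and would still go through `pd.get_dummies`.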

align train and test features#

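  • Align the train and test features, because one-hot encoding created columns for categories that occur in only one of the two sets

  • Use an inner join, i.e. keep only the columns present in both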
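A small sketch of what an inner align on columns does, using made-up frames; `B_x` and `B_y` stand in for hypothetical one-hot columns, with `B_y` missing from test.

```python
import pandas as pd

# Made-up frames: B_y exists only in train, TARGET only in train.
train = pd.DataFrame({"A": [1, 2], "B_x": [1, 0], "B_y": [0, 1], "TARGET": [0, 1]})
test = pd.DataFrame({"A": [3], "B_x": [1]})

# Keep the labels aside: the inner join drops every column missing from test.
labels = train["TARGET"]

# axis=1 aligns the columns only; the rows of both frames are left untouched.
train_aligned, test_aligned = train.align(test, join="inner", axis=1)
train_aligned["TARGET"] = labels

print(train_aligned.columns.tolist())
print(test_aligned.columns.tolist())
```

The price of the inner join is that information in train-only categories (here `B_y`) is discarded.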

# Save the labels first: TARGET is not in the test set, so the inner join would drop it
train_labels = application_train['TARGET'] 

application_train, application_test = application_train.align(application_test, join='inner', axis=1)
application_train['TARGET'] = train_labels
print('Training Features shape: ', application_train.shape)
print('Testing Features shape: ', application_test.shape)
Training Features shape:  (307511, 243)
Testing Features shape:  (48744, 242)
application_train.to_feather('checkpoints/01_train_app_base.feather')
application_test.to_feather('checkpoints/01_test_app_base.feather')

Anomalies#

  • Some of the data is anomalous; the summary statistics from describe() help to spot it

(application_train['DAYS_BIRTH'] / -365).describe()
count    307511.000000
mean         43.936973
std          11.956133
min          20.517808
25%          34.008219
50%          43.150685
75%          53.923288
max          69.120548
Name: DAYS_BIRTH, dtype: float64

Nothing looks anomalous here: the ages range from about 20 to 69 years.

application_train['DAYS_EMPLOYED'].describe()
count    307511.000000
mean      63815.045904
std      141275.766519
min      -17912.000000
25%       -2760.000000
50%       -1213.000000
75%        -289.000000
max      365243.000000
Name: DAYS_EMPLOYED, dtype: float64

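This is clearly wrong: the maximum is +365,243 days (roughly 1,000 years), while the quartiles are all negative. Let's look at the histogram.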

application_train['DAYS_EMPLOYED'].plot.hist()
plt.xlabel('Days employed')
Text(0.5, 0, 'Days employed')
../../_images/f2c2974b82b62d44988e1e9f0c8bb330eaa0429e98286381e9bcd911b00fa9b5.png

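We can see that:

  • A large group of anomalous values is concentrated on the far right

  • Because the maximum is so extreme, the normal values on the left are squeezed together

We should look at the applicants with these anomalous values and see how their TARGET behaves.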

anom = application_train[application_train['DAYS_EMPLOYED'] == 365243]
non_anom = application_train[application_train['DAYS_EMPLOYED'] != 365243]
# Double-quoted f-strings so the inner ['TARGET'] also parses on Python versions before 3.12
print(f"Default rate, non-anomalous rows: {non_anom['TARGET'].mean() * 100:.2f}%")
print(f"Default rate, anomalous rows: {anom['TARGET'].mean() * 100:.2f}%")
print(f"There are {len(anom)} anomalous days of employment")
Default rate, non-anomalous rows: 8.66%
Default rate, anomalous rows: 5.40%
There are 55374 anomalous days of employment

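So the applicants with the anomalous value actually default less often (5.40% vs 8.66%).

  • These rows should not simply be dropped

  • Replacing the value with a missing value (np.nan) is a safe approach

# Also add a helper column that flags the rows where this field was anomalous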
application_train['DAYS_EMPLOYED_ANOM'] = application_train["DAYS_EMPLOYED"] == 365243

application_train['DAYS_EMPLOYED'] = application_train['DAYS_EMPLOYED'].replace({365243: np.nan})
application_train['DAYS_EMPLOYED'].plot.hist()
<Axes: ylabel='Frequency'>
../../_images/824e16b04fe9729b79c5e37269b5ff6df70c71c79c1e3bd7fe50c27aea9760c8.png

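This looks much better. We also created a column recording that the value was originally anomalous (the NaNs may later be imputed, for example with the mean).

Apply the same steps to the test set.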

application_test['DAYS_EMPLOYED_ANOM'] = application_test["DAYS_EMPLOYED"] == 365243

application_test['DAYS_EMPLOYED'] = application_test['DAYS_EMPLOYED'].replace({365243: np.nan})
application_test['DAYS_EMPLOYED'].plot.hist()
<Axes: ylabel='Frequency'>
../../_images/72951a2853118ab0c733419b7eda430f5bdfe6987948746ccbbb62fc4dc42e6f.png

correlations#

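  • After encoding, every column is numeric or boolean, so we can compute each feature's correlation with TARGET

  • df.corr() is a quick way to compute the (Pearson) correlation coefficients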

correlations = application_train.corr()
correlations = correlations['TARGET'].sort_values()
print('Top 5 positive features: \n', correlations.tail(5))
Top 5 positive features: 
 REGION_RATING_CLIENT           0.058899
REGION_RATING_CLIENT_W_CITY    0.060893
DAYS_EMPLOYED                  0.074958
DAYS_BIRTH                     0.078239
TARGET                         1.000000
Name: TARGET, dtype: float64
print('TOP 5 negative features: \n', correlations.head(5))
TOP 5 negative features: 
 EXT_SOURCE_3                           -0.178919
EXT_SOURCE_2                           -0.160472
EXT_SOURCE_1                           -0.155317
NAME_EDUCATION_TYPE_Higher education   -0.056593
CODE_GENDER_F                          -0.054704
Name: TARGET, dtype: float64

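DAYS_BIRTH: the age factor#

The strongest positive correlations are DAYS_BIRTH and DAYS_EMPLOYED.

  • DAYS_BIRTH is stored as a negative number of days, so the positive coefficient on the raw value corresponds to a negative relationship with age

    • In other words, the older the client, the lower the default risk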
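The sign flip is purely mechanical: negating a variable negates its Pearson correlation with anything else. A sketch on synthetic data, where the age range and the assumed falling default probability are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic ages and a made-up default probability that falls with age.
age_years = rng.uniform(20, 70, size=1000)
p_default = 0.15 - 0.002 * (age_years - 20)
target = rng.binomial(1, p_default)

# Store age the way the dataset does: negative days before the application.
days_birth = -365 * age_years

r_raw = np.corrcoef(days_birth, target)[0, 1]
r_abs = np.corrcoef(np.abs(days_birth), target)[0, 1]

print(r_raw, r_abs)  # same magnitude, opposite sign
```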

Taking the absolute value makes this explicit:

application_train['DAYS_BIRTH'] = abs(application_train['DAYS_BIRTH'])
application_train['DAYS_BIRTH'].corr(application_train['TARGET'])
np.float64(-0.07823930830982709)

Can we conclude that older clients are less likely to default?

Histogram of the age distribution:

plt.style.use('fivethirtyeight')
plt.hist(application_train['DAYS_BIRTH'] / 365, bins=25, edgecolor='k')
plt.title('age distribution')
plt.xlabel('AGE')
Text(0.5, 0, 'AGE')
../../_images/8e0314533902f0b9e6b2a0d8e24caffc523322910220fbdef83cc01734caa6fd.png

The age distribution looks reasonable. Next, the kernel density estimate (a smoothed histogram) split by TARGET:

sns.kdeplot(
    application_train.loc[application_train['TARGET']==0, 'DAYS_BIRTH'] / 365, 
    label = 'target=0'
)
sns.kdeplot(
    application_train.loc[application_train['TARGET']==1, 'DAYS_BIRTH'] / 365, 
    label = 'target=1'
)
plt.xlabel('Age (years)')
plt.title('Age distribution ')
plt.legend()
<matplotlib.legend.Legend at 0x1229cf8aba0>
../../_images/88f2ed67cb5e1c3e5b99e108b283de66076419f0fe0db8301d332a759802d056.png

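The target=1 curve is clearly skewed towards younger clients.

From another angle, we can bin the ages and plot the average default rate of each age group as a bar chart.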

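# .copy() so the new columns are not written to a view of application_train
age_data = application_train[['TARGET', 'DAYS_BIRTH']].copy()
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365
# bin the ages into ten 5-year intervals from 20 to 70
age_data['YEARS_BINNED'] = pd.cut(
    age_data['YEARS_BIRTH'],
    bins = np.linspace(20, 70, num=11),
)
# observed=True makes the categorical grouping explicit across pandas versions
age_groups = age_data.groupby('YEARS_BINNED', observed=True).mean()
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])
plt.xticks(rotation=75)
plt.xlabel('Age group')
plt.ylabel('Default rate (%)')
plt.title('Default rate by age group')
Text(0.5, 1.0, 'Default rate by age group')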
../../_images/1d5a02134f8ac12ec9e2ed927592b6b56ef372ad049927c51b9e7eae0bee5e7b.png

This confirms it: younger clients are more likely to default.

Exterior Sources#

EXT_SOURCE_1 to EXT_SOURCE_3 are the three most negatively correlated features.

ext_data = application_train[['TARGET', 'EXT_SOURCE_1',  'EXT_SOURCE_2', 'EXT_SOURCE_3']]
ext_data_corrs = ext_data.corr()
ext_data_corrs
TARGET EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3
TARGET 1.000000 -0.155317 -0.160472 -0.178919
EXT_SOURCE_1 -0.155317 1.000000 0.213982 0.186846
EXT_SOURCE_2 -0.160472 0.213982 1.000000 0.109167
EXT_SOURCE_3 -0.178919 0.186846 0.109167 1.000000
sns.heatmap(
    ext_data_corrs,
    cmap = plt.cm.RdYlBu_r,
    vmin = -0.25,
    vmax = 0.5,
    annot = True
)
plt.title('correlations map')
Text(0.5, 1.0, 'correlations map')
../../_images/685d2574bc70562796c7abea7fd900442ad22140a0ae8a5182bd4c61fe17c4b5.png

Now look at their distributions for each value of TARGET:

plt.figure(figsize = (6,6))
for i, source in enumerate(['EXT_SOURCE_1',  'EXT_SOURCE_2', 'EXT_SOURCE_3']):
    plt.subplot(3, 1 , i + 1)
    sns.kdeplot(
        application_train.loc[application_train['TARGET'] == 0, source],
        label = 'target = 0'
    )
    sns.kdeplot(
        application_train.loc[application_train['TARGET'] == 1, source],
        label = 'target = 1'
    )
    plt.title(f'Distribution of {source} by target')
plt.tight_layout()
../../_images/ed118b1923b33c30416f96c0512484619ee7cfcaa218f9d187ddc68cfd91590e.png

Relatively speaking, EXT_SOURCE_3 and EXT_SOURCE_1 appear to have the stronger relationship with TARGET.

pairs plot#

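  • We found that DAYS_BIRTH and the EXT_SOURCE_* columns are the features most correlated with the target; the relationships among these features themselves still need exploring

  • A pairs plot shows the relationship between each pair of variables and the distribution of each single variable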

plot_data = application_train[['TARGET', 'DAYS_BIRTH', 'EXT_SOURCE_3', 'EXT_SOURCE_1']]
plot_data = plot_data.dropna().loc[:1000, :]
def plot_crr(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca() # get the current axes
    ax.annotate(f"r = {r:.2f}", xy=(0.5, 0.5))
grid = sns.PairGrid(data = plot_data, height = 3, diag_sharey=False,
                    hue = 'TARGET', 
                    vars = [x for x in list(plot_data.columns) if x != 'TARGET'])

grid.map_upper(plt.scatter, alpha = 0.2)
grid.map_diag(sns.kdeplot)
grid.map_lower(sns.kdeplot)
# PairGrid keeps track of the hue labels; a bare plt.legend() finds no labelled artists
grid.add_legend();
../../_images/55e870c0d48e3a4b200adb11b1fb60c262e06d19bb5e055e030824ef6807ab9b.png

Note: because TARGET is imbalanced, the plots here look a bit odd.

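Feature engineering#

  • Competitions are usually won or lost on feature engineering; the winning models are often some variant of gradient boosting

  • Feature engineering tends to give a better return on investment than model building and hyperparameter tuning.

  • Feature engineering means constructing new features and selecting among features

Simple ways to construct features:

  • Polynomial features

  • Domain knowledge features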
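Domain knowledge features are typically simple ratios that a credit analyst would look at. A minimal sketch on made-up rows: the four ratio names are hypothetical choices for illustration, while the input column names come from application_train.

```python
import numpy as np
import pandas as pd

# Made-up rows using column names from application_train.
app = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000.0, 200000.0],
    "AMT_CREDIT": [400000.0, 300000.0],
    "AMT_ANNUITY": [20000.0, 30000.0],
    "DAYS_BIRTH": [-12000, -18000],
    "DAYS_EMPLOYED": [-1200.0, np.nan],
})

# Hypothetical ratio features (names are illustrative, not from the dataset).
app["CREDIT_INCOME_RATIO"] = app["AMT_CREDIT"] / app["AMT_INCOME_TOTAL"]
app["ANNUITY_INCOME_RATIO"] = app["AMT_ANNUITY"] / app["AMT_INCOME_TOTAL"]
app["ANNUITY_CREDIT_RATIO"] = app["AMT_ANNUITY"] / app["AMT_CREDIT"]
# Share of the client's life spent in the current job; NaN propagates.
app["EMPLOYED_BIRTH_RATIO"] = app["DAYS_EMPLOYED"] / app["DAYS_BIRTH"]

print(app.filter(like="RATIO"))
```

Whether any of these helps is an empirical question; they would be checked against TARGET in the same way as the polynomial features.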

application_train = pd.read_feather('checkpoints/01_train_app_base.feather')
application_test = pd.read_feather('checkpoints/01_test_app_base.feather')
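Reloading from the checkpoint means that changes made in memory after it was written (the DAYS_EMPLOYED replacement, `abs(DAYS_BIRTH)`) are only present if the checkpoint was saved again afterwards. The recorded outputs further down show positive DAYS_BIRTH values in the train rows and negative ones in the test rows, so a quick consistency check may be worth running before building features. A sketch with a hypothetical helper, `same_sign_convention`, on made-up frames:

```python
import pandas as pd

def same_sign_convention(train: pd.DataFrame, test: pd.DataFrame, col: str) -> bool:
    """Return True when `col` is mostly positive (or mostly non-positive) in both frames."""
    train_mostly_positive = (train[col] > 0).mean() > 0.5
    test_mostly_positive = (test[col] > 0).mean() > 0.5
    return bool(train_mostly_positive == test_mostly_positive)

# Made-up frames reproducing the mismatch.
train = pd.DataFrame({"DAYS_BIRTH": [9461, 16765, 19046]})
test = pd.DataFrame({"DAYS_BIRTH": [-19241, -18064, -20038]})

mismatch_before = not same_sign_convention(train, test, "DAYS_BIRTH")

if mismatch_before:
    # Apply the same transformation to both frames.
    train["DAYS_BIRTH"] = train["DAYS_BIRTH"].abs()
    test["DAYS_BIRTH"] = test["DAYS_BIRTH"].abs()

print(mismatch_before, same_sign_convention(train, test, "DAYS_BIRTH"))
```

A mismatch matters for polynomial features in particular, because odd powers of DAYS_BIRTH would then carry opposite signs in train and test.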

Polynomial Features#

poly_features = application_train[['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3', 'DAYS_BIRTH']]
poly_features_test = application_test[['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3', 'DAYS_BIRTH']]
poly_target = application_train['TARGET']
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median')
poly_features = imputer.fit_transform(poly_features)
poly_features_test = imputer.transform(poly_features_test)  # reuse the medians learned on train; do not refit on test
from sklearn.preprocessing import  PolynomialFeatures
poly_transformer = PolynomialFeatures(degree=3)
poly_transformer.fit(poly_features)
PolynomialFeatures(degree=3)
poly_features = poly_transformer.transform(poly_features)
poly_features_test = poly_transformer.transform(poly_features_test)
print('Polynomial features shape: ', poly_features.shape)
Polynomial features shape:  (307511, 35)
poly_transformer.get_feature_names_out()
array(['1', 'x0', 'x1', 'x2', 'x3', 'x0^2', 'x0 x1', 'x0 x2', 'x0 x3',
       'x1^2', 'x1 x2', 'x1 x3', 'x2^2', 'x2 x3', 'x3^2', 'x0^3',
       'x0^2 x1', 'x0^2 x2', 'x0^2 x3', 'x0 x1^2', 'x0 x1 x2', 'x0 x1 x3',
       'x0 x2^2', 'x0 x2 x3', 'x0 x3^2', 'x1^3', 'x1^2 x2', 'x1^2 x3',
       'x1 x2^2', 'x1 x2 x3', 'x1 x3^2', 'x2^3', 'x2^2 x3', 'x2 x3^2',
       'x3^3'], dtype=object)

Now let's see how the generated features relate to TARGET.

poly_features = pd.DataFrame(
    poly_features,
    columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3', 'DAYS_BIRTH'])
)
poly_features
1 EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH EXT_SOURCE_1^2 EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_1 EXT_SOURCE_3 EXT_SOURCE_1 DAYS_BIRTH EXT_SOURCE_2^2 ... EXT_SOURCE_2^3 EXT_SOURCE_2^2 EXT_SOURCE_3 EXT_SOURCE_2^2 DAYS_BIRTH EXT_SOURCE_2 EXT_SOURCE_3^2 EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH EXT_SOURCE_2 DAYS_BIRTH^2 EXT_SOURCE_3^3 EXT_SOURCE_3^2 DAYS_BIRTH EXT_SOURCE_3 DAYS_BIRTH^2 DAYS_BIRTH^3
0 1.0 0.083037 0.262949 0.139376 9461.0 0.006895 0.021834 0.011573 785.612748 0.069142 ... 0.018181 0.009637 654.152107 0.005108 346.733022 2.353667e+07 0.002707 183.785678 1.247560e+07 8.468590e+11
1 1.0 0.311267 0.622246 0.535276 16765.0 0.096887 0.193685 0.166614 5218.396475 0.387190 ... 0.240927 0.207254 6491.237078 0.178286 5583.975307 1.748916e+08 0.153368 4803.518937 1.504475e+08 4.712058e+12
2 1.0 0.505998 0.555912 0.729567 19046.0 0.256034 0.281290 0.369159 9637.236584 0.309038 ... 0.171798 0.225464 5885.942404 0.295894 7724.580288 2.016572e+08 0.388325 10137.567875 2.646504e+08 6.908939e+12
3 1.0 0.505998 0.650442 0.535276 19005.0 0.256034 0.329122 0.270849 9616.490669 0.423074 ... 0.275185 0.226462 8040.528832 0.186365 6616.894625 2.349331e+08 0.153368 5445.325225 1.933364e+08 6.864416e+12
4 1.0 0.505998 0.322738 0.535276 19932.0 0.256034 0.163305 0.270849 10085.550751 0.104160 ... 0.033616 0.055754 2076.117157 0.092471 3443.335521 1.282190e+08 0.153368 5710.929881 2.126570e+08 7.918677e+12
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 1.0 0.145570 0.681632 0.535276 9327.0 0.021191 0.099226 0.077920 1357.735625 0.464623 ... 0.316702 0.248701 4333.535804 0.195302 3403.064320 5.929720e+07 0.153368 2672.378236 4.656525e+07 8.113830e+11
307507 1.0 0.505998 0.115992 0.535276 20775.0 0.256034 0.058692 0.270849 10512.107006 0.013454 ... 0.001561 0.007202 279.510194 0.033234 1289.874083 5.006225e+07 0.153368 5952.466801 2.310256e+08 8.966503e+12
307508 1.0 0.744026 0.535722 0.218859 14966.0 0.553575 0.398591 0.162837 11135.099105 0.286998 ... 0.153751 0.062812 4295.209004 0.025661 1754.727146 1.199916e+08 0.010483 716.860892 4.902031e+07 3.352102e+12
307509 1.0 0.505998 0.514163 0.661024 11961.0 0.256034 0.260165 0.334477 6052.241247 0.264363 ... 0.135926 0.174750 3162.050698 0.224665 4065.229651 7.355897e+07 0.288836 5226.384299 9.456968e+07 1.711207e+12
307510 1.0 0.734460 0.708569 0.113922 16856.0 0.539431 0.520415 0.083671 12380.052173 0.502070 ... 0.355751 0.057197 8462.889915 0.009196 1360.647784 2.013220e+08 0.001479 218.762433 3.236817e+07 4.789207e+12

307511 rows × 35 columns

poly_features['TARGET'] = poly_target
poly_corrs = poly_features.corr()['TARGET'].sort_values()
# sort_values() is ascending, so head() holds the most negative correlations
print('Most negative correlations:\n', poly_corrs.head(5))
print('Most positive correlations:\n', poly_corrs.tail(5))
Most negative correlations:
 EXT_SOURCE_2 EXT_SOURCE_3                -0.193939
EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3   -0.189605
EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH     -0.181283
EXT_SOURCE_2^2 EXT_SOURCE_3              -0.176428
EXT_SOURCE_2 EXT_SOURCE_3^2              -0.172282
Name: TARGET, dtype: float64
Most positive correlations:
 DAYS_BIRTH     -0.078239
DAYS_BIRTH^2   -0.076672
DAYS_BIRTH^3   -0.074273
TARGET          1.000000
1                    NaN
Name: TARGET, dtype: float64

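Some of the new features correlate more strongly with TARGET (in absolute value) than the original features do, so they are worth trying in a model.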

poly_features_test = pd.DataFrame(
    poly_features_test,
    columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3', 'DAYS_BIRTH'])
)
poly_features_test
1 EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH EXT_SOURCE_1^2 EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_1 EXT_SOURCE_3 EXT_SOURCE_1 DAYS_BIRTH EXT_SOURCE_2^2 ... EXT_SOURCE_2^3 EXT_SOURCE_2^2 EXT_SOURCE_3 EXT_SOURCE_2^2 DAYS_BIRTH EXT_SOURCE_2 EXT_SOURCE_3^2 EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH EXT_SOURCE_2 DAYS_BIRTH^2 EXT_SOURCE_3^3 EXT_SOURCE_3^2 DAYS_BIRTH EXT_SOURCE_3 DAYS_BIRTH^2 DAYS_BIRTH^3
0 1.0 0.752614 0.789654 0.159520 -19241.0 0.566429 0.594305 0.120057 -14481.055414 0.623554 ... 0.492392 0.099469 -11997.802403 0.020094 -2423.698322 2.923427e+08 0.004059 -489.615795 5.905670e+07 -7.123328e+12
1 1.0 0.564990 0.291656 0.432962 -18064.0 0.319214 0.164783 0.244619 -10205.983005 0.085063 ... 0.024809 0.036829 -1536.577117 0.054673 -2281.043619 9.516956e+07 0.081161 -3386.201665 1.412789e+08 -5.894429e+12
2 1.0 0.506771 0.699787 0.610991 -20038.0 0.256817 0.354632 0.309633 -10154.682538 0.489702 ... 0.342687 0.299203 -9812.640816 0.261238 -8567.521115 2.809794e+08 0.228089 -7480.393855 2.453261e+08 -8.045687e+12
3 1.0 0.525734 0.509677 0.612704 -13976.0 0.276396 0.267955 0.322119 -7347.658072 0.259771 ... 0.132399 0.159163 -3630.555667 0.191336 -4364.443591 9.955450e+07 0.230013 -5246.681115 1.196786e+08 -2.729912e+12
4 1.0 0.202145 0.425687 0.519097 -13040.0 0.040863 0.086051 0.104933 -2635.970697 0.181210 ... 0.077139 0.094065 -2362.974127 0.114707 -2881.489762 7.238455e+07 0.139877 -3513.785087 8.826814e+07 -2.217342e+12
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48739 1.0 0.506771 0.648575 0.643026 -19970.0 0.256817 0.328679 0.325867 -10120.222092 0.420649 ... 0.272823 0.270488 -8400.368742 0.268174 -8328.493414 2.586523e+08 0.265879 -8257.233066 2.564392e+08 -7.964054e+12
48740 1.0 0.506771 0.684596 0.519097 -11186.0 0.256817 0.346933 0.263064 -5668.743331 0.468671 ... 0.320850 0.243286 -5242.555692 0.184473 -3975.188577 8.566112e+07 0.139877 -3014.202453 6.495288e+07 -1.399666e+12
48741 1.0 0.733503 0.632770 0.283712 -15922.0 0.538027 0.464139 0.208104 -11678.842724 0.400397 ... 0.253359 0.113597 -6375.125880 0.050933 -2858.384957 1.604135e+08 0.022837 -1281.600508 7.192382e+07 -4.036388e+12
48742 1.0 0.373090 0.445701 0.595456 -13968.0 0.139196 0.166287 0.222159 -5211.322249 0.198649 ... 0.088538 0.118287 -2774.734348 0.158031 -3707.043157 8.695850e+07 0.211130 -4952.607075 1.161765e+08 -2.725227e+12
48743 1.0 0.506771 0.456541 0.272134 -13962.0 0.256817 0.231362 0.137910 -7075.540353 0.208429 ... 0.095156 0.056721 -2910.091018 0.033810 -1734.640191 8.899687e+07 0.020153 -1033.980235 5.304904e+07 -2.721717e+12

48744 rows × 35 columns

poly_features['SK_ID_CURR'] = application_train['SK_ID_CURR']
application_train_poly = application_train.merge(poly_features, on = 'SK_ID_CURR', how='left')

poly_features_test['SK_ID_CURR'] = application_test['SK_ID_CURR']
application_test_poly = application_test.merge(poly_features_test, on = 'SK_ID_CURR', how='left')
print(application_train_poly.shape, application_test_poly.shape)
(307511, 280) (48744, 278)

Align the train and test columns:

application_train_poly, application_test_poly = application_train_poly.align(
    application_test_poly, join='inner',axis = 1
)
print(application_train_poly.shape, application_test_poly.shape)
(307511, 278) (48744, 278)
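The drop from 280 to 278 training columns is what an inner join on columns does: anything the test frame lacks, including TARGET, is removed. A small sketch with hypothetical toy frames (the column `ONLY_IN_TRAIN` is made up) shows the effect and why the labels should be saved beforehand:

```python
import pandas as pd

# Toy frames: train has a label column and one extra column that test lacks.
train_toy = pd.DataFrame({
    'SK_ID_CURR': [1, 2],
    'A': [0.1, 0.2],
    'ONLY_IN_TRAIN': [1, 0],
    'TARGET': [0, 1],
})
test_toy = pd.DataFrame({'SK_ID_CURR': [3], 'A': [0.3]})

# Save the labels first: the inner join drops TARGET because test has no such column.
labels = train_toy['TARGET']

train_aligned, test_aligned = train_toy.align(test_toy, join='inner', axis=1)
print(list(train_aligned.columns))
print(list(test_aligned.columns))
```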

Domain Knowledge Features#

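  • Domain knowledge: use an understanding of the business to construct features by hand.

    • Credit example: the raw data holds monthly income and monthly repayment; a domain-knowledge feature is the debt ratio (monthly repayment / monthly income).

We build four domain-knowledge features:

  • CREDIT_INCOME_PERCENT: credit amount as a fraction of income

  • ANNUITY_INCOME_PERCENT: loan annuity as a fraction of income

  • CREDIT_TERM: loan annuity divided by credit amount (the share of the credit repaid per payment, so its inverse approximates the number of payments)

  • DAYS_EMPLOYED_PERCENT: days employed as a fraction of age in days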

print(application_train.shape)
(307511, 244)
application_train_domain = application_train.copy()
application_test_domain = application_test.copy()
application_train_domain['CREDIT_INCOME_PERCENT'] = application_train_domain['AMT_CREDIT']/application_train_domain['AMT_INCOME_TOTAL']
application_train_domain['ANNUITY_INCOME_PERCENT'] = application_train_domain['AMT_ANNUITY'] / application_train_domain['AMT_INCOME_TOTAL']
application_train_domain['CREDIT_TERM'] = application_train_domain['AMT_ANNUITY'] / application_train_domain['AMT_CREDIT']
application_train_domain['DAYS_EMPLOYED_PERCENT'] = application_train_domain['DAYS_EMPLOYED'] / application_train_domain['DAYS_BIRTH']
application_test_domain['CREDIT_INCOME_PERCENT'] = application_test_domain['AMT_CREDIT']/application_test_domain['AMT_INCOME_TOTAL']
application_test_domain['ANNUITY_INCOME_PERCENT'] = application_test_domain['AMT_ANNUITY'] / application_test_domain['AMT_INCOME_TOTAL']
application_test_domain['CREDIT_TERM'] = application_test_domain['AMT_ANNUITY'] / application_test_domain['AMT_CREDIT']
application_test_domain['DAYS_EMPLOYED_PERCENT'] = application_test_domain['DAYS_EMPLOYED'] / application_test_domain['DAYS_BIRTH']
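Because the same four ratios are computed once per table, one option is a small helper applied to each frame, so every ratio can only use columns of the frame it is given. The function name `add_domain_features` and the toy applicant below are hypothetical, a sketch rather than the notebook's own code.

```python
import pandas as pd

def add_domain_features(df):
    """Return a copy of df with four ratio features, each built from df's own columns."""
    out = df.copy()
    out['CREDIT_INCOME_PERCENT'] = out['AMT_CREDIT'] / out['AMT_INCOME_TOTAL']
    out['ANNUITY_INCOME_PERCENT'] = out['AMT_ANNUITY'] / out['AMT_INCOME_TOTAL']
    out['CREDIT_TERM'] = out['AMT_ANNUITY'] / out['AMT_CREDIT']
    out['DAYS_EMPLOYED_PERCENT'] = out['DAYS_EMPLOYED'] / out['DAYS_BIRTH']
    return out

# One synthetic applicant (values invented for illustration).
toy = pd.DataFrame({
    'AMT_CREDIT': [400000.0],
    'AMT_INCOME_TOTAL': [200000.0],
    'AMT_ANNUITY': [20000.0],
    'DAYS_EMPLOYED': [-1000.0],
    'DAYS_BIRTH': [-10000.0],
})
toy_domain = add_domain_features(toy)
print(toy_domain.iloc[0, -4:])
```

With the real tables the calls would be `add_domain_features(application_train)` and `add_domain_features(application_test)`.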

Let's see how these new variables relate to the target:

plt.figure(figsize=(12,20))
for i, feature in enumerate(['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT']):
    plt.subplot(4,1, i+1)
    sns.kdeplot(application_train_domain.loc[application_train_domain['TARGET'] == 0, feature], label = 'target = 0')
    sns.kdeplot(application_train_domain.loc[application_train_domain['TARGET'] == 1, feature], label = 'target = 1')
    plt.title(f'distribution of {feature} by target')
    plt.legend()  # the labels above are only drawn once a legend is requested
plt.tight_layout()
[Figure: KDE plots of CREDIT_INCOME_PERCENT, ANNUITY_INCOME_PERCENT, CREDIT_TERM, and DAYS_EMPLOYED_PERCENT by target]

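For each new feature, the distributions for defaulters and non-defaulters look almost identical, so none of them looks very useful on its own. Still, let's try them in a model and see. 🙌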
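A visual impression like this can be backed by two numbers per feature: the per-class median and the correlation with the target. The sketch below uses a made-up six-row frame with a placeholder column `FEATURE`; on the real data the same two lines would be run for each of the four new columns.

```python
import pandas as pd

# Made-up data: four non-defaulters (0) and two defaulters (1).
toy = pd.DataFrame({
    'TARGET':  [0, 0, 0, 0, 1, 1],
    'FEATURE': [0.10, 0.20, 0.30, 0.40, 0.25, 0.35],
})

# Median of the feature within each class, and its linear correlation with the target.
medians = toy.groupby('TARGET')['FEATURE'].median()
corr = toy['FEATURE'].corr(toy['TARGET'])

print(medians)
print(corr)
```

Medians that sit close together and a correlation near zero would agree with what the density plots suggest.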

baseline#

As a baseline:

  • Use all features

  • Fill in missing values

  • Normalize the data so every feature is on the same scale
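The three steps above can also be chained in a scikit-learn pipeline, which fits the imputer and scaler on the training data only and reuses them at prediction time. This is a sketch on synthetic data, not the notebook's own code; the array names are placeholders.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features and labels; the label depends on the first column plus noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan  # inject roughly 10% missing values

baseline = make_pipeline(
    SimpleImputer(strategy='median'),    # fill missing values with the column median
    MinMaxScaler(feature_range=(0, 1)),  # rescale every feature to [0, 1]
    LogisticRegression(C=0.001),         # strongly regularized linear model
)
baseline.fit(X_toy, y_toy)
probs = baseline.predict_proba(X_toy)[:, 1]  # probability of class 1
print(probs[:5])
```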

print(application_train.shape)
list(application_train.columns)
(307511, 244)
['SK_ID_CURR',
 'CNT_CHILDREN',
 'AMT_INCOME_TOTAL',
 'AMT_CREDIT',
 'AMT_ANNUITY',
 'AMT_GOODS_PRICE',
 'REGION_POPULATION_RELATIVE',
 'DAYS_BIRTH',
 'DAYS_EMPLOYED',
 'DAYS_REGISTRATION',
 'DAYS_ID_PUBLISH',
 'OWN_CAR_AGE',
 'FLAG_MOBIL',
 'FLAG_EMP_PHONE',
 'FLAG_WORK_PHONE',
 'FLAG_CONT_MOBILE',
 'FLAG_PHONE',
 'FLAG_EMAIL',
 'CNT_FAM_MEMBERS',
 'REGION_RATING_CLIENT',
 'REGION_RATING_CLIENT_W_CITY',
 'HOUR_APPR_PROCESS_START',
 'REG_REGION_NOT_LIVE_REGION',
 'REG_REGION_NOT_WORK_REGION',
 'LIVE_REGION_NOT_WORK_REGION',
 'REG_CITY_NOT_LIVE_CITY',
 'REG_CITY_NOT_WORK_CITY',
 'LIVE_CITY_NOT_WORK_CITY',
 'EXT_SOURCE_1',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'APARTMENTS_AVG',
 'BASEMENTAREA_AVG',
 'YEARS_BEGINEXPLUATATION_AVG',
 'YEARS_BUILD_AVG',
 'COMMONAREA_AVG',
 'ELEVATORS_AVG',
 'ENTRANCES_AVG',
 'FLOORSMAX_AVG',
 'FLOORSMIN_AVG',
 'LANDAREA_AVG',
 'LIVINGAPARTMENTS_AVG',
 'LIVINGAREA_AVG',
 'NONLIVINGAPARTMENTS_AVG',
 'NONLIVINGAREA_AVG',
 'APARTMENTS_MODE',
 'BASEMENTAREA_MODE',
 'YEARS_BEGINEXPLUATATION_MODE',
 'YEARS_BUILD_MODE',
 'COMMONAREA_MODE',
 'ELEVATORS_MODE',
 'ENTRANCES_MODE',
 'FLOORSMAX_MODE',
 'FLOORSMIN_MODE',
 'LANDAREA_MODE',
 'LIVINGAPARTMENTS_MODE',
 'LIVINGAREA_MODE',
 'NONLIVINGAPARTMENTS_MODE',
 'NONLIVINGAREA_MODE',
 'APARTMENTS_MEDI',
 'BASEMENTAREA_MEDI',
 'YEARS_BEGINEXPLUATATION_MEDI',
 'YEARS_BUILD_MEDI',
 'COMMONAREA_MEDI',
 'ELEVATORS_MEDI',
 'ENTRANCES_MEDI',
 'FLOORSMAX_MEDI',
 'FLOORSMIN_MEDI',
 'LANDAREA_MEDI',
 'LIVINGAPARTMENTS_MEDI',
 'LIVINGAREA_MEDI',
 'NONLIVINGAPARTMENTS_MEDI',
 'NONLIVINGAREA_MEDI',
 'TOTALAREA_MODE',
 'OBS_30_CNT_SOCIAL_CIRCLE',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'OBS_60_CNT_SOCIAL_CIRCLE',
 'DEF_60_CNT_SOCIAL_CIRCLE',
 'DAYS_LAST_PHONE_CHANGE',
 'FLAG_DOCUMENT_2',
 'FLAG_DOCUMENT_3',
 'FLAG_DOCUMENT_4',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_6',
 'FLAG_DOCUMENT_7',
 'FLAG_DOCUMENT_8',
 'FLAG_DOCUMENT_9',
 'FLAG_DOCUMENT_10',
 'FLAG_DOCUMENT_11',
 'FLAG_DOCUMENT_12',
 'FLAG_DOCUMENT_13',
 'FLAG_DOCUMENT_14',
 'FLAG_DOCUMENT_15',
 'FLAG_DOCUMENT_16',
 'FLAG_DOCUMENT_17',
 'FLAG_DOCUMENT_18',
 'FLAG_DOCUMENT_19',
 'FLAG_DOCUMENT_20',
 'FLAG_DOCUMENT_21',
 'AMT_REQ_CREDIT_BUREAU_HOUR',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_YEAR',
 'NAME_CONTRACT_TYPE_Cash loans',
 'NAME_CONTRACT_TYPE_Revolving loans',
 'CODE_GENDER_F',
 'CODE_GENDER_M',
 'FLAG_OWN_CAR_N',
 'FLAG_OWN_CAR_Y',
 'FLAG_OWN_REALTY_N',
 'FLAG_OWN_REALTY_Y',
 'NAME_TYPE_SUITE_Children',
 'NAME_TYPE_SUITE_Family',
 'NAME_TYPE_SUITE_Group of people',
 'NAME_TYPE_SUITE_Other_A',
 'NAME_TYPE_SUITE_Other_B',
 'NAME_TYPE_SUITE_Spouse, partner',
 'NAME_TYPE_SUITE_Unaccompanied',
 'NAME_INCOME_TYPE_Businessman',
 'NAME_INCOME_TYPE_Commercial associate',
 'NAME_INCOME_TYPE_Pensioner',
 'NAME_INCOME_TYPE_State servant',
 'NAME_INCOME_TYPE_Student',
 'NAME_INCOME_TYPE_Unemployed',
 'NAME_INCOME_TYPE_Working',
 'NAME_EDUCATION_TYPE_Academic degree',
 'NAME_EDUCATION_TYPE_Higher education',
 'NAME_EDUCATION_TYPE_Incomplete higher',
 'NAME_EDUCATION_TYPE_Lower secondary',
 'NAME_EDUCATION_TYPE_Secondary / secondary special',
 'NAME_FAMILY_STATUS_Civil marriage',
 'NAME_FAMILY_STATUS_Married',
 'NAME_FAMILY_STATUS_Separated',
 'NAME_FAMILY_STATUS_Single / not married',
 'NAME_FAMILY_STATUS_Widow',
 'NAME_HOUSING_TYPE_Co-op apartment',
 'NAME_HOUSING_TYPE_House / apartment',
 'NAME_HOUSING_TYPE_Municipal apartment',
 'NAME_HOUSING_TYPE_Office apartment',
 'NAME_HOUSING_TYPE_Rented apartment',
 'NAME_HOUSING_TYPE_With parents',
 'OCCUPATION_TYPE_Accountants',
 'OCCUPATION_TYPE_Cleaning staff',
 'OCCUPATION_TYPE_Cooking staff',
 'OCCUPATION_TYPE_Core staff',
 'OCCUPATION_TYPE_Drivers',
 'OCCUPATION_TYPE_HR staff',
 'OCCUPATION_TYPE_High skill tech staff',
 'OCCUPATION_TYPE_IT staff',
 'OCCUPATION_TYPE_Laborers',
 'OCCUPATION_TYPE_Low-skill Laborers',
 'OCCUPATION_TYPE_Managers',
 'OCCUPATION_TYPE_Medicine staff',
 'OCCUPATION_TYPE_Private service staff',
 'OCCUPATION_TYPE_Realty agents',
 'OCCUPATION_TYPE_Sales staff',
 'OCCUPATION_TYPE_Secretaries',
 'OCCUPATION_TYPE_Security staff',
 'OCCUPATION_TYPE_Waiters/barmen staff',
 'WEEKDAY_APPR_PROCESS_START_FRIDAY',
 'WEEKDAY_APPR_PROCESS_START_MONDAY',
 'WEEKDAY_APPR_PROCESS_START_SATURDAY',
 'WEEKDAY_APPR_PROCESS_START_SUNDAY',
 'WEEKDAY_APPR_PROCESS_START_THURSDAY',
 'WEEKDAY_APPR_PROCESS_START_TUESDAY',
 'WEEKDAY_APPR_PROCESS_START_WEDNESDAY',
 'ORGANIZATION_TYPE_Advertising',
 'ORGANIZATION_TYPE_Agriculture',
 'ORGANIZATION_TYPE_Bank',
 'ORGANIZATION_TYPE_Business Entity Type 1',
 'ORGANIZATION_TYPE_Business Entity Type 2',
 'ORGANIZATION_TYPE_Business Entity Type 3',
 'ORGANIZATION_TYPE_Cleaning',
 'ORGANIZATION_TYPE_Construction',
 'ORGANIZATION_TYPE_Culture',
 'ORGANIZATION_TYPE_Electricity',
 'ORGANIZATION_TYPE_Emergency',
 'ORGANIZATION_TYPE_Government',
 'ORGANIZATION_TYPE_Hotel',
 'ORGANIZATION_TYPE_Housing',
 'ORGANIZATION_TYPE_Industry: type 1',
 'ORGANIZATION_TYPE_Industry: type 10',
 'ORGANIZATION_TYPE_Industry: type 11',
 'ORGANIZATION_TYPE_Industry: type 12',
 'ORGANIZATION_TYPE_Industry: type 13',
 'ORGANIZATION_TYPE_Industry: type 2',
 'ORGANIZATION_TYPE_Industry: type 3',
 'ORGANIZATION_TYPE_Industry: type 4',
 'ORGANIZATION_TYPE_Industry: type 5',
 'ORGANIZATION_TYPE_Industry: type 6',
 'ORGANIZATION_TYPE_Industry: type 7',
 'ORGANIZATION_TYPE_Industry: type 8',
 'ORGANIZATION_TYPE_Industry: type 9',
 'ORGANIZATION_TYPE_Insurance',
 'ORGANIZATION_TYPE_Kindergarten',
 'ORGANIZATION_TYPE_Legal Services',
 'ORGANIZATION_TYPE_Medicine',
 'ORGANIZATION_TYPE_Military',
 'ORGANIZATION_TYPE_Mobile',
 'ORGANIZATION_TYPE_Other',
 'ORGANIZATION_TYPE_Police',
 'ORGANIZATION_TYPE_Postal',
 'ORGANIZATION_TYPE_Realtor',
 'ORGANIZATION_TYPE_Religion',
 'ORGANIZATION_TYPE_Restaurant',
 'ORGANIZATION_TYPE_School',
 'ORGANIZATION_TYPE_Security',
 'ORGANIZATION_TYPE_Security Ministries',
 'ORGANIZATION_TYPE_Self-employed',
 'ORGANIZATION_TYPE_Services',
 'ORGANIZATION_TYPE_Telecom',
 'ORGANIZATION_TYPE_Trade: type 1',
 'ORGANIZATION_TYPE_Trade: type 2',
 'ORGANIZATION_TYPE_Trade: type 3',
 'ORGANIZATION_TYPE_Trade: type 4',
 'ORGANIZATION_TYPE_Trade: type 5',
 'ORGANIZATION_TYPE_Trade: type 6',
 'ORGANIZATION_TYPE_Trade: type 7',
 'ORGANIZATION_TYPE_Transport: type 1',
 'ORGANIZATION_TYPE_Transport: type 2',
 'ORGANIZATION_TYPE_Transport: type 3',
 'ORGANIZATION_TYPE_Transport: type 4',
 'ORGANIZATION_TYPE_University',
 'ORGANIZATION_TYPE_XNA',
 'FONDKAPREMONT_MODE_not specified',
 'FONDKAPREMONT_MODE_org spec account',
 'FONDKAPREMONT_MODE_reg oper account',
 'FONDKAPREMONT_MODE_reg oper spec account',
 'HOUSETYPE_MODE_block of flats',
 'HOUSETYPE_MODE_specific housing',
 'HOUSETYPE_MODE_terraced house',
 'WALLSMATERIAL_MODE_Block',
 'WALLSMATERIAL_MODE_Mixed',
 'WALLSMATERIAL_MODE_Monolithic',
 'WALLSMATERIAL_MODE_Others',
 'WALLSMATERIAL_MODE_Panel',
 'WALLSMATERIAL_MODE_Stone, brick',
 'WALLSMATERIAL_MODE_Wooden',
 'EMERGENCYSTATE_MODE_No',
 'EMERGENCYSTATE_MODE_Yes',
 'TARGET',
 'DAYS_EMPLOYED_ANOM']
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
def impute_and_scaler(train, test):
    # Fit on the training data only, then apply the same transforms to both sets
    imputer = SimpleImputer(strategy='median')
    scaler = MinMaxScaler(feature_range=(0,1))

    imputer.fit(train)
    train = imputer.transform(train)
    test = imputer.transform(test)

    scaler.fit(train)
    train = scaler.transform(train)
    test = scaler.transform(test)

    return train, test
train = application_train.copy()
train_labels = train['TARGET']
train = train.drop(columns = ['TARGET'])
test = application_test.copy()
# Keep the column names for the feature-importance table, then impute and scale
features = list(train.columns)
train, test = impute_and_scaler(train, test)

logistic regression#

from sklearn.linear_model import LogisticRegression
log_regress_model = LogisticRegression(C=0.001)
log_regress_model.fit(train, train_labels)
LogisticRegression(C=0.001)
log_regress_model_pred = log_regress_model.predict_proba(test)
log_regress_model_pred = log_regress_model_pred[:, 1]

This gives the predicted probability of default.

submit = application_test[['SK_ID_CURR']]
submit['TARGET'] = log_regress_model_pred
submit.head()
SK_ID_CURR TARGET
0 100001 0.094349
1 100005 0.269978
2 100013 0.069365
3 100028 0.082479
4 100038 0.178499
submit.to_csv('data/log_regress_model_baseline.csv', index = False)

This submission scores 71%.

Improvement: random forest#

from sklearn.ensemble import RandomForestClassifier
random_forest_model = RandomForestClassifier(
    n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1
)
random_forest_model.fit(train, train_labels)
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 16 concurrent workers.
random_forest_model_pred = random_forest_model.predict_proba(test)[:, 1]
[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.1s
[Parallel(n_jobs=16)]: Done 100 out of 100 | elapsed:    0.3s finished
submit = application_test[['SK_ID_CURR']]
submit['TARGET'] = random_forest_model_pred
submit.to_csv('data/random_forest_baseline.csv', index=False)
submit
SK_ID_CURR TARGET
0 100001 0.15
1 100005 0.15
2 100013 0.09
3 100028 0.15
4 100038 0.21
... ... ...
48739 456221 0.12
48740 456222 0.16
48741 456223 0.22
48742 456224 0.14
48743 456250 0.19

48744 rows × 2 columns

Score: 68.5

Improvement: random forest with polynomial feature engineering#

poly_features_names = list(application_train_poly.columns)
application_train_poly.shape
(307511, 278)
imputer = SimpleImputer(strategy='median')
scaler = MinMaxScaler(feature_range=(0,1))

poly_features = imputer.fit_transform(application_train_poly)
poly_features_test = imputer.transform(application_test_poly)

poly_features = scaler.fit_transform(poly_features)
poly_features_test = scaler.transform(poly_features_test)
random_forest_poly = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)
random_forest_poly.fit(poly_features, train_labels)
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:   12.8s
random_forest_poly_pred = random_forest_poly.predict_proba(poly_features_test)[:, 1]
[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 100 out of 100 | elapsed:    0.0s finished
submit = application_test[['SK_ID_CURR']]
submit['TARGET'] = random_forest_poly_pred

submit.to_csv('random_forest_baseline_engineered.csv', index = False)

Score: 65. The polynomial feature engineering did not help.

Improvement: random forest with domain-knowledge feature engineering#

application_train_domain.columns
Index(['SK_ID_CURR', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE',
       'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION',
       ...
       'WALLSMATERIAL_MODE_Stone, brick', 'WALLSMATERIAL_MODE_Wooden',
       'EMERGENCYSTATE_MODE_No', 'EMERGENCYSTATE_MODE_Yes', 'TARGET',
       'DAYS_EMPLOYED_ANOM', 'CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT',
       'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT'],
      dtype='str', length=248)
application_train_domain = application_train_domain.drop(columns=['TARGET'])
imputer = SimpleImputer(strategy='median')
scaler = MinMaxScaler(feature_range=(0,1))

domain_features = imputer.fit_transform(application_train_domain)
domain_features_test = imputer.transform(application_test_domain)

domain_features = scaler.fit_transform(domain_features)
domain_features_test = scaler.transform(domain_features_test)
random_forest_domain = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)
random_forest_domain.fit(domain_features, train_labels)
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    8.0s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   29.8s finished
RandomForestClassifier(n_jobs=-1, random_state=50, verbose=1)
random_forest_domain_pred = random_forest_domain.predict_proba(domain_features_test)[:, 1]
[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.1s
[Parallel(n_jobs=16)]: Done 100 out of 100 | elapsed:    0.2s finished
submit = application_test[['SK_ID_CURR']]
submit['TARGET'] = random_forest_domain_pred

submit.to_csv('random_forest_baseline_domain_engineered.csv', index = False)

Score: 65. No real change.

random forest model interpretation#

  • Earlier we expected EXT_SOURCE and DAYS_BIRTH to be the most important features.

feature_importances = random_forest_model.feature_importances_
feature_importances = pd.DataFrame({
    'feature': features,
    'importance':feature_importances
    }
)
feature_importances.sort_values(by='importance', ascending=False)
feature importance
29 EXT_SOURCE_2 0.049029
30 EXT_SOURCE_3 0.046400
10 DAYS_ID_PUBLISH 0.031736
7 DAYS_BIRTH 0.031524
9 DAYS_REGISTRATION 0.031100
... ... ...
12 FLAG_MOBIL 0.000000
89 FLAG_DOCUMENT_12 0.000000
87 FLAG_DOCUMENT_10 0.000000
81 FLAG_DOCUMENT_4 0.000000
120 NAME_INCOME_TYPE_Businessman 0.000000

243 rows × 2 columns

feature_importances['importance'].sum()
np.float64(0.9999999999999997)

Let's plot them as a horizontal bar chart:

feature_importances_plot = feature_importances.sort_values(by='importance', ascending=False)[:15]
plt.figure(figsize=(12,8))
sns.barplot(
    data = feature_importances_plot,
    x = 'importance',
    y = 'feature'
)
plt.title('importance of features')
Text(0.5, 1.0, 'importance of features')
[Figure: horizontal bar chart of the 15 most important features]
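The `feature_importances_` values above are impurity-based and computed on the training data. A complementary check, not used in this notebook and named here only as an alternative, is permutation importance on held-out rows: shuffle one column at a time and measure how much the score drops. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic problem with 6 features, only some of which carry signal.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, n_informative=2, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Drop in validation ROC AUC when each column is shuffled, averaged over 5 repeats.
result = permutation_importance(
    forest, X_val, y_val, scoring='roc_auc', n_repeats=5, random_state=0
)
print(forest.feature_importances_)
print(result.importances_mean)
```

If the two rankings disagree strongly for a feature, that feature deserves a closer look before it is trusted as important.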

model#

Adaboost#

from sklearn.ensemble import AdaBoostClassifier
print(train.shape)
(307511, 243)
adaboost_model = AdaBoostClassifier(
    n_estimators=100
)
adaboost_model.fit(train, train_labels)
AdaBoostClassifier(n_estimators=100)
# The submission is scored on probabilities, so use predict_proba rather than predict
adaboost_model_pred = adaboost_model.predict_proba(test)[:, 1]
submit = application_test[['SK_ID_CURR']]
submit['TARGET'] = adaboost_model_pred

submit.to_csv('adaboost.csv', index = False)

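Score: 50, obtained by submitting the hard 0/1 labels from predict rather than probabilities, which most likely explains the near-chance result.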
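A score this close to 50 is what a ranking metric such as ROC AUC gives when every submitted value is the same. The sketch below, with made-up probabilities, compares the AUC of the probabilities with the AUC of the hard labels obtained by thresholding them at 0.5, which is roughly what `predict` does for a binary classifier.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])
# Made-up predicted probabilities: all below 0.5, as is common with rare positives.
probs = np.array([0.10, 0.20, 0.30, 0.40, 0.35, 0.45])
# Thresholding at 0.5 turns every prediction into class 0.
hard_labels = (probs >= 0.5).astype(int)

auc_probs = roc_auc_score(y_true, probs)         # 7 of 8 positive/negative pairs ranked correctly
auc_labels = roc_auc_score(y_true, hard_labels)  # all scores tied, so no ranking information
print(auc_probs, auc_labels)
```

The probabilities rank most defaulters above non-defaulters even though none crosses 0.5, while the thresholded labels throw that ranking away.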

adaboost_model.feature_importances_

Gradient boost#

train = pd.read_feather('checkpoints/01_train_app_base.feather')
test = pd.read_feather('checkpoints/01_test_app_base.feather')
train
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION ... WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone, brick WALLSMATERIAL_MODE_Wooden EMERGENCYSTATE_MODE_No EMERGENCYSTATE_MODE_Yes TARGET DAYS_EMPLOYED_ANOM
0 100002 0 202500.0 406597.5 24700.5 351000.0 0.018801 9461 -637.0 -3648.0 ... False False False False True False True False 1 False
1 100003 0 270000.0 1293502.5 35698.5 1129500.0 0.003541 16765 -1188.0 -1186.0 ... False False False False False False True False 0 False
2 100004 0 67500.0 135000.0 6750.0 135000.0 0.010032 19046 -225.0 -4260.0 ... False False False False False False False False 0 False
3 100006 0 135000.0 312682.5 29686.5 297000.0 0.008019 19005 -3039.0 -9833.0 ... False False False False False False False False 0 False
4 100007 0 121500.0 513000.0 21865.5 513000.0 0.028663 19932 -3038.0 -4311.0 ... False False False False False False False False 0 False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 456251 0 157500.0 254700.0 27558.0 225000.0 0.032561 9327 -236.0 -8456.0 ... False False False False True False True False 0 False
307507 456252 0 72000.0 269550.0 12001.5 225000.0 0.025164 20775 NaN -4388.0 ... False False False False True False True False 0 True
307508 456253 0 153000.0 677664.0 29979.0 585000.0 0.005002 14966 -7921.0 -6737.0 ... False False False True False False True False 0 False
307509 456254 0 171000.0 370107.0 20205.0 319500.0 0.005313 11961 -4786.0 -2562.0 ... False False False False True False True False 1 False
307510 456255 0 157500.0 675000.0 49117.5 675000.0 0.046220 16856 -1262.0 -5128.0 ... False False False True False False True False 0 False

307511 rows × 244 columns

train_labels = train['TARGET']
train_ids = train['SK_ID_CURR']
test_ids = test['SK_ID_CURR']
train_features = train.drop(columns=['TARGET', 'SK_ID_CURR'])
test_features = test.drop(columns=['SK_ID_CURR'])
train_features, test_features = impute_and_scaler(train_features, test_features)
print(train_features.shape, test_features.shape)
(307511, 242) (48744, 242)
from sklearn.ensemble import GradientBoostingClassifier
gradient_boost_model = GradientBoostingClassifier(
    n_estimators = 100,
    learning_rate = 0.3
)
gradient_boost_model.fit(train_features, train_labels)
GradientBoostingClassifier(learning_rate=0.3)

Check performance on the training set

  • ROC AUC

from sklearn.metrics import roc_auc_score
train_probs = gradient_boost_model.predict_proba(train_features)[:, 1]
train_auc = roc_auc_score(train_labels, train_probs)
train_auc
0.7695779581683545
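An AUC measured on the rows the model was trained on tends to be optimistic. A hold-out split gives a second number to compare against, without needing a submission. This is a sketch on synthetic data; the `_toy` names are placeholders and the resulting values say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (about 10% positives).
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)
# Stratify so the validation part keeps the same class balance.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

model = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.3, random_state=0
).fit(X_tr, y_tr)

train_auc_toy = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
val_auc_toy = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(train_auc_toy, val_auc_toy)
```

A large gap between the two values points to overfitting; a validation AUC that is fine while the submitted score is poor points instead to a difference between the training and test data.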
gradient_boost_model_pred = gradient_boost_model.predict_proba(test_features)[:, 1]
submit = pd.DataFrame({
    'SK_ID_CURR': test_ids
})
submit['TARGET'] = gradient_boost_model_pred

submit.to_csv('gradient_boost.csv', index = False)

Score: 52

TODO: why do random forest and boosting score so differently?

  • Random forest scores around 65

  • Boosting scores around 50

Possibly the outliers are too extreme.
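One check worth running for this TODO is whether train and test went through the same preprocessing. In the tables above, DAYS_BIRTH is positive in the training checkpoint (9461, 16765, 19046) but negative in poly_features_test (-19241, -18064, -20038). Whether the test checkpoint has the same mismatch is not shown, so the sketch below is only a diagnostic, not a diagnosis; `compare_medians` is a hypothetical helper and AMT_CREDIT carries placeholder values.

```python
import pandas as pd

def compare_medians(train_df, test_df):
    """Per-column medians of train and test, with a flag where their signs differ."""
    shared = [c for c in train_df.columns if c in test_df.columns]
    report = pd.DataFrame({
        'train_median': train_df[shared].median(),
        'test_median': test_df[shared].median(),
    })
    report['sign_differs'] = (report['train_median'] * report['test_median']) < 0
    return report

# DAYS_BIRTH values copied from the tables above; AMT_CREDIT values are placeholders.
train_toy = pd.DataFrame({
    'DAYS_BIRTH': [9461, 16765, 19046],
    'AMT_CREDIT': [1.0, 2.0, 3.0],
    'TARGET': [1, 0, 0],
})
test_toy = pd.DataFrame({
    'DAYS_BIRTH': [-19241, -18064, -20038],
    'AMT_CREDIT': [1.5, 2.5, 3.5],
})
report = compare_medians(train_toy, test_toy)
print(report)
```

Any column flagged here would be transformed inconsistently by a scaler fitted on train, which can hurt test predictions even when the training AUC looks reasonable.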