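Tuning Automated Feature Engineering#

In automated feature engineering, the column types of our DataFrames are all inferred by woodwork, so the process is almost fully automatic.

To tune it, we need to introduce more manual settings on top of the inferred column types:

Time series
Custom primitives
Type correction

Imports#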

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import featuretools as ft
import woodwork as ww
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import NaturalLanguage, Datetime,Boolean
from featuretools.primitives import AggregationPrimitive, TransformPrimitive
from featuretools.tests.testing_utils import make_ecommerce_entityset
import warnings
warnings.filterwarnings('ignore')
import gc
gc.enable()

print(f'ft: {ft.__version__},  ww: {ww.__version__}')

c:\Users\63517\miniconda3\envs\data-analysis\lib\site-packages\woodwork\__init__.py:2: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources

ft: 1.31.0,  ww: 0.31.0

application_train = pd.read_csv('data/application_train.csv')
application_test = pd.read_csv('data/application_test.csv')
bureau = pd.read_csv('data/bureau.csv')
bureau_balance = pd.read_csv('data/bureau_balance.csv')
credit_card_balance = pd.read_csv('data/credit_card_balance.csv')
installments_payments = pd.read_csv('data/installments_payments.csv')
previous_application = pd.read_csv('data/previous_application.csv')
pos_cash_balance = pd.read_csv('data/POS_CASH_balance.csv')

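To validate the processing pipeline quickly, first work on a sample: uncomment the lines below to keep only the first 1,000 rows of each table.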

# application_train = application_train.iloc[:1000, :]
# application_test = application_test.iloc[:1000, :]
# bureau = bureau.iloc[:1000, :]
# bureau_balance = bureau_balance.iloc[:1000, :]
# credit_card_balance = credit_card_balance.iloc[:1000, :]
# installments_payments = installments_payments.iloc[:1000, :]
# previous_application = previous_application.iloc[:1000, :]
# pos_cash_balance = pos_cash_balance.iloc[:1000, :]

application_train['set'] = 'train'
application_test['set'] = 'test'
application_test['TARGET'] = np.nan
print(application_train.shape, application_test.shape)
app = pd.concat([application_train, application_test], ignore_index=True)

app_target = app[['SK_ID_CURR', 'TARGET']]
app_set = app[['SK_ID_CURR', 'set']]
app = app.drop(columns=['set'])

(307511, 123) (48744, 123)

application_train.dtypes.unique()

array([dtype('int64'), dtype('O'), dtype('float64')], dtype=object)

ww.logical_types

<module 'woodwork.logical_types' from 'c:\\Users\\63517\\miniconda3\\envs\\data-analysis\\lib\\site-packages\\woodwork\\logical_types.py'>

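Understanding woodwork#

Physical type: how the data is stored (the pandas dtype)
Logical type: how the data should be interpreted
Semantic tags: additional meaning attached to the data

A logical type is required for every column; semantic tags are optional.

woodwork relies on the pandas accessor mechanism, which is an extension interface. The ww accessor is registered when woodwork is imported, and importing featuretools does that for us.

When woodwork initializes a DataFrame or Series, it assigns logical types and semantic tags to it.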

Semantic tags#

ww.list_semantic_tags()

	name	is_standard_tag	valid_logical_types
0	numeric	True	[Age, AgeFractional, AgeNullable, Double, Inte...
1	category	True	[Categorical, CountryCode, CurrencyCode, Ordin...
2	index	False	Any LogicalType
3	time_index	False	[Datetime, Age, AgeFractional, AgeNullable, Do...
4	date_of_birth	False	[Datetime]
5	ignore	False	Any LogicalType
6	passthrough	False	Any LogicalType

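numeric, category: standard semantic tags, each tied to specific logical types
index, time_index: tags woodwork adds to index columns to mark their role
date_of_birth: indicates that the column should be interpreted as a date of birth
ignore, passthrough: columns that should be ignored during the featuretools process

We should add extra tags where they help interpretation.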

Logical types#

ww.list_logical_types()

	name	type_string	description	physical_type	standard_tags	is_default_type	is_registered	parent_type
0	Address	address	Represents Logical Types that contain address ...	string	{}	True	True	None
1	Age	age	Represents Logical Types that contain whole nu...	int64	{numeric}	True	True	Integer
2	AgeFractional	age_fractional	Represents Logical Types that contain non-nega...	float64	{numeric}	True	True	Double
3	AgeNullable	age_nullable	Represents Logical Types that contain whole nu...	Int64	{numeric}	True	True	IntegerNullable
4	Boolean	boolean	Represents Logical Types that contain binary v...	bool	{}	True	True	BooleanNullable
5	BooleanNullable	boolean_nullable	Represents Logical Types that contain binary v...	boolean	{}	True	True	None
6	Categorical	categorical	Represents Logical Types that contain unordere...	category	{category}	True	True	None
7	CountryCode	country_code	Represents Logical Types that use the ISO-3166...	category	{category}	True	True	Categorical
8	CurrencyCode	currency_code	Represents Logical Types that use the ISO-4217...	category	{category}	True	True	Categorical
9	Datetime	datetime	Represents Logical Types that contain date and...	datetime64[ns]	{}	True	True	None
10	Double	double	Represents Logical Types that contain positive...	float64	{numeric}	True	True	None
11	EmailAddress	email_address	Represents Logical Types that contain email ad...	string	{}	True	True	Unknown
12	Filepath	filepath	Represents Logical Types that specify location...	string	{}	True	True	None
13	IPAddress	ip_address	Represents Logical Types that contain IP addre...	string	{}	True	True	Unknown
14	Integer	integer	Represents Logical Types that contain positive...	int64	{numeric}	True	True	IntegerNullable
15	IntegerNullable	integer_nullable	Represents Logical Types that contain positive...	Int64	{numeric}	True	True	None
16	LatLong	lat_long	Represents Logical Types that contain latitude...	object	{}	True	True	None
17	NaturalLanguage	natural_language	Represents Logical Types that contain text or ...	string	{}	True	True	None
18	Ordinal	ordinal	Represents Logical Types that contain ordered ...	category	{category}	True	True	Categorical
19	PersonFullName	person_full_name	Represents Logical Types that may contain firs...	string	{}	True	True	None
20	PhoneNumber	phone_number	Represents Logical Types that contain numeric ...	string	{}	True	True	Unknown
21	PostalCode	postal_code	Represents Logical Types that contain a series...	category	{category}	True	True	Categorical
22	SubRegionCode	sub_region_code	Represents Logical Types that use the ISO-3166...	category	{category}	True	True	Categorical
23	Timedelta	timedelta	Represents Logical Types that contain values s...	timedelta64[ns]	{}	True	True	Unknown
24	URL	url	Represents Logical Types that contain URLs, wh...	string	{}	True	True	Unknown
25	Unknown	unknown	Represents Logical Types that cannot be inferr...	string	{}	True	True	None

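The Unknown type#

When woodwork's type inference does not succeed, the column is typed as Unknown. We can then set the type manually.

In the example below, the country codes are not recognized and are typed as Unknown; we set CountryCode by hand.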

s = pd.Series(['AU', 'US', 'UA'])
unknown_series = ww.init_series(s)
unknown_series.ww

<Series: None (Physical Type = string) (Logical Type = Unknown) (Semantic Tags = set())>

countrycode_series = ww.init_series(unknown_series, 'CountryCode')
countrycode_series.ww

<Series: None (Physical Type = category) (Logical Type = CountryCode) (Semantic Tags = {'category'})>

key

IntegerNullable#

Marks an integer column that may contain missing values, so it needs careful handling.

series = pd.Series([1, 2, None, 4], dtype="Int64")
intn_series = ww.init_series(series)
intn_series.ww

<Series: None (Physical Type = Int64) (Logical Type = IntegerNullable) (Semantic Tags = {'numeric'})>
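The reason a separate nullable type exists can be seen with pandas alone. The sketch below uses made-up values: without an explicit dtype, a missing value forces the column to float64, whereas the nullable Int64 dtype keeps the integers intact.

```python
import pandas as pd

raw = [1, 2, None, 4]

# NumPy-backed default: None becomes NaN and the column is upcast to float64.
default_series = pd.Series(raw)

# pandas nullable integer: values stay integers, the gap is stored as <NA>.
nullable_series = pd.Series(raw, dtype="Int64")

print(default_series.dtype)
print(nullable_series.dtype)

# Reductions skip the missing value by default.
total = int(nullable_series.sum())
missing = int(nullable_series.isna().sum())
print(total, missing)
```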

Ordinal: ordered categories#

Ratings, rankings, and the like.

Setting types with ww#

Starting from the automatically inferred types, we make fine adjustments.

application#

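Label-like columns such as set and TARGET are handled manually, i.e. they are kept out of the featuretools process. Note that only set has been dropped so far: TARGET is still a column of app, so it has to be excluded later, for example through the ignore_columns argument of ft.dfs.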

app.ww.init(name= 'app', index='SK_ID_CURR')

app.ww.name

'app'

app.ww.schema

	Logical Type	Semantic Tag(s)
Column
SK_ID_CURR	Integer	['index']
TARGET	IntegerNullable	['numeric']
NAME_CONTRACT_TYPE	Categorical	['category']
CODE_GENDER	Categorical	['category']
FLAG_OWN_CAR	Boolean	[]
FLAG_OWN_REALTY	Boolean	[]
CNT_CHILDREN	Integer	['numeric']
AMT_INCOME_TOTAL	Double	['numeric']
AMT_CREDIT	Double	['numeric']
AMT_ANNUITY	Double	['numeric']
AMT_GOODS_PRICE	Double	['numeric']
NAME_TYPE_SUITE	Categorical	['category']
NAME_INCOME_TYPE	Categorical	['category']
NAME_EDUCATION_TYPE	Categorical	['category']
NAME_FAMILY_STATUS	Categorical	['category']
NAME_HOUSING_TYPE	Categorical	['category']
REGION_POPULATION_RELATIVE	Double	['numeric']
DAYS_BIRTH	Integer	['numeric']
DAYS_EMPLOYED	Integer	['numeric']
DAYS_REGISTRATION	Double	['numeric']
DAYS_ID_PUBLISH	Integer	['numeric']
OWN_CAR_AGE	IntegerNullable	['numeric']
FLAG_MOBIL	Integer	['numeric']
FLAG_EMP_PHONE	Integer	['numeric']
FLAG_WORK_PHONE	Integer	['numeric']
FLAG_CONT_MOBILE	Integer	['numeric']
FLAG_PHONE	Integer	['numeric']
FLAG_EMAIL	Integer	['numeric']
OCCUPATION_TYPE	Categorical	['category']
CNT_FAM_MEMBERS	Double	['numeric']
REGION_RATING_CLIENT	Integer	['numeric']
REGION_RATING_CLIENT_W_CITY	Integer	['numeric']
WEEKDAY_APPR_PROCESS_START	Categorical	['category']
HOUR_APPR_PROCESS_START	Integer	['numeric']
REG_REGION_NOT_LIVE_REGION	Integer	['numeric']
REG_REGION_NOT_WORK_REGION	Integer	['numeric']
LIVE_REGION_NOT_WORK_REGION	Integer	['numeric']
REG_CITY_NOT_LIVE_CITY	Integer	['numeric']
REG_CITY_NOT_WORK_CITY	Integer	['numeric']
LIVE_CITY_NOT_WORK_CITY	Integer	['numeric']
ORGANIZATION_TYPE	Categorical	['category']
EXT_SOURCE_1	Double	['numeric']
EXT_SOURCE_2	Double	['numeric']
EXT_SOURCE_3	Double	['numeric']
APARTMENTS_AVG	Double	['numeric']
BASEMENTAREA_AVG	Double	['numeric']
YEARS_BEGINEXPLUATATION_AVG	Double	['numeric']
YEARS_BUILD_AVG	Double	['numeric']
COMMONAREA_AVG	Double	['numeric']
ELEVATORS_AVG	Double	['numeric']
ENTRANCES_AVG	Double	['numeric']
FLOORSMAX_AVG	Double	['numeric']
FLOORSMIN_AVG	Double	['numeric']
LANDAREA_AVG	Double	['numeric']
LIVINGAPARTMENTS_AVG	Double	['numeric']
LIVINGAREA_AVG	Double	['numeric']
NONLIVINGAPARTMENTS_AVG	Double	['numeric']
NONLIVINGAREA_AVG	Double	['numeric']
APARTMENTS_MODE	Double	['numeric']
BASEMENTAREA_MODE	Double	['numeric']
YEARS_BEGINEXPLUATATION_MODE	Double	['numeric']
YEARS_BUILD_MODE	Double	['numeric']
COMMONAREA_MODE	Double	['numeric']
ELEVATORS_MODE	Double	['numeric']
ENTRANCES_MODE	Double	['numeric']
FLOORSMAX_MODE	Double	['numeric']
FLOORSMIN_MODE	Double	['numeric']
LANDAREA_MODE	Double	['numeric']
LIVINGAPARTMENTS_MODE	Double	['numeric']
LIVINGAREA_MODE	Double	['numeric']
NONLIVINGAPARTMENTS_MODE	Double	['numeric']
NONLIVINGAREA_MODE	Double	['numeric']
APARTMENTS_MEDI	Double	['numeric']
BASEMENTAREA_MEDI	Double	['numeric']
YEARS_BEGINEXPLUATATION_MEDI	Double	['numeric']
YEARS_BUILD_MEDI	Double	['numeric']
COMMONAREA_MEDI	Double	['numeric']
ELEVATORS_MEDI	Double	['numeric']
ENTRANCES_MEDI	Double	['numeric']
FLOORSMAX_MEDI	Double	['numeric']
FLOORSMIN_MEDI	Double	['numeric']
LANDAREA_MEDI	Double	['numeric']
LIVINGAPARTMENTS_MEDI	Double	['numeric']
LIVINGAREA_MEDI	Double	['numeric']
NONLIVINGAPARTMENTS_MEDI	Double	['numeric']
NONLIVINGAREA_MEDI	Double	['numeric']
FONDKAPREMONT_MODE	Categorical	['category']
HOUSETYPE_MODE	Categorical	['category']
TOTALAREA_MODE	Double	['numeric']
WALLSMATERIAL_MODE	Categorical	['category']
EMERGENCYSTATE_MODE	BooleanNullable	[]
OBS_30_CNT_SOCIAL_CIRCLE	IntegerNullable	['numeric']
DEF_30_CNT_SOCIAL_CIRCLE	IntegerNullable	['numeric']
OBS_60_CNT_SOCIAL_CIRCLE	IntegerNullable	['numeric']
DEF_60_CNT_SOCIAL_CIRCLE	IntegerNullable	['numeric']
DAYS_LAST_PHONE_CHANGE	IntegerNullable	['numeric']
FLAG_DOCUMENT_2	Integer	['numeric']
FLAG_DOCUMENT_3	Integer	['numeric']
FLAG_DOCUMENT_4	Integer	['numeric']
FLAG_DOCUMENT_5	Integer	['numeric']
FLAG_DOCUMENT_6	Integer	['numeric']
FLAG_DOCUMENT_7	Integer	['numeric']
FLAG_DOCUMENT_8	Integer	['numeric']
FLAG_DOCUMENT_9	Integer	['numeric']
FLAG_DOCUMENT_10	Integer	['numeric']
FLAG_DOCUMENT_11	Integer	['numeric']
FLAG_DOCUMENT_12	Integer	['numeric']
FLAG_DOCUMENT_13	Integer	['numeric']
FLAG_DOCUMENT_14	Integer	['numeric']
FLAG_DOCUMENT_15	Integer	['numeric']
FLAG_DOCUMENT_16	Integer	['numeric']
FLAG_DOCUMENT_17	Integer	['numeric']
FLAG_DOCUMENT_18	Integer	['numeric']
FLAG_DOCUMENT_19	Integer	['numeric']
FLAG_DOCUMENT_20	Integer	['numeric']
FLAG_DOCUMENT_21	Integer	['numeric']
AMT_REQ_CREDIT_BUREAU_HOUR	IntegerNullable	['numeric']
AMT_REQ_CREDIT_BUREAU_DAY	IntegerNullable	['numeric']
AMT_REQ_CREDIT_BUREAU_WEEK	IntegerNullable	['numeric']
AMT_REQ_CREDIT_BUREAU_MON	IntegerNullable	['numeric']
AMT_REQ_CREDIT_BUREAU_QRT	IntegerNullable	['numeric']
AMT_REQ_CREDIT_BUREAU_YEAR	IntegerNullable	['numeric']

Set the FLAG_* and *_NOT_* indicator columns to Boolean.

FLAG_DOCUMENTS = { f'FLAG_DOCUMENT_{i}':'Boolean' for i in range(2, 22)}
app.ww.set_types(
    logical_types = {
        'FLAG_MOBIL': 'Boolean',
        'FLAG_EMP_PHONE': 'Boolean',
        'FLAG_WORK_PHONE': 'Boolean',
        'FLAG_CONT_MOBILE': 'Boolean',
        'FLAG_EMAIL': 'Boolean',
        'FLAG_PHONE': 'Boolean',
        'REG_CITY_NOT_LIVE_CITY': 'Boolean',
        'REG_CITY_NOT_WORK_CITY': 'Boolean',
        'LIVE_CITY_NOT_WORK_CITY': 'Boolean',
        **FLAG_DOCUMENTS
    }
)
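Before converting integer columns to Boolean it is worth checking that they really hold only 0 and 1. The helper below is a hypothetical convenience written for this note (it is not part of woodwork or featuretools) and is shown on a toy frame.

```python
import pandas as pd

def binary_int_columns(df):
    """Return the integer columns whose non-null values are all 0 or 1."""
    result = []
    for col in df.columns:
        values = df[col].dropna()
        if pd.api.types.is_integer_dtype(values) and values.isin([0, 1]).all():
            result.append(col)
    return result

# Hypothetical toy frame: two flag columns, one rating, one text column.
toy = pd.DataFrame({
    "FLAG_A": [0, 1, 1],
    "RATING": [1, 2, 3],
    "FLAG_B": [0, 0, 0],
    "NAME": ["x", "y", "z"],
})

candidates = binary_int_columns(toy)
print(candidates)
```

Running the same helper on app would list every column that is safe to pass to set_types as Boolean.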

Ratings

Anomalous values can be replaced with np.nan ourselves, because featuretools does not deal with them on its own.

app['REGION_RATING_CLIENT'].unique()

array([2, 1, 3], dtype=int64)

app['REGION_RATING_CLIENT_W_CITY'].unique()

array([ 2,  1,  3, -1], dtype=int64)

app['REGION_RATING_CLIENT_W_CITY'][app['REGION_RATING_CLIENT_W_CITY'] == -1]

224393   -1
Name: REGION_RATING_CLIENT_W_CITY, dtype: int64

app['REGION_RATING_CLIENT_W_CITY'] = app['REGION_RATING_CLIENT_W_CITY'].replace(-1, np.nan)
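The effect of this replacement is easy to see on a toy Series (made-up values): the sentinel disappears, and because NaN cannot be stored in an int64 column the dtype becomes float64.

```python
import numpy as np
import pandas as pd

# Hypothetical rating column where -1 is an invalid code.
ratings = pd.Series([2, 1, 3, -1, 2], name="rating")

cleaned = ratings.replace(-1, np.nan)

print(cleaned.dtype)
n_missing = int(cleaned.isna().sum())
remaining = sorted(cleaned.dropna().unique())
print(n_missing, remaining)
```

The remaining values compare equal to 1, 2 and 3, which is why they can still be matched against an integer order list afterwards.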

app.ww.set_types(
    logical_types = {
        'REGION_RATING_CLIENT': ww.logical_types.Ordinal(order=[1,2,3]),
        'REGION_RATING_CLIENT_W_CITY': ww.logical_types.Ordinal(order=[1,2,3]),
    }
)

The hour of day should be treated as categorical.

app.ww.set_types(
    logical_types = {
        'HOUR_APPR_PROCESS_START': 'Categorical',
    }
)

bureau#

bureau.ww.init(name= 'bureau', index='SK_ID_BUREAU')

bureau.ww.schema

	Logical Type	Semantic Tag(s)
Column
SK_ID_CURR	Integer	['numeric']
SK_ID_BUREAU	Integer	['index']
CREDIT_ACTIVE	Categorical	['category']
CREDIT_CURRENCY	Categorical	['category']
DAYS_CREDIT	Integer	['numeric']
CREDIT_DAY_OVERDUE	Integer	['numeric']
DAYS_CREDIT_ENDDATE	IntegerNullable	['numeric']
DAYS_ENDDATE_FACT	IntegerNullable	['numeric']
AMT_CREDIT_MAX_OVERDUE	Double	['numeric']
CNT_CREDIT_PROLONG	Integer	['numeric']
AMT_CREDIT_SUM	Double	['numeric']
AMT_CREDIT_SUM_DEBT	Double	['numeric']
AMT_CREDIT_SUM_LIMIT	Double	['numeric']
AMT_CREDIT_SUM_OVERDUE	Double	['numeric']
CREDIT_TYPE	Categorical	['category']
DAYS_CREDIT_UPDATE	Integer	['numeric']
AMT_ANNUITY	Double	['numeric']

ID columns do not take part in feature generation.

bureau.ww.set_types(semantic_tags={'SK_ID_CURR':'ignore'})

bureau_balance = bureau_balance.reset_index().rename(columns = {'index':'bureaubalance_index'})
bureau_balance.ww.init(name='bureau_balance', index='bureaubalance_index')

bureau_balance.ww.schema

	Logical Type	Semantic Tag(s)
Column
bureaubalance_index	Integer	['index']
SK_ID_BUREAU	Integer	['numeric']
MONTHS_BALANCE	Integer	['numeric']
STATUS	Categorical	['category']

bureau_balance.ww.set_types(semantic_tags={'SK_ID_BUREAU':'ignore'})

previous#

previous_application.ww.init(name='previous', index='SK_ID_PREV')

previous_application.ww.schema

	Logical Type	Semantic Tag(s)
Column
SK_ID_PREV	Integer	['index']
SK_ID_CURR	Integer	['numeric']
NAME_CONTRACT_TYPE	Categorical	['category']
AMT_ANNUITY	Double	['numeric']
AMT_APPLICATION	Double	['numeric']
AMT_CREDIT	Double	['numeric']
AMT_DOWN_PAYMENT	Double	['numeric']
AMT_GOODS_PRICE	Double	['numeric']
WEEKDAY_APPR_PROCESS_START	Categorical	['category']
HOUR_APPR_PROCESS_START	Integer	['numeric']
FLAG_LAST_APPL_PER_CONTRACT	Boolean	[]
NFLAG_LAST_APPL_IN_DAY	Integer	['numeric']
RATE_DOWN_PAYMENT	Double	['numeric']
RATE_INTEREST_PRIMARY	Double	['numeric']
RATE_INTEREST_PRIVILEGED	Double	['numeric']
NAME_CASH_LOAN_PURPOSE	Categorical	['category']
NAME_CONTRACT_STATUS	Categorical	['category']
DAYS_DECISION	Integer	['numeric']
NAME_PAYMENT_TYPE	Categorical	['category']
CODE_REJECT_REASON	Categorical	['category']
NAME_TYPE_SUITE	Categorical	['category']
NAME_CLIENT_TYPE	Categorical	['category']
NAME_GOODS_CATEGORY	Categorical	['category']
NAME_PORTFOLIO	Categorical	['category']
NAME_PRODUCT_TYPE	Categorical	['category']
CHANNEL_TYPE	Categorical	['category']
SELLERPLACE_AREA	Integer	['numeric']
NAME_SELLER_INDUSTRY	Categorical	['category']
CNT_PAYMENT	IntegerNullable	['numeric']
NAME_YIELD_GROUP	Categorical	['category']
PRODUCT_COMBINATION	Categorical	['category']
DAYS_FIRST_DRAWING	IntegerNullable	['numeric']
DAYS_FIRST_DUE	IntegerNullable	['numeric']
DAYS_LAST_DUE_1ST_VERSION	IntegerNullable	['numeric']
DAYS_LAST_DUE	IntegerNullable	['numeric']
DAYS_TERMINATION	IntegerNullable	['numeric']
NFLAG_INSURED_ON_APPROVAL	IntegerNullable	['numeric']

previous_application.ww.set_types(semantic_tags={'SK_ID_CURR':'ignore'})

previous_application.ww.set_types(
    logical_types = {
        'HOUR_APPR_PROCESS_START': 'Categorical',
    }
)

previous_application['NFLAG_LAST_APPL_IN_DAY'].unique()

array([1, 0], dtype=int64)

previous_application['NFLAG_INSURED_ON_APPROVAL'].unique()

<IntegerArray>
[0, 1, <NA>]
Length: 3, dtype: Int64

previous_application['NFLAG_INSURED_ON_APPROVAL'].isnull().sum()

The proportion of NaN is too large, so NFLAG_INSURED_ON_APPROVAL should not be set to Boolean; we use Categorical instead.

previous_application.ww.set_types(
    logical_types = {
        'NFLAG_LAST_APPL_IN_DAY': 'Boolean',
        'NFLAG_INSURED_ON_APPROVAL': 'Categorical'
        }
)
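A rough sketch of the check behind that decision, on made-up data. The plain Boolean logical type is backed by the bool dtype, which cannot hold missing values, so a flag with many gaps needs another representation such as a category (BooleanNullable, listed in the logical type table, would be another option).

```python
import pandas as pd

# Hypothetical 0/1 flag in which half of the values are missing.
flag = pd.Series([0, 1, None, None, 1, None], dtype="Int64", name="flag")

# Share of missing values.
missing_ratio = float(flag.isna().mean())
print(missing_ratio)

# As a category, the missing entries stay missing and 0/1 become the categories.
as_category = flag.astype("category")
categories = [int(c) for c in as_category.cat.categories]
n_missing = int(as_category.isna().sum())
print(categories, n_missing)
```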

credit_card_balance = credit_card_balance.reset_index().rename(columns = {'index':'credit_index'})
credit_card_balance.ww.init(name='credit', index = 'credit_index')

credit_card_balance.ww.schema

	Logical Type	Semantic Tag(s)
Column
credit_index	Integer	['index']
SK_ID_PREV	Integer	['numeric']
SK_ID_CURR	Integer	['numeric']
MONTHS_BALANCE	Integer	['numeric']
AMT_BALANCE	Double	['numeric']
AMT_CREDIT_LIMIT_ACTUAL	Integer	['numeric']
AMT_DRAWINGS_ATM_CURRENT	Double	['numeric']
AMT_DRAWINGS_CURRENT	Double	['numeric']
AMT_DRAWINGS_OTHER_CURRENT	Double	['numeric']
AMT_DRAWINGS_POS_CURRENT	Double	['numeric']
AMT_INST_MIN_REGULARITY	Double	['numeric']
AMT_PAYMENT_CURRENT	Double	['numeric']
AMT_PAYMENT_TOTAL_CURRENT	Double	['numeric']
AMT_RECEIVABLE_PRINCIPAL	Double	['numeric']
AMT_RECIVABLE	Double	['numeric']
AMT_TOTAL_RECEIVABLE	Double	['numeric']
CNT_DRAWINGS_ATM_CURRENT	IntegerNullable	['numeric']
CNT_DRAWINGS_CURRENT	Integer	['numeric']
CNT_DRAWINGS_OTHER_CURRENT	IntegerNullable	['numeric']
CNT_DRAWINGS_POS_CURRENT	IntegerNullable	['numeric']
CNT_INSTALMENT_MATURE_CUM	IntegerNullable	['numeric']
NAME_CONTRACT_STATUS	Categorical	['category']
SK_DPD	Integer	['numeric']
SK_DPD_DEF	Integer	['numeric']

credit_card_balance.ww.set_types(semantic_tags={'SK_ID_CURR':'ignore', 'SK_ID_PREV':'ignore'})

installments_payments = installments_payments.reset_index().rename(columns = {'index':'installments_index'})

installments_payments.ww.init(name = 'installments', index='installments_index')

installments_payments.ww.schema

	Logical Type	Semantic Tag(s)
Column
installments_index	Integer	['index']
SK_ID_PREV	Integer	['numeric']
SK_ID_CURR	Integer	['numeric']
NUM_INSTALMENT_VERSION	Double	['numeric']
NUM_INSTALMENT_NUMBER	Integer	['numeric']
DAYS_INSTALMENT	Double	['numeric']
DAYS_ENTRY_PAYMENT	IntegerNullable	['numeric']
AMT_INSTALMENT	Double	['numeric']
AMT_PAYMENT	Double	['numeric']

installments_payments.ww.set_types(semantic_tags={'SK_ID_CURR':'ignore', 'SK_ID_PREV':'ignore'})

Change NUM_INSTALMENT_VERSION to an integer type; this also affects the feature matrix generated later.

installments_payments.ww.set_types(
    logical_types = {
        'NUM_INSTALMENT_VERSION': 'Integer',
        'DAYS_INSTALMENT': 'Integer'
    }
)

installments_payments['DAYS_INSTALMENT'].isnull().sum()

installments_payments['NUM_INSTALMENT_VERSION'].isnull().sum()
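The two isnull checks matter because the plain Integer type cannot hold missing values. A sketch of a slightly stricter pre-check follows; can_be_integer is a hypothetical helper written for this note, and the sample values are made up.

```python
import pandas as pd

def can_be_integer(series):
    """True when the column has no missing values and every value is a whole number."""
    return bool(series.notna().all() and (series % 1 == 0).all())

whole = pd.Series([1.0, 2.0, 0.0])
with_gap = pd.Series([1.0, None, 2.0])
fractional = pd.Series([1.0, 2.5])

print(can_be_integer(whole))       # safe to type as Integer
print(can_be_integer(with_gap))    # would need IntegerNullable instead
print(can_be_integer(fractional))  # should stay Double
```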

pos_cash_balance = pos_cash_balance.reset_index().rename(columns = {'index':'cash_index'})
pos_cash_balance.ww.init(name='cash', index='cash_index')

pos_cash_balance.ww.schema

	Logical Type	Semantic Tag(s)
Column
cash_index	Integer	['index']
SK_ID_PREV	Integer	['numeric']
SK_ID_CURR	Integer	['numeric']
MONTHS_BALANCE	Integer	['numeric']
CNT_INSTALMENT	IntegerNullable	['numeric']
CNT_INSTALMENT_FUTURE	IntegerNullable	['numeric']
NAME_CONTRACT_STATUS	Categorical	['category']
SK_DPD	Integer	['numeric']
SK_DPD_DEF	Integer	['numeric']

pos_cash_balance.ww.set_types(semantic_tags={'SK_ID_CURR':'ignore', 'SK_ID_PREV':'ignore'})

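Building the EntitySet#

featuretools has a few requirements for DataFrames that are already initialized with woodwork:

An index is set
A name is set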

es = ft.EntitySet(id='clients')

# Tables that already have a unique primary-key column
es = es.add_dataframe(dataframe=app)
es = es.add_dataframe(dataframe=bureau)
es = es.add_dataframe(dataframe=previous_application)

# Tables without a natural primary key: the index column was created above with reset_index()
es = es.add_dataframe(dataframe=bureau_balance)
es = es.add_dataframe(dataframe=credit_card_balance)
es = es.add_dataframe(dataframe=installments_payments)
es = es.add_dataframe(dataframe=pos_cash_balance)

# Arguments: parent dataframe name, parent column, child dataframe name, child column
es = es.add_relationship("app", "SK_ID_CURR", "bureau", "SK_ID_CURR")
es = es.add_relationship("bureau", "SK_ID_BUREAU", "bureau_balance", "SK_ID_BUREAU")

es = es.add_relationship("app", "SK_ID_CURR", "previous", "SK_ID_CURR")
es = es.add_relationship("previous", "SK_ID_PREV", "cash", "SK_ID_PREV")
es = es.add_relationship("previous", "SK_ID_PREV", "installments", "SK_ID_PREV")
es = es.add_relationship("previous", "SK_ID_PREV", "credit", "SK_ID_PREV")

Once the relationships are built, the linking columns of the child tables carry a foreign_key semantic tag.

es['app'].ww

	Physical Type	Logical Type	Semantic Tag(s)
Column
SK_ID_CURR	int64	Integer	['index']
TARGET	Int64	IntegerNullable	['numeric']
NAME_CONTRACT_TYPE	category	Categorical	['category']
CODE_GENDER	category	Categorical	['category']
FLAG_OWN_CAR	bool	Boolean	[]
FLAG_OWN_REALTY	bool	Boolean	[]
CNT_CHILDREN	int64	Integer	['numeric']
AMT_INCOME_TOTAL	float64	Double	['numeric']
AMT_CREDIT	float64	Double	['numeric']
AMT_ANNUITY	float64	Double	['numeric']
AMT_GOODS_PRICE	float64	Double	['numeric']
NAME_TYPE_SUITE	category	Categorical	['category']
NAME_INCOME_TYPE	category	Categorical	['category']
NAME_EDUCATION_TYPE	category	Categorical	['category']
NAME_FAMILY_STATUS	category	Categorical	['category']
NAME_HOUSING_TYPE	category	Categorical	['category']
REGION_POPULATION_RELATIVE	float64	Double	['numeric']
DAYS_BIRTH	int64	Integer	['numeric']
DAYS_EMPLOYED	int64	Integer	['numeric']
DAYS_REGISTRATION	float64	Double	['numeric']
DAYS_ID_PUBLISH	int64	Integer	['numeric']
OWN_CAR_AGE	Int64	IntegerNullable	['numeric']
FLAG_MOBIL	bool	Boolean	[]
FLAG_EMP_PHONE	bool	Boolean	[]
FLAG_WORK_PHONE	bool	Boolean	[]
FLAG_CONT_MOBILE	bool	Boolean	[]
FLAG_PHONE	bool	Boolean	[]
FLAG_EMAIL	bool	Boolean	[]
OCCUPATION_TYPE	category	Categorical	['category']
CNT_FAM_MEMBERS	float64	Double	['numeric']
REGION_RATING_CLIENT	category	Ordinal: [1, 2, 3]	['category']
REGION_RATING_CLIENT_W_CITY	category	Ordinal: [1, 2, 3]	['category']
WEEKDAY_APPR_PROCESS_START	category	Categorical	['category']
HOUR_APPR_PROCESS_START	category	Categorical	['category']
REG_REGION_NOT_LIVE_REGION	int64	Integer	['numeric']
REG_REGION_NOT_WORK_REGION	int64	Integer	['numeric']
LIVE_REGION_NOT_WORK_REGION	int64	Integer	['numeric']
REG_CITY_NOT_LIVE_CITY	bool	Boolean	[]
REG_CITY_NOT_WORK_CITY	bool	Boolean	[]
LIVE_CITY_NOT_WORK_CITY	bool	Boolean	[]
ORGANIZATION_TYPE	category	Categorical	['category']
EXT_SOURCE_1	float64	Double	['numeric']
EXT_SOURCE_2	float64	Double	['numeric']
EXT_SOURCE_3	float64	Double	['numeric']
APARTMENTS_AVG	float64	Double	['numeric']
BASEMENTAREA_AVG	float64	Double	['numeric']
YEARS_BEGINEXPLUATATION_AVG	float64	Double	['numeric']
YEARS_BUILD_AVG	float64	Double	['numeric']
COMMONAREA_AVG	float64	Double	['numeric']
ELEVATORS_AVG	float64	Double	['numeric']
ENTRANCES_AVG	float64	Double	['numeric']
FLOORSMAX_AVG	float64	Double	['numeric']
FLOORSMIN_AVG	float64	Double	['numeric']
LANDAREA_AVG	float64	Double	['numeric']
LIVINGAPARTMENTS_AVG	float64	Double	['numeric']
LIVINGAREA_AVG	float64	Double	['numeric']
NONLIVINGAPARTMENTS_AVG	float64	Double	['numeric']
NONLIVINGAREA_AVG	float64	Double	['numeric']
APARTMENTS_MODE	float64	Double	['numeric']
BASEMENTAREA_MODE	float64	Double	['numeric']
YEARS_BEGINEXPLUATATION_MODE	float64	Double	['numeric']
YEARS_BUILD_MODE	float64	Double	['numeric']
COMMONAREA_MODE	float64	Double	['numeric']
ELEVATORS_MODE	float64	Double	['numeric']
ENTRANCES_MODE	float64	Double	['numeric']
FLOORSMAX_MODE	float64	Double	['numeric']
FLOORSMIN_MODE	float64	Double	['numeric']
LANDAREA_MODE	float64	Double	['numeric']
LIVINGAPARTMENTS_MODE	float64	Double	['numeric']
LIVINGAREA_MODE	float64	Double	['numeric']
NONLIVINGAPARTMENTS_MODE	float64	Double	['numeric']
NONLIVINGAREA_MODE	float64	Double	['numeric']
APARTMENTS_MEDI	float64	Double	['numeric']
BASEMENTAREA_MEDI	float64	Double	['numeric']
YEARS_BEGINEXPLUATATION_MEDI	float64	Double	['numeric']
YEARS_BUILD_MEDI	float64	Double	['numeric']
COMMONAREA_MEDI	float64	Double	['numeric']
ELEVATORS_MEDI	float64	Double	['numeric']
ENTRANCES_MEDI	float64	Double	['numeric']
FLOORSMAX_MEDI	float64	Double	['numeric']
FLOORSMIN_MEDI	float64	Double	['numeric']
LANDAREA_MEDI	float64	Double	['numeric']
LIVINGAPARTMENTS_MEDI	float64	Double	['numeric']
LIVINGAREA_MEDI	float64	Double	['numeric']
NONLIVINGAPARTMENTS_MEDI	float64	Double	['numeric']
NONLIVINGAREA_MEDI	float64	Double	['numeric']
FONDKAPREMONT_MODE	category	Categorical	['category']
HOUSETYPE_MODE	category	Categorical	['category']
TOTALAREA_MODE	float64	Double	['numeric']
WALLSMATERIAL_MODE	category	Categorical	['category']
EMERGENCYSTATE_MODE	boolean	BooleanNullable	[]
OBS_30_CNT_SOCIAL_CIRCLE	Int64	IntegerNullable	['numeric']
DEF_30_CNT_SOCIAL_CIRCLE	Int64	IntegerNullable	['numeric']
OBS_60_CNT_SOCIAL_CIRCLE	Int64	IntegerNullable	['numeric']
DEF_60_CNT_SOCIAL_CIRCLE	Int64	IntegerNullable	['numeric']
DAYS_LAST_PHONE_CHANGE	Int64	IntegerNullable	['numeric']
FLAG_DOCUMENT_2	bool	Boolean	[]
FLAG_DOCUMENT_3	bool	Boolean	[]
FLAG_DOCUMENT_4	bool	Boolean	[]
FLAG_DOCUMENT_5	bool	Boolean	[]
FLAG_DOCUMENT_6	bool	Boolean	[]
FLAG_DOCUMENT_7	bool	Boolean	[]
FLAG_DOCUMENT_8	bool	Boolean	[]
FLAG_DOCUMENT_9	bool	Boolean	[]
FLAG_DOCUMENT_10	bool	Boolean	[]
FLAG_DOCUMENT_11	bool	Boolean	[]
FLAG_DOCUMENT_12	bool	Boolean	[]
FLAG_DOCUMENT_13	bool	Boolean	[]
FLAG_DOCUMENT_14	bool	Boolean	[]
FLAG_DOCUMENT_15	bool	Boolean	[]
FLAG_DOCUMENT_16	bool	Boolean	[]
FLAG_DOCUMENT_17	bool	Boolean	[]
FLAG_DOCUMENT_18	bool	Boolean	[]
FLAG_DOCUMENT_19	bool	Boolean	[]
FLAG_DOCUMENT_20	bool	Boolean	[]
FLAG_DOCUMENT_21	bool	Boolean	[]
AMT_REQ_CREDIT_BUREAU_HOUR	Int64	IntegerNullable	['numeric']
AMT_REQ_CREDIT_BUREAU_DAY	Int64	IntegerNullable	['numeric']
AMT_REQ_CREDIT_BUREAU_WEEK	Int64	IntegerNullable	['numeric']
AMT_REQ_CREDIT_BUREAU_MON	Int64	IntegerNullable	['numeric']
AMT_REQ_CREDIT_BUREAU_QRT	Int64	IntegerNullable	['numeric']
AMT_REQ_CREDIT_BUREAU_YEAR	Int64	IntegerNullable	['numeric']

es['bureau'].ww

	Physical Type	Logical Type	Semantic Tag(s)
Column
SK_ID_CURR	int64	Integer	['foreign_key', 'numeric', 'ignore']
SK_ID_BUREAU	int64	Integer	['index']
CREDIT_ACTIVE	category	Categorical	['category']
CREDIT_CURRENCY	category	Categorical	['category']
DAYS_CREDIT	int64	Integer	['numeric']
CREDIT_DAY_OVERDUE	int64	Integer	['numeric']
DAYS_CREDIT_ENDDATE	Int64	IntegerNullable	['numeric']
DAYS_ENDDATE_FACT	Int64	IntegerNullable	['numeric']
AMT_CREDIT_MAX_OVERDUE	float64	Double	['numeric']
CNT_CREDIT_PROLONG	int64	Integer	['numeric']
AMT_CREDIT_SUM	float64	Double	['numeric']
AMT_CREDIT_SUM_DEBT	float64	Double	['numeric']
AMT_CREDIT_SUM_LIMIT	float64	Double	['numeric']
AMT_CREDIT_SUM_OVERDUE	float64	Double	['numeric']
CREDIT_TYPE	category	Categorical	['category']
DAYS_CREDIT_UPDATE	int64	Integer	['numeric']
AMT_ANNUITY	float64	Double	['numeric']

es['bureau_balance'].ww

	Physical Type	Logical Type	Semantic Tag(s)
Column
bureaubalance_index	int64	Integer	['index']
SK_ID_BUREAU	int64	Integer	['foreign_key', 'numeric', 'ignore']
MONTHS_BALANCE	int64	Integer	['numeric']
STATUS	category	Categorical	['category']

es['previous'].ww

	Physical Type	Logical Type	Semantic Tag(s)
Column
SK_ID_PREV	int64	Integer	['index']
SK_ID_CURR	int64	Integer	['foreign_key', 'numeric', 'ignore']
NAME_CONTRACT_TYPE	category	Categorical	['category']
AMT_ANNUITY	float64	Double	['numeric']
AMT_APPLICATION	float64	Double	['numeric']
AMT_CREDIT	float64	Double	['numeric']
AMT_DOWN_PAYMENT	float64	Double	['numeric']
AMT_GOODS_PRICE	float64	Double	['numeric']
WEEKDAY_APPR_PROCESS_START	category	Categorical	['category']
HOUR_APPR_PROCESS_START	category	Categorical	['category']
FLAG_LAST_APPL_PER_CONTRACT	bool	Boolean	[]
NFLAG_LAST_APPL_IN_DAY	bool	Boolean	[]
RATE_DOWN_PAYMENT	float64	Double	['numeric']
RATE_INTEREST_PRIMARY	float64	Double	['numeric']
RATE_INTEREST_PRIVILEGED	float64	Double	['numeric']
NAME_CASH_LOAN_PURPOSE	category	Categorical	['category']
NAME_CONTRACT_STATUS	category	Categorical	['category']
DAYS_DECISION	int64	Integer	['numeric']
NAME_PAYMENT_TYPE	category	Categorical	['category']
CODE_REJECT_REASON	category	Categorical	['category']
NAME_TYPE_SUITE	category	Categorical	['category']
NAME_CLIENT_TYPE	category	Categorical	['category']
NAME_GOODS_CATEGORY	category	Categorical	['category']
NAME_PORTFOLIO	category	Categorical	['category']
NAME_PRODUCT_TYPE	category	Categorical	['category']
CHANNEL_TYPE	category	Categorical	['category']
SELLERPLACE_AREA	int64	Integer	['numeric']
NAME_SELLER_INDUSTRY	category	Categorical	['category']
CNT_PAYMENT	Int64	IntegerNullable	['numeric']
NAME_YIELD_GROUP	category	Categorical	['category']
PRODUCT_COMBINATION	category	Categorical	['category']
DAYS_FIRST_DRAWING	Int64	IntegerNullable	['numeric']
DAYS_FIRST_DUE	Int64	IntegerNullable	['numeric']
DAYS_LAST_DUE_1ST_VERSION	Int64	IntegerNullable	['numeric']
DAYS_LAST_DUE	Int64	IntegerNullable	['numeric']
DAYS_TERMINATION	Int64	IntegerNullable	['numeric']
NFLAG_INSURED_ON_APPROVAL	category	Categorical	['category']

es['cash'].ww

	Physical Type	Logical Type	Semantic Tag(s)
Column
cash_index	int64	Integer	['index']
SK_ID_PREV	int64	Integer	['foreign_key', 'numeric', 'ignore']
SK_ID_CURR	int64	Integer	['numeric', 'ignore']
MONTHS_BALANCE	int64	Integer	['numeric']
CNT_INSTALMENT	Int64	IntegerNullable	['numeric']
CNT_INSTALMENT_FUTURE	Int64	IntegerNullable	['numeric']
NAME_CONTRACT_STATUS	category	Categorical	['category']
SK_DPD	int64	Integer	['numeric']
SK_DPD_DEF	int64	Integer	['numeric']

es['credit'].ww

	Physical Type	Logical Type	Semantic Tag(s)
Column
credit_index	int64	Integer	['index']
SK_ID_PREV	int64	Integer	['foreign_key', 'numeric', 'ignore']
SK_ID_CURR	int64	Integer	['numeric', 'ignore']
MONTHS_BALANCE	int64	Integer	['numeric']
AMT_BALANCE	float64	Double	['numeric']
AMT_CREDIT_LIMIT_ACTUAL	int64	Integer	['numeric']
AMT_DRAWINGS_ATM_CURRENT	float64	Double	['numeric']
AMT_DRAWINGS_CURRENT	float64	Double	['numeric']
AMT_DRAWINGS_OTHER_CURRENT	float64	Double	['numeric']
AMT_DRAWINGS_POS_CURRENT	float64	Double	['numeric']
AMT_INST_MIN_REGULARITY	float64	Double	['numeric']
AMT_PAYMENT_CURRENT	float64	Double	['numeric']
AMT_PAYMENT_TOTAL_CURRENT	float64	Double	['numeric']
AMT_RECEIVABLE_PRINCIPAL	float64	Double	['numeric']
AMT_RECIVABLE	float64	Double	['numeric']
AMT_TOTAL_RECEIVABLE	float64	Double	['numeric']
CNT_DRAWINGS_ATM_CURRENT	Int64	IntegerNullable	['numeric']
CNT_DRAWINGS_CURRENT	int64	Integer	['numeric']
CNT_DRAWINGS_OTHER_CURRENT	Int64	IntegerNullable	['numeric']
CNT_DRAWINGS_POS_CURRENT	Int64	IntegerNullable	['numeric']
CNT_INSTALMENT_MATURE_CUM	Int64	IntegerNullable	['numeric']
NAME_CONTRACT_STATUS	category	Categorical	['category']
SK_DPD	int64	Integer	['numeric']
SK_DPD_DEF	int64	Integer	['numeric']

es['installments'].ww

	Physical Type	Logical Type	Semantic Tag(s)
Column
installments_index	int64	Integer	['index']
SK_ID_PREV	int64	Integer	['foreign_key', 'numeric', 'ignore']
SK_ID_CURR	int64	Integer	['numeric', 'ignore']
NUM_INSTALMENT_VERSION	int64	Integer	['numeric']
NUM_INSTALMENT_NUMBER	int64	Integer	['numeric']
DAYS_INSTALMENT	int64	Integer	['numeric']
DAYS_ENTRY_PAYMENT	Int64	IntegerNullable	['numeric']
AMT_INSTALMENT	float64	Double	['numeric']
AMT_PAYMENT	float64	Double	['numeric']

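Adding interesting values#

Interesting values amount to WHERE clauses, i.e. conditional aggregation.

For example, the aggregation primitive mean produces MEAN(previous.AMT_CREDIT).

If we additionally use count as a where primitive and declare the interesting values {"NAME_CONTRACT_STATUS": ["Approved", "Refused"]}, two more features are generated:

COUNT(previous WHERE NAME_CONTRACT_STATUS = Approved)
COUNT(previous WHERE NAME_CONTRACT_STATUS = Refused)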

es.add_interesting_values(dataframe_name='previous', values= {
    "NAME_CONTRACT_STATUS": ["Approved", "Refused"]
})

es['previous'].ww.columns['NAME_CONTRACT_STATUS'].metadata

{'dataframe_name': 'previous',
 'entityset_id': 'clients',
 'interesting_values': ['Approved', 'Refused']}

The interesting values have indeed been attached to this column.

seed feature#

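Nothing special here: a seed feature is simply a column we construct by hand, which DFS then builds further features on.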

previous_application['AMT_CREDIT'].mean()

196114.0212179794

FLAG_LATED = ft.Feature(es['installments'].ww['DAYS_ENTRY_PAYMENT']) > ft.Feature(es['installments'].ww['DAYS_INSTALMENT'])

FLAG_LATED

<Feature: DAYS_ENTRY_PAYMENT > DAYS_INSTALMENT>

FLAG_LATED flags a condition: the payment was entered after the instalment's due date, i.e. it was late.

# Flag months in which the bureau loan was past due (STATUS '1' to '5')
FLAG_DUE = ft.Feature(es['bureau_balance'].ww['STATUS']).isin(['1', '2', '3', '4', '5'])

FLAG_DUE

<Feature: STATUS.isin(['1', '2', '3', '4', '5'])>

print(FLAG_DUE.column_schema.logical_type)

Boolean
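As a sanity check on what such a boolean seed feature turns into downstream, the following pandas sketch reproduces the late-payment flag and its per-loan share on toy data. The rows are invented; only the column names come from the installments table.

```python
import pandas as pd

# Toy instalment rows; days are relative offsets, so a larger value means later.
installments = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 10, 20],
    "DAYS_INSTALMENT": [-100, -70, -40, -30],
    "DAYS_ENTRY_PAYMENT": [-101, -65, -40, -20],
})

# Row-level flag, the pandas counterpart of the FLAG_LATED seed feature.
installments["FLAG_LATED"] = (
    installments["DAYS_ENTRY_PAYMENT"] > installments["DAYS_INSTALMENT"]
)

# What a percent_true aggregation of the flag amounts to per previous loan.
percent_late = installments.groupby("SK_ID_PREV")["FLAG_LATED"].mean()
print(percent_late)
```

Loan 10 has one late payment out of three, loan 20 has one out of one, so the aggregated values are 1/3 and 1.0.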

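Custom feature primitives#

With custom primitives we must be careful about performance, since the function is applied group by group.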

es['previous'].ww['NAME_CONTRACT_STATUS'].value_counts().sum()

class NormalizedModeCount(AggregationPrimitive):
    """Share of rows taken by the most frequent value (mode count / total count)."""
    name = 'normalized_mode_count'
    input_types = [ColumnSchema(semantic_tags={'category'})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def normalized_mode_count(column):
            if len(column) == 0:
                return 0
            counts = column.value_counts()
            if len(counts) == 0:
                return 0
            return counts.max()/counts.sum()
        return normalized_mode_count

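For NAME_CONTRACT_STATUS, for example, this is the share of a client's previous applications that ended in the most common status (the approval rate if Approved is the mode, the refusal rate if Refused is).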
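A quick standalone check of the same logic outside Featuretools. This sketch copies the inner function into a plain helper; the sample series is made up.

```python
import pandas as pd

def normalized_mode_count(column: pd.Series) -> float:
    """Count of the most frequent value divided by the number of non-null rows."""
    counts = column.value_counts()
    if len(counts) == 0:
        return 0
    return counts.max() / counts.sum()

statuses = pd.Series(["Approved", "Approved", "Refused", "Approved"])
share = normalized_mode_count(statuses)  # 3 of the 4 rows are the mode
empty_share = normalized_mode_count(pd.Series([], dtype="object"))
print(share, empty_share)
```

Note that `value_counts` skips missing values by default, so the denominator is the number of non-null rows rather than the group size.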

class MaxConsecutive(AggregationPrimitive):
    """Length of the longest run of consecutive True values; intended for boolean columns."""
    name = 'max_consecutive'
    input_types = [ColumnSchema(logical_type=Boolean)]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    def get_function(self):
        def max_consecutive(column):
            v = column.values
            if len(v) == 0:
                return 0

            # Pad with 0 on both ends so runs touching the edges still produce switch points
            calls = np.concatenate(([0], v, [0]))
            # Locate the 0 -> 1 and 1 -> 0 transitions
            diffs = np.diff(calls.astype(int))
            starts = np.where(diffs == 1)[0]
            ends = np.where(diffs == -1)[0]

            if len(starts) == 0:
                return 0
            # Run length = end index - start index
            return np.max(ends - starts)
        return max_consecutive
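The padding trick is easiest to follow on a concrete array. The sketch below repeats the same computation as a plain NumPy function (a standalone copy for illustration, not the primitive itself) and traces one input.

```python
import numpy as np

def max_consecutive(values) -> int:
    """Longest run of truthy values, using the pad-and-diff approach."""
    v = np.asarray(values, dtype=int)
    if len(v) == 0:
        return 0
    padded = np.concatenate(([0], v, [0]))
    diffs = np.diff(padded)
    starts = np.where(diffs == 1)[0]   # positions where a run begins
    ends = np.where(diffs == -1)[0]    # positions just past where a run ends
    if len(starts) == 0:
        return 0
    return int(np.max(ends - starts))

flags = [True, True, False, True, True, True, False]
# padded = [0, 1, 1, 0, 1, 1, 1, 0, 0]
# diffs  = [1, 0, -1, 1, 0, 0, -1, 0]  -> starts = [0, 3], ends = [2, 6]
longest = max_consecutive(flags)
print(longest)
```

Because every run that starts also ends (thanks to the trailing pad), `starts` and `ends` always have the same length and can be subtracted element-wise.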

Warning

We must be clear about which primitives apply to which columns!
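One way to control this explicitly is the `primitive_options` argument of `ft.dfs`. The sketch below only builds the options dictionary. The option keys follow the Featuretools convention as we understand it, while the particular dataframe and column choices are illustrative assumptions, not settings used in this notebook.

```python
# Illustrative only: the dataframe/column selections are hypothetical.
primitive_options = {
    # Apply the custom category primitive to two child tables only.
    "normalized_mode_count": {
        "include_dataframes": ["previous", "bureau"],
    },
    # For the `previous` table, restrict max_consecutive to one boolean column.
    "max_consecutive": {
        "include_columns": {
            "previous": ["NFLAG_LAST_APPL_IN_DAY"],
        },
    },
}

# Hypothetical call, not executed here:
# ft.dfs(entityset=es, target_dataframe_name="app",
#        primitive_options=primitive_options)
```

Narrowing primitives this way should cut down on features stacked across unrelated tables, but it is worth inspecting the resulting feature list, since the exact interaction between the include and ignore options is best confirmed against the Featuretools documentation.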

dfs#

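Feature-count estimate: with 100 columns in the main table, 50 columns in the child tables, 13 categorical values registered as interesting values, and 5 primitives, we get roughly 50 * 5 * 3 + 100 features.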

Foreign-key and index columns need no attention here.

%%time
default_agg_primitives = [
    "count",  # index
    "mean", "max", "sum", "std",  # numeric
    "mode", "num_unique", # categorical
    'percent_true' # boolean
    ]
default_trans_primitives =  ["month", "weekday"]

# Returns the feature matrix and the list of feature definitions
feature_matrix, features = ft.dfs(
    entityset = es,
    target_dataframe_name = 'app', # everything is ultimately aggregated onto this main table
    agg_primitives= default_agg_primitives + [NormalizedModeCount, MaxConsecutive],
    trans_primitives=default_trans_primitives,
    max_depth=2,
    seed_features=[FLAG_LATED, FLAG_DUE],
    # n_jobs=2,        # use 2 cores
    where_primitives=['count', 'mean', 'percent_true'],
)

CPU times: total: 4h 41min 18s
Wall time: 5h 46min 14s

feature_matrix.to_parquet("ft_tuning_feature_matrix.parquet")
ft.save_features(features, "ft_tuning_feature_definitions.json")

Noted run time: 1 h 30 min (the %%time output above reports a wall time of 5 h 46 min for the recorded run).

We need to check that our custom primitives and seed features actually took effect.

[f for f in features if f.primitive.name == 'normalized_mode_count']

[<Feature: NORMALIZED_MODE_COUNT(bureau.CREDIT_ACTIVE)>,
 <Feature: NORMALIZED_MODE_COUNT(bureau.CREDIT_CURRENCY)>,
 <Feature: NORMALIZED_MODE_COUNT(bureau.CREDIT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(bureau_balance.STATUS)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.CHANNEL_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.CODE_REJECT_REASON)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.HOUR_APPR_PROCESS_START)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.NAME_CASH_LOAN_PURPOSE)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.NAME_CLIENT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.NAME_CONTRACT_STATUS)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.NAME_CONTRACT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.NAME_GOODS_CATEGORY)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.NAME_PAYMENT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.NAME_PORTFOLIO)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.NAME_PRODUCT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.NAME_SELLER_INDUSTRY)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.NAME_TYPE_SUITE)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.NAME_YIELD_GROUP)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.NFLAG_INSURED_ON_APPROVAL)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.PRODUCT_COMBINATION)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.WEEKDAY_APPR_PROCESS_START)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.NAME_CONTRACT_STATUS)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.NAME_CONTRACT_STATUS)>,
 <Feature: NORMALIZED_MODE_COUNT(bureau.MODE(bureau_balance.STATUS))>,
 <Feature: NORMALIZED_MODE_COUNT(bureau_balance.bureau.CREDIT_ACTIVE)>,
 <Feature: NORMALIZED_MODE_COUNT(bureau_balance.bureau.CREDIT_CURRENCY)>,
 <Feature: NORMALIZED_MODE_COUNT(bureau_balance.bureau.CREDIT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(previous.MODE(cash.NAME_CONTRACT_STATUS))>,
 <Feature: NORMALIZED_MODE_COUNT(previous.MODE(credit.NAME_CONTRACT_STATUS))>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.CHANNEL_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.CODE_REJECT_REASON)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.HOUR_APPR_PROCESS_START)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.NAME_CASH_LOAN_PURPOSE)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.NAME_CLIENT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.NAME_CONTRACT_STATUS)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.NAME_CONTRACT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.NAME_GOODS_CATEGORY)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.NAME_PAYMENT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.NAME_PORTFOLIO)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.NAME_PRODUCT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.NAME_SELLER_INDUSTRY)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.NAME_TYPE_SUITE)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.NAME_YIELD_GROUP)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.NFLAG_INSURED_ON_APPROVAL)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.PRODUCT_COMBINATION)>,
 <Feature: NORMALIZED_MODE_COUNT(cash.previous.WEEKDAY_APPR_PROCESS_START)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.CHANNEL_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.CODE_REJECT_REASON)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.HOUR_APPR_PROCESS_START)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.NAME_CASH_LOAN_PURPOSE)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.NAME_CLIENT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.NAME_CONTRACT_STATUS)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.NAME_CONTRACT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.NAME_GOODS_CATEGORY)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.NAME_PAYMENT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.NAME_PORTFOLIO)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.NAME_PRODUCT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.NAME_SELLER_INDUSTRY)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.NAME_TYPE_SUITE)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.NAME_YIELD_GROUP)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.NFLAG_INSURED_ON_APPROVAL)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.PRODUCT_COMBINATION)>,
 <Feature: NORMALIZED_MODE_COUNT(installments.previous.WEEKDAY_APPR_PROCESS_START)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.CHANNEL_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.CODE_REJECT_REASON)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.HOUR_APPR_PROCESS_START)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.NAME_CASH_LOAN_PURPOSE)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.NAME_CLIENT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.NAME_CONTRACT_STATUS)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.NAME_CONTRACT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.NAME_GOODS_CATEGORY)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.NAME_PAYMENT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.NAME_PORTFOLIO)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.NAME_PRODUCT_TYPE)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.NAME_SELLER_INDUSTRY)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.NAME_TYPE_SUITE)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.NAME_YIELD_GROUP)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.NFLAG_INSURED_ON_APPROVAL)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.PRODUCT_COMBINATION)>,
 <Feature: NORMALIZED_MODE_COUNT(credit.previous.WEEKDAY_APPR_PROCESS_START)>]

[f for f in features if f.primitive.name == 'max_consecutive']

[<Feature: MAX_CONSECUTIVE(previous.FLAG_LAST_APPL_PER_CONTRACT)>,
 <Feature: MAX_CONSECUTIVE(previous.NFLAG_LAST_APPL_IN_DAY)>,
 <Feature: MAX_CONSECUTIVE(bureau_balance.STATUS.isin(['1', '2', '3', '4', '5']))>,
 <Feature: MAX_CONSECUTIVE(cash.previous.FLAG_LAST_APPL_PER_CONTRACT)>,
 <Feature: MAX_CONSECUTIVE(cash.previous.NFLAG_LAST_APPL_IN_DAY)>,
 <Feature: MAX_CONSECUTIVE(installments.previous.FLAG_LAST_APPL_PER_CONTRACT)>,
 <Feature: MAX_CONSECUTIVE(installments.previous.NFLAG_LAST_APPL_IN_DAY)>,
 <Feature: MAX_CONSECUTIVE(credit.previous.FLAG_LAST_APPL_PER_CONTRACT)>,
 <Feature: MAX_CONSECUTIVE(credit.previous.NFLAG_LAST_APPL_IN_DAY)>]

[f for f in features if '>' in f.get_name()]

[<Feature: PERCENT_TRUE(installments.DAYS_ENTRY_PAYMENT > DAYS_INSTALMENT)>,
 <Feature: MAX(previous.PERCENT_TRUE(installments.DAYS_ENTRY_PAYMENT > DAYS_INSTALMENT))>,
 <Feature: MEAN(previous.PERCENT_TRUE(installments.DAYS_ENTRY_PAYMENT > DAYS_INSTALMENT))>,
 <Feature: MEAN(previous.PERCENT_TRUE(installments.DAYS_ENTRY_PAYMENT > DAYS_INSTALMENT) WHERE NAME_CONTRACT_STATUS = Approved)>,
 <Feature: MEAN(previous.PERCENT_TRUE(installments.DAYS_ENTRY_PAYMENT > DAYS_INSTALMENT) WHERE NAME_CONTRACT_STATUS = Refused)>,
 <Feature: STD(previous.PERCENT_TRUE(installments.DAYS_ENTRY_PAYMENT > DAYS_INSTALMENT))>,
 <Feature: SUM(previous.PERCENT_TRUE(installments.DAYS_ENTRY_PAYMENT > DAYS_INSTALMENT))>,
 <Feature: PERCENT_TRUE(installments.DAYS_ENTRY_PAYMENT > DAYS_INSTALMENT WHERE previous.NAME_CONTRACT_STATUS = Refused)>,
 <Feature: PERCENT_TRUE(installments.DAYS_ENTRY_PAYMENT > DAYS_INSTALMENT WHERE previous.NAME_CONTRACT_STATUS = Approved)>]

[f for f in features if 'isin' in f.get_name()]

[<Feature: MAX_CONSECUTIVE(bureau_balance.STATUS.isin(['1', '2', '3', '4', '5']))>,
 <Feature: PERCENT_TRUE(bureau_balance.STATUS.isin(['1', '2', '3', '4', '5']))>,
 <Feature: MAX(bureau.MAX_CONSECUTIVE(bureau_balance.STATUS.isin(['1', '2', '3', '4', '5'])))>,
 <Feature: MAX(bureau.PERCENT_TRUE(bureau_balance.STATUS.isin(['1', '2', '3', '4', '5'])))>,
 <Feature: MEAN(bureau.MAX_CONSECUTIVE(bureau_balance.STATUS.isin(['1', '2', '3', '4', '5'])))>,
 <Feature: MEAN(bureau.PERCENT_TRUE(bureau_balance.STATUS.isin(['1', '2', '3', '4', '5'])))>,
 <Feature: STD(bureau.MAX_CONSECUTIVE(bureau_balance.STATUS.isin(['1', '2', '3', '4', '5'])))>,
 <Feature: STD(bureau.PERCENT_TRUE(bureau_balance.STATUS.isin(['1', '2', '3', '4', '5'])))>,
 <Feature: SUM(bureau.MAX_CONSECUTIVE(bureau_balance.STATUS.isin(['1', '2', '3', '4', '5'])))>,
 <Feature: SUM(bureau.PERCENT_TRUE(bureau_balance.STATUS.isin(['1', '2', '3', '4', '5'])))>]

The features were added as we expected.

modeling#

LightGBM does not need one-hot encoding; it can handle categorical columns directly.
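A minimal sketch of the preparation this relies on, assuming the categorical columns should end up with pandas' `category` dtype (which LightGBM's scikit-learn wrapper picks up automatically with its default settings). The toy frame is made up; in this notebook the feature matrix may already carry that dtype from Woodwork.

```python
import pandas as pd

# Toy frame; the column names are borrowed from the data, the values are invented.
df = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "AMT_CREDIT": [100.0, 200.0, 300.0],
})

# Cast the categorical columns instead of expanding them with one-hot encoding.
cat_cols = ["NAME_CONTRACT_TYPE"]
df[cat_cols] = df[cat_cols].astype("category")

print(df.dtypes)
```

The frame keeps its original width (two columns here), whereas one-hot encoding would add one column per distinct category.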

feature_matrix = pd.read_parquet("ft_tuning_feature_matrix.parquet")

feature_matrix.shape

(356255, 1891)

final_fm = feature_matrix.reset_index()

final_fm['TARGET']

0            1
1            0
2            0
3            0
4            0
          ... 
356250    <NA>
356251    <NA>
356252    <NA>
356253    <NA>
356254    <NA>
Name: TARGET, Length: 356255, dtype: Int64

final_fm = pd.merge(final_fm, app_set, on='SK_ID_CURR', how='left')

train = final_fm[final_fm['set'] == 'train']
test = final_fm[final_fm['set'] == 'test']

train, test = train.align(test, join = 'inner', axis = 1)
train = train.drop(columns=['set'])
test = test.drop(columns = ['TARGET', 'set'])
print(train.shape, test.shape)

(307511, 1892) (48744, 1891)

train_labels = train['TARGET']
train_ids = train['SK_ID_CURR']
test_ids = test['SK_ID_CURR']
train_features = train.drop(columns=['TARGET', 'SK_ID_CURR'])
test_features = test.drop(columns=['SK_ID_CURR'])

import re
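# 1. Define the clean-up function
def clean_names(df):
    # Replace every run of characters other than letters, digits and underscores with "_"
    # The pattern [^A-Za-z0-9_] matches spaces, slashes, brackets and all other special characters
    df.columns = [re.sub(r'[^A-Za-z0-9_]+', '_', col) for col in df.columns]
    # Also collapse repeated underscores such as "__" and trim them from both ends
    df.columns = [re.sub(r'_+', '_', col).strip('_') for col in df.columns]
    return df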
    

train_features = clean_names(train_features)
test_features = clean_names(test_features)
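One caveat with this kind of renaming is that two distinct feature names can collapse into the same cleaned name, and LightGBM is expected to reject duplicate feature names. The sketch below applies the same two regular expressions to single names and checks for collisions; `clean_name` and the `A > B` / `A < B` names are hypothetical and used only for illustration.

```python
import re
from collections import Counter

def clean_name(name: str) -> str:
    """Same two substitutions as clean_names, applied to one column name."""
    name = re.sub(r"[^A-Za-z0-9_]+", "_", name)
    return re.sub(r"_+", "_", name).strip("_")

names = [
    "MEAN(previous.AMT_CREDIT WHERE NAME_CONTRACT_STATUS = Approved)",
    "A > B",  # hypothetical names that differ only in a
    "A < B",  # character the regular expression removes
]
cleaned = [clean_name(n) for n in names]

# Any cleaned name that occurs more than once is a collision.
duplicates = [n for n, k in Counter(cleaned).items() if k > 1]
print(cleaned)
print(duplicates)
```

If `duplicates` is non-empty on the real column list, appending a numeric suffix to the repeated names is one simple way out.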

from lightgbm import LGBMClassifier
lgbm_model = LGBMClassifier(
    n_estimators=100,      # number of trees (the counterpart of max_iter)
    learning_rate=0.1,     # learning rate
    max_depth=3,           # maximum tree depth
    random_state=42,       # keep results reproducible
)
lgbm_model.fit(train_features, train_labels)

[LightGBM] [Info] Number of positive: 24825, number of negative: 282686
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 2.594608 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 293128
[LightGBM] [Info] Number of data points in the train set: 307511, number of used features: 1666
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080729 -> initscore=-2.432486
[LightGBM] [Info] Start training from score -2.432486
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf

LGBMClassifier(max_depth=3, n_jobs=-1, random_state=42)

features_importance = pd.DataFrame(
    {
        'importance': lgbm_model.feature_importances_,
        'feature': lgbm_model.feature_name_
    }
)
features_importance_plot = features_importance.sort_values(by='importance', ascending=False).head(20)

plt.figure(figsize=(8, 6), dpi=100) 
sns.barplot(data=features_importance_plot, x='importance', y='feature')

plt.yticks(fontsize=7) # shrink the tick labels so long feature names fit
plt.title('Feature Importance', fontsize=14)
plt.tight_layout()

[Figure: horizontal bar chart "Feature Importance" showing the 20 most important features of the LightGBM model]

import time
import os

def submit(ids, pred, name, feature_count=None):
    """
    ids: SK_ID_CURR values of the test set
    pred: predicted probabilities from the model
    name: experiment tag (e.g. 'lgb_v1', 'baseline')
    feature_count: optional, number of features the model used
    """
    # 1. Build the submission DataFrame
    submit_df = pd.DataFrame({
        'SK_ID_CURR': ids,
        'TARGET': pred
    })

    # 2. Timestamp (format: 0213_1530)
    timestamp = time.strftime("%m%d_%H%M")

    # 3. Build the file name
    # Format: 0213_1530_lgb_v1_f542.csv
    f_str = f"_f{feature_count}" if feature_count else ""
    filename = f"{timestamp}_{name}{f_str}.csv"

    # 4. Make sure the output directory exists
    os.makedirs('submissions', exist_ok=True)

    save_path = os.path.join('submissions', filename)

    # 5. Save the file
    submit_df.to_csv(save_path, index=False)

    return submit_df

lgbm_model_pred = lgbm_model.predict_proba(test_features)

submit_df = submit(test['SK_ID_CURR'], lgbm_model_pred[:, 1], 
    name='lgbm_baseline',
    feature_count=train_features.shape[1]
    )
submit_df.head()

	SK_ID_CURR	TARGET
307511	100001	0.072112
307512	100005	0.162662
307513	100013	0.029653
307514	100028	0.034053
307515	100038	0.139559

The submission scores about 76 (i.e. roughly 0.76 AUC), about the same as before.
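Assuming the score above is the ROC AUC used to rank Home Credit submissions, the sketch below shows what that number measures: the probability that a randomly chosen positive gets a higher predicted score than a randomly chosen negative. The `roc_auc` helper is a deliberately naive pairwise version for illustration; in practice a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect one, so 0.76 means about three out of four such pairs are ordered correctly.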