Competition/Kaggle

[Kaggle][Code-along] Costa Rican Household Poverty (2)

bisi 2020. 9. 20. 12:43

The topic this time is Costa Rican Household Poverty.

The goal is to predict income qualification for some of the world's poorest households, on behalf of the Inter-American Development Bank.

 

The world's poorest households typically struggle to prove their eligibility, so in Latin America income qualification is verified with algorithms.

For example, a Proxy Means Test (PMT) considers a family's observable household attributes, such as the material of the walls and ceiling or the assets found in the home.

 

Based on this, a variety of features are provided; we use LGBMClassifier to predict income qualification.

 

This code-along follows Youhan Lee's kernel.

 

 

Contents

Costa Rican Household Poverty (1)


1. Check datasets

1.1 Read datasets

1.2 Make description df

1.3 Check null data

1.4 Fill missing values

Costa Rican Household Poverty (2)


2. Feature Engineering

2.1 Object features

2.1.1 dependency

2.1.2 edjefe

2.1.3 edjefa

2.1.4 roof and electricity

 

2.2 Extract categorical features

 

2.3 Create new features from continuous features

2.3.1 Extract continuous feature columns

2.3.2 Create new features

2.3.3 Rent per family features

2.3.4 Rooms per family features

2.3.5 Bedrooms per family features

2.3.6 Tablets per family features

2.3.7 Phones per family features

2.3.8 Years behind in school per family features

2.3.9 Rich features

 

2.4 Aggregate features

Costa Rican Household Poverty (3)


3. Feature Selection Using shap

4. Model Development

4.1 Prediction and feature importance with LightGBM

4.2 Randomized Search

 

 


 

Costa Rican Household Poverty Level Prediction

 

2. Feature Engineering

2.1 Object features

In [33]:
features_object = [col for col in df_train.columns if df_train[col].dtype == 'object']
features_object
Out[33]:
['Id', 'idhogar', 'dependency', 'edjefe', 'edjefa']
 

dependency

In [34]:
df_train['dependency'] = np.sqrt(df_train['SQBdependency'])
df_test['dependency'] = np.sqrt(df_test['SQBdependency'])
 

edjefe

  • Years of education of the male head of household.
  • Replace yes -> 1, no -> 0.
In [35]:
def replace_edjefe(x):
    if x == 'yes':
        return 1
    elif x == 'no':
        return 0
    else:
        return x
    
df_train['edjefe'] = df_train['edjefe'].apply(replace_edjefe).astype(float)
df_test['edjefe'] = df_test['edjefe'].apply(replace_edjefe).astype(float)
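The same yes/no conversion can also be written without `apply`, using pandas' `replace` with a dict; numeric strings pass through untouched. A minimal sketch on a hypothetical toy column:

```python
import pandas as pd

# Toy column with the same mix of 'yes'/'no' and numeric strings as edjefe.
s = pd.Series(['yes', 'no', '6', '11'])

# Map the two labels, then cast; values not in the dict are left as-is.
converted = s.replace({'yes': 1, 'no': 0}).astype(float)
```

`replace` avoids the per-row Python call that `apply` incurs, which matters little here but is the idiomatic form.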
 

edjefa

  • Years of education of the female head of household.
  • Replace yes -> 1, no -> 0.
In [36]:
def replace_edjefa(x):
    if x == 'yes':
        return 1
    elif x == 'no':
        return 0
    else:
        return x

df_train['edjefa'] = df_train['edjefa'].apply(replace_edjefa).astype(float)
df_test['edjefa'] = df_test['edjefa'].apply(replace_edjefa).astype(float)
In [37]:
# create a feature taking the larger of the female/male head-of-household education years
df_train['edjef'] = np.max(df_train[['edjefa','edjefe']], axis=1)
df_test['edjef'] = np.max(df_test[['edjefa','edjefe']], axis=1)
 

roof and electricity

In [38]:
# initialize
df_train['roof_waste_material'] = np.nan
df_test['roof_waste_material'] = np.nan
df_train['electricity_other'] = np.nan
df_test['electricity_other'] = np.nan

def fill_roof_exception(x):
    if (x['techozinc'] == 0) and (x['techoentrepiso'] == 0) and (x['techocane'] == 0) and (x['techootro'] == 0) :
        return 1
    else:
        return 0
    
def fill_no_electricity(x):
    if (x['public'] == 0) and (x['planpri'] == 0) and (x['noelec'] == 0) and (x['coopele'] == 0) :
        return 1
    else:
        return 0
    
df_train['roof_waste_material'] = df_train.apply(lambda x : fill_roof_exception(x), axis=1)
df_test['roof_waste_material'] = df_test.apply(lambda x : fill_roof_exception(x), axis=1)
df_train['electricity_other'] = df_train.apply(lambda x : fill_no_electricity(x), axis=1)
df_test['electricity_other'] = df_test.apply(lambda x : fill_no_electricity(x), axis=1)
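The row-wise `apply` above can be vectorized: summing the binary indicator columns and testing for zero marks the rows that match none of the listed categories. A sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical rows with the four roof-material indicators used above.
df = pd.DataFrame({
    'techozinc':      [1, 0],
    'techoentrepiso': [0, 0],
    'techocane':      [0, 0],
    'techootro':      [0, 0],
})

roof_cols = ['techozinc', 'techoentrepiso', 'techocane', 'techootro']
# A row whose indicators are all 0 falls outside every listed material.
df['roof_waste_material'] = (df[roof_cols].sum(axis=1) == 0).astype(int)
```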
 

2.2 Extract categorical features

In [39]:
binary_cat_features = [col for col in df_train.columns if df_train[col].value_counts().shape[0] == 2]
len(binary_cat_features) # extract binary categorical features
Out[39]:
103
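`value_counts().shape[0]` counts distinct non-null values, so `nunique()` gives the same answer more directly; a toy sketch:

```python
import pandas as pd

df = pd.DataFrame({'binary': [0, 1, 1, 0],
                   'multi':  [1, 2, 3, 1]})

# nunique() counts distinct non-null values, like value_counts().shape[0].
binary_cols = [c for c in df.columns if df[c].nunique() == 2]
```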
 

2.3 Create new features from continuous features

 

Extract continuous feature columns

In [40]:
continuous_features = [col for col in df_train.columns if col not in binary_cat_features] # exclude binary (two-value) features
continuous_features = [col for col in continuous_features if col not in features_object]
continuous_features = [col for col in continuous_features if col not in ['Id', 'Target','idhogar']]
In [41]:
print('There are {} continuous features'.format(len(continuous_features)))
for col in continuous_features:
    print('{}: {}'.format(col, description_ko.loc[description_ko['varname'] == col, 'description'].values))
 
There are 37 continuous features
v2a1: ['Monthly rent payment']
rooms: ['number of all rooms in the house']
v18q1: ['number of tablets the household owns']
r4h1: ['Males younger than 12 years of age']
r4h2: ['Males 12 years of age and older']
r4h3: ['Total males in the household']
r4m1: ['Females younger than 12 years of age']
r4m2: ['Females 12 years of age and older']
r4m3: ['Total females in the household']
r4t1: ['persons younger than 12 years of age']
r4t2: ['persons 12 years of age and older']
r4t3: ['Total persons in the household']
tamhog: ['size of the household']
tamviv: ['number of persons living in the household']
escolari: ['years of schooling']
rez_esc: ['Years behind in school']
hhsize: ['household size']
elimbasu5: ['=1 if rubbish disposal mainly by throwing in river, creek or sea']
hogar_nin: ['Number of children 0 to 19 in household']
hogar_adul: ['Number of adults in household']
hogar_mayor: ['# of individuals 65+ in the household']
hogar_total: ['# of total individuals in the household']
meaneduc: ['average years of education for adults (18+)']
bedrooms: ['number of bedrooms']
overcrowding: ['# persons per room']
qmobilephone: ['# of mobile phones']
age: ['Age in years']
SQBescolari: ['escolari squared']
SQBage: ['age squared']
SQBhogar_total: ['hogar_total squared']
SQBedjefe: ['edjefe squared']
SQBhogar_nin: ['hogar_nin squared']
SQBovercrowding: ['overcrowding squared']
SQBdependency: ['dependency squared']
SQBmeaned: ['meaneduc squared']
agesq: ['age squared']
edjef: []
 
  • hhsize : household size
  • tamhog : size of the household

The two features mean the same thing, so tamhog is dropped.

In [42]:
df_train.drop('tamhog', axis=1, inplace=True)
df_test.drop('tamhog', axis=1, inplace=True)
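Before dropping, it can be worth confirming that the two columns really agree row by row; a sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame where the two size columns agree on every row.
df = pd.DataFrame({'hhsize': [3, 4, 2], 'tamhog': [3, 4, 2]})

# If the columns are identical, one of them carries no extra information.
if df['hhsize'].equals(df['tamhog']):
    df = df.drop('tamhog', axis=1)
```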
 

Create new features

 
  • Squared features

    • There are many squared features. Tree models such as LightGBM do not actually need them, but this kernel uses the classifier's embedded feature filtering with LightGBM as the estimator, so we leave them in for now.
  • Family features

    • hogar_nin, hogar_adul, hogar_mayor, hogar_total, r4h1, r4h2, r4h3, r4m1, r4m2, r4m3, r4t1, r4t2, r4t3, tamhog, tamviv, rez_esc, escolari
    • Family size features (extract and compute ratios)
In [43]:
df_train['adult'] = df_train['hogar_adul'] - df_train['hogar_mayor'] # adults under 65
df_train['dependency_count'] = df_train['hogar_nin'] + df_train['hogar_mayor'] # children (0-19) plus seniors (65+)
df_train['dependency'] = df_train['dependency_count'] / df_train['adult'] # dependency ratio: dependents / adults
df_train['child_percent'] = df_train['hogar_nin'] / df_train['hogar_total'] # share of children
df_train['elder_percent'] = df_train['hogar_mayor'] / df_train['hogar_total'] # share of seniors
df_train['adult_percent'] = df_train['hogar_adul'] / df_train['hogar_total'] # share of adults
df_train['males_younger_12_years_percent'] = df_train['r4h1'] / df_train['hogar_total'] # share of males under 12
df_train['males_older_12_years_percent'] = df_train['r4h2'] / df_train['hogar_total'] # share of males 12 and older
df_train['males_percent'] = df_train['r4h3'] / df_train['hogar_total'] # share of males among all members
df_train['females_younger_12_years_percent'] = df_train['r4m1'] / df_train['hogar_total'] # share of females under 12
df_train['females_older_12_years_percent'] = df_train['r4m2'] / df_train['hogar_total'] # share of females 12 and older
df_train['females_percent'] = df_train['r4m3'] / df_train['hogar_total'] # share of females among all members
df_train['persons_younger_12_years_percent'] = df_train['r4t1'] / df_train['hogar_total'] # share of members under 12
df_train['persons_older_12_years_percent'] = df_train['r4t2'] / df_train['hogar_total'] # share of members 12 and older
df_train['persons_percent'] = df_train['r4t3'] / df_train['hogar_total'] # share of all persons among household members
In [44]:
# same family-size ratios for the test set
df_test['adult'] = df_test['hogar_adul'] - df_test['hogar_mayor'] # adults under 65
df_test['dependency_count'] = df_test['hogar_nin'] + df_test['hogar_mayor'] # children (0-19) plus seniors (65+)
df_test['dependency'] = df_test['dependency_count'] / df_test['adult'] # dependency ratio: dependents / adults
df_test['child_percent'] = df_test['hogar_nin'] / df_test['hogar_total'] # share of children
df_test['elder_percent'] = df_test['hogar_mayor'] / df_test['hogar_total'] # share of seniors
df_test['adult_percent'] = df_test['hogar_adul'] / df_test['hogar_total'] # share of adults
df_test['males_younger_12_years_percent'] = df_test['r4h1'] / df_test['hogar_total'] # share of males under 12
df_test['males_older_12_years_percent'] = df_test['r4h2'] / df_test['hogar_total'] # share of males 12 and older
df_test['males_percent'] = df_test['r4h3'] / df_test['hogar_total'] # share of males among all members
df_test['females_younger_12_years_percent'] = df_test['r4m1'] / df_test['hogar_total'] # share of females under 12
df_test['females_older_12_years_percent'] = df_test['r4m2'] / df_test['hogar_total'] # share of females 12 and older
df_test['females_percent'] = df_test['r4m3'] / df_test['hogar_total'] # share of females among all members
df_test['persons_younger_12_years_percent'] = df_test['r4t1'] / df_test['hogar_total'] # share of members under 12
df_test['persons_older_12_years_percent'] = df_test['r4t2'] / df_test['hogar_total'] # share of members 12 and older
df_test['persons_percent'] = df_test['r4t3'] / df_test['hogar_total'] # share of all persons among household members
In [45]:
# same ratios relative to household size (hhsize)
df_train['males_younger_12_years_in_householde_size'] = df_train['r4h1'] / df_train['hhsize']
df_train['males_older_12_years_in_householde_size'] = df_train['r4h2'] / df_train['hhsize']
df_train['males_in_householde_size'] = df_train['r4h3'] / df_train['hhsize']
df_train['females_younger_12_years_in_householde_size'] = df_train['r4m1'] / df_train['hhsize']
df_train['females_older_12_years_in_householde_size'] = df_train['r4m2'] / df_train['hhsize']
df_train['females_in_householde_size'] = df_train['r4m3'] / df_train['hhsize']
df_train['persons_younger_12_years_in_householde_size'] = df_train['r4t1'] / df_train['hhsize']
df_train['persons_older_12_years_in_householde_size'] = df_train['r4t2'] / df_train['hhsize']
df_train['persons_in_householde_size'] = df_train['r4t3'] / df_train['hhsize']
In [46]:
# same ratios relative to household size (hhsize)
df_test['males_younger_12_years_in_householde_size'] = df_test['r4h1'] / df_test['hhsize']
df_test['males_older_12_years_in_householde_size'] = df_test['r4h2'] / df_test['hhsize']
df_test['males_in_householde_size'] = df_test['r4h3'] / df_test['hhsize']
df_test['females_younger_12_years_in_householde_size'] = df_test['r4m1'] / df_test['hhsize']
df_test['females_older_12_years_in_householde_size'] = df_test['r4m2'] / df_test['hhsize']
df_test['females_in_householde_size'] = df_test['r4m3'] / df_test['hhsize']
df_test['persons_younger_12_years_in_householde_size'] = df_test['r4t1'] / df_test['hhsize']
df_test['persons_older_12_years_in_householde_size'] = df_test['r4t2'] / df_test['hhsize']
df_test['persons_in_householde_size'] = df_test['r4t3'] / df_test['hhsize']
In [47]:
# average of the bedroom- and room-overcrowding flags (hacdor, hacapo)
df_train['overcrowding_room_and_bedroom'] = (df_train['hacdor'] + df_train['hacapo'])/2
df_test['overcrowding_room_and_bedroom'] = (df_test['hacdor'] + df_test['hacapo'])/2
In [48]:
# years of schooling relative to age
df_train['escolari_age'] = df_train['escolari'] / df_train['age']
df_test['escolari_age'] = df_test['escolari'] / df_test['age']

# members aged 12-19 (children 0-19 minus persons under 12)
df_train['age_12_19'] = df_train['hogar_nin'] - df_train['r4t1']
df_test['age_12_19'] = df_test['hogar_nin'] - df_test['r4t1']
In [49]:
# per-capita ratios over the number of persons actually living in the household (tamviv)
df_train['phones-per-capita'] = df_train['qmobilephone'] / df_train['tamviv']
df_train['tablets-per-capita'] = df_train['v18q1'] / df_train['tamviv']
df_train['rooms-per-capita'] = df_train['rooms'] / df_train['tamviv']
df_train['rent-per-capita'] = df_train['v2a1'] / df_train['tamviv']

df_test['phones-per-capita'] = df_test['qmobilephone'] / df_test['tamviv']
df_test['tablets-per-capita'] = df_test['v18q1'] / df_test['tamviv']
df_test['rooms-per-capita'] = df_test['rooms'] / df_test['tamviv']
df_test['rent-per-capita'] = df_test['v2a1'] / df_test['tamviv']
 
  • Note that "Total persons in the household" (r4t3) and "# of total individuals in the household" (hogar_total) are not always equal.
  • It is a bit odd, but we leave it as is for now.
In [50]:
(df_train['hogar_total'] == df_train['r4t3']).sum()
Out[50]:
9509
 

Rent per family features

In [51]:
family_size_features= ['adult', 'hogar_adul', 'hogar_mayor', 'hogar_nin', 'hogar_total', 'r4h1', 
                        'r4h2', 'r4h3', 'r4m1', 'r4m2', 'r4m3', 'r4t1', 'r4t2', 'r4t3', 'hhsize']
new_feats = []
for col in family_size_features:
    new_col_name = 'new_{}_per_{}'.format('v2a1', col)
    new_feats.append(new_col_name)
    df_train[new_col_name] = df_train['v2a1'] / df_train[col]
    df_test[new_col_name] = df_test['v2a1'] / df_test[col]
 
  • Ratio features can go to infinity when the denominator is 0. Such values are set to 0.
In [52]:
for col in new_feats:
    df_train[col].replace([np.inf], np.nan, inplace=True)
    df_train[col].fillna(0, inplace=True)
    
    df_test[col].replace([np.inf], np.nan, inplace=True)
    df_test[col].fillna(0, inplace=True)
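These ratios blow up when the denominator is 0: a nonnegative numerator divided by zero yields `+inf`. Clearing both infinities and filling in one pass is a common habit; a minimal sketch on hypothetical toy series:

```python
import numpy as np
import pandas as pd

rent = pd.Series([100.0, 200.0, 50.0])
adults = pd.Series([2, 0, 1])

ratio = rent / adults  # the zero denominator produces inf, not an error
# replace both signed infinities, then fill the resulting NaN with 0
ratio = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
```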
 

Rooms per family features

In [53]:
new_feats = []
for col in family_size_features:
    new_col_name = 'new_{}_per_{}'.format('rooms', col)
    new_feats.append(new_col_name)
    df_train[new_col_name] = df_train['rooms'] / df_train[col]
    df_test[new_col_name] = df_test['rooms'] / df_test[col]
    
for col in new_feats:
    df_train[col].replace([np.inf], np.nan, inplace=True)
    df_train[col].fillna(0, inplace=True)
    
    df_test[col].replace([np.inf], np.nan, inplace=True)
    df_test[col].fillna(0, inplace=True)
 

Bedrooms per family features

In [54]:
new_feats = []
for col in family_size_features:
    new_col_name = 'new_{}_per_{}'.format('bedrooms', col)
    new_feats.append(new_col_name)
    df_train[new_col_name] = df_train['bedrooms'] / df_train[col]
    df_test[new_col_name] = df_test['bedrooms'] / df_test[col]
    
for col in new_feats:
    df_train[col].replace([np.inf], np.nan, inplace=True)
    df_train[col].fillna(0, inplace=True)
    
    df_test[col].replace([np.inf], np.nan, inplace=True)
    df_test[col].fillna(0, inplace=True)
In [55]:
# check feature counts
print(df_train.shape, df_test.shape)
 
(9557, 220) (23856, 219)
 

Tablets per family features

In [56]:
new_feats = []
for col in family_size_features:
    new_col_name = 'new_{}_per_{}'.format('v18q1', col)
    new_feats.append(new_col_name)
    df_train[new_col_name] = df_train['v18q1'] / df_train[col]
    df_test[new_col_name] = df_test['v18q1'] / df_test[col]
    
for col in new_feats:
    df_train[col].replace([np.inf], np.nan, inplace=True)
    df_train[col].fillna(0, inplace=True)
    
    df_test[col].replace([np.inf], np.nan, inplace=True)
    df_test[col].fillna(0, inplace=True)
 

Phones per family features

In [57]:
new_feats = []
for col in family_size_features:
    new_col_name = 'new_{}_per_{}'.format('qmobilephone', col)
    new_feats.append(new_col_name)
    df_train[new_col_name] = df_train['qmobilephone'] / df_train[col]
    df_test[new_col_name] = df_test['qmobilephone'] / df_test[col]
    
for col in new_feats:
    df_train[col].replace([np.inf], np.nan, inplace=True)
    df_train[col].fillna(0, inplace=True)
    
    df_test[col].replace([np.inf], np.nan, inplace=True)
    df_test[col].fillna(0, inplace=True)
 

rez_esc (Years behind in school) per family features

In [58]:
new_feats = []
for col in family_size_features:
    new_col_name = 'new_{}_per_{}'.format('rez_esc', col)
    new_feats.append(new_col_name)
    df_train[new_col_name] = df_train['rez_esc'] / df_train[col]
    df_test[new_col_name] = df_test['rez_esc'] / df_test[col]
    
for col in new_feats:
    df_train[col].replace([np.inf], np.nan, inplace=True)
    df_train[col].fillna(0, inplace=True)
    
    df_test[col].replace([np.inf], np.nan, inplace=True)
    df_test[col].fillna(0, inplace=True)
In [59]:
df_train['rez_esc_age'] = df_train['rez_esc'] / df_train['age']
df_train['rez_esc_escolari'] = df_train['rez_esc'] / df_train['escolari']

df_test['rez_esc_age'] = df_test['rez_esc'] / df_test['age']
df_test['rez_esc_escolari'] = df_test['rez_esc'] / df_test['escolari']
 

Rich features

  • The author assumes that owning many phones or tablets signals wealth, and creates wealth-related features.
In [60]:
df_train['tabulet_x_qmobilephone'] = df_train['v18q1'] * df_train['qmobilephone']
df_test['tabulet_x_qmobilephone'] = df_test['v18q1'] * df_test['qmobilephone']
 
  • Wall, roof, and floor materials are also key factors.
  • Since these are binary categorical features, multiplying them creates new categorical features from each pairwise combination.
    • "epared1"," =1 if walls are bad"
    • "epared2"," =1 if walls are regular"
    • "epared3"," =1 if walls are good"
    • "etecho1"," =1 if roof are bad"
    • "etecho2"," =1 if roof are regular"
    • "etecho3"," =1 if roof are good"
    • "eviv1"," =1 if floor are bad"
    • "eviv2"," =1 if floor are regular"
    • "eviv3"," =1 if floor are good"
In [61]:
# wall and roof
for col1 in ['epared1','epared2','epared3']:
    for col2 in ['etecho1','etecho2','etecho3']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
        
# wall and floor
for col1 in ['epared1','epared2','epared3']:
    for col2 in ['eviv1','eviv2','eviv3']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
        
        
# roof and floor
for col1 in ['etecho1','etecho2','etecho3']:
    for col2 in ['eviv1','eviv2','eviv3']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
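Multiplying 0/1 indicators is an elementwise logical AND, so each product column marks the rows where both conditions hold (e.g. walls good AND roof good). A toy sketch:

```python
import pandas as pd

# Two hypothetical 0/1 indicators: their product is 1 only where both are 1,
# i.e. an elementwise logical AND.
a = pd.Series([1, 1, 0, 0])
b = pd.Series([1, 0, 1, 0])
both = a * b
```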
In [62]:
# combine all three
for col1 in ['epared1','epared2','epared3']:
    for col2 in ['etecho1','etecho2','etecho3']:
        for col3 in ['eviv1','eviv2','eviv3']:
            new_col_name = 'new_{}_x_{}_x_{}'.format(col1, col2, col3)
            df_train[new_col_name] = df_train[col1] * df_train[col2] * df_train[col3]
            df_test[new_col_name] = df_test[col1] * df_test[col2] * df_test[col3]
In [63]:
print(df_train.shape, df_test.shape)
 
(9557, 322) (23856, 321)
 
  • Combine the electricity-source and cooking-energy features into energy features.
In [64]:
for col1 in ['public', 'planpri', 'noelec', 'coopele']:
    for col2 in ['energcocinar1', 'energcocinar2', 'energcocinar3', 'energcocinar4']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
 
  • Combine the toilet and rubbish-disposal features into other_infra features.
In [65]:
for col1 in ['sanitario1', 'sanitario2', 'sanitario3','sanitario5','sanitario6']:
    for col2 in ['elimbasu1', 'elimbasu2', 'elimbasu3', 'elimbasu4', 'elimbasu5', 'elimbasu6']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
 
  • Combine the water-provision and toilet features into water features.
In [66]:
for col1 in ['abastaguadentro', 'abastaguafuera', 'abastaguano']:
    for col2 in ['sanitario1', 'sanitario2', 'sanitario3','sanitario5','sanitario6']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]      
In [67]:
print(df_train.shape, df_test.shape)
 
(9557, 383) (23856, 382)
 
  • Combine education level and zone (urban/rural) into education_zone features.
In [68]:
for col1 in ['area1', 'area2']:
    for col2 in ['instlevel1', 'instlevel2', 'instlevel3', 'instlevel4', 'instlevel5', 'instlevel6',
                'instlevel7','instlevel8','instlevel9']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]   
 
  • Combine region (lugar) and education level.
In [69]:
for col1 in ['lugar1', 'lugar2', 'lugar3', 'lugar4','lugar5', 'lugar6']:
    for col2 in ['instlevel1', 'instlevel2', 'instlevel3', 'instlevel4', 'instlevel5', 'instlevel6',
                'instlevel7','instlevel8','instlevel9']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]   
In [70]:
print(df_train.shape, df_test.shape)
 
(9557, 455) (23856, 454)
 
  • Multiply television / mobilephone / computer / tablet / refrigerator together into an electronics feature.
  • Add them together into a no_appliances feature.
In [71]:
df_train['electronics'] = df_train['computer'] * df_train['mobilephone'] * df_train['television'] * df_train['v18q'] * df_train['refrig']
df_train['no_appliances'] = df_train['refrig'] + df_train['computer'] + df_train['television'] + df_train['mobilephone'] + df_train['v18q']

df_test['electronics'] = df_test['computer'] * df_test['mobilephone'] * df_test['television'] * df_test['v18q'] * df_test['refrig']
df_test['no_appliances'] = df_test['refrig'] + df_test['computer'] + df_test['television'] + df_test['mobilephone'] + df_test['v18q']
 
  • Combine the wall, floor, and roof material features.
In [72]:
for col1 in ['paredblolad', 'paredzocalo', 'paredpreb', 'pareddes','paredmad','paredzinc','paredfibras','paredother']:
    for col2 in ['pisomoscer','pisocemento','pisoother','pisonatur','pisonotiene','pisomadera']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]   
        
for col1 in ['pisomoscer','pisocemento','pisoother','pisonatur','pisonotiene','pisomadera']:
    for col2 in ['techozinc','techoentrepiso','techocane','techootro']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]   
        
for col1 in ['paredblolad', 'paredzocalo', 'paredpreb', 'pareddes','paredmad','paredzinc','paredfibras','paredother']:
    for col2 in ['techozinc','techoentrepiso','techocane','techootro']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]   
        
for col1 in ['paredblolad', 'paredzocalo', 'paredpreb', 'pareddes','paredmad','paredzinc','paredfibras','paredother']:
    for col2 in ['pisomoscer','pisocemento','pisoother','pisonatur','pisonotiene','pisomadera']:
        for col3 in ['techozinc','techoentrepiso','techocane','techootro']:
            new_col_name = 'new_{}_x_{}_x_{}'.format(col1, col2, col3)
            df_train[new_col_name] = df_train[col1] * df_train[col2] * df_train[col3]
            df_test[new_col_name] = df_test[col1] * df_test[col2] * df_test[col3]  
In [73]:
print(df_train.shape, df_test.shape)
 
(9557, 753) (23856, 752)
 

Drop features with only one value

In [74]:
col_with_only_one_value=[]
for col in df_train.columns:
    if col == 'Target':
        continue
    if df_train[col].value_counts().shape[0] == 1 or df_test[col].value_counts().shape[0] == 1 :
        print(col)
        col_with_only_one_value.append(col)
 
elimbasu5
new_planpri_x_energcocinar1
new_planpri_x_energcocinar2
new_planpri_x_energcocinar3
new_planpri_x_energcocinar4
new_noelec_x_energcocinar2
new_sanitario1_x_elimbasu4
new_sanitario1_x_elimbasu5
new_sanitario1_x_elimbasu6
new_sanitario2_x_elimbasu4
new_sanitario2_x_elimbasu5
new_sanitario2_x_elimbasu6
new_sanitario3_x_elimbasu5
new_sanitario5_x_elimbasu4
new_sanitario5_x_elimbasu5
new_sanitario5_x_elimbasu6
new_sanitario6_x_elimbasu2
new_sanitario6_x_elimbasu4
new_sanitario6_x_elimbasu5
new_sanitario6_x_elimbasu6
new_abastaguafuera_x_sanitario6
new_abastaguano_x_sanitario2
new_abastaguano_x_sanitario6
new_paredblolad_x_pisonatur
new_paredblolad_x_pisonotiene
new_paredzocalo_x_pisoother
new_paredzocalo_x_pisonatur
new_paredpreb_x_pisonatur
new_pareddes_x_pisoother
new_pareddes_x_pisonatur
new_paredmad_x_pisoother
new_paredmad_x_pisonatur
new_paredzinc_x_pisoother
new_paredzinc_x_pisonatur
new_paredfibras_x_pisoother
new_paredfibras_x_pisonatur
new_paredfibras_x_pisonotiene
new_paredfibras_x_pisomadera
new_paredother_x_pisoother
new_paredother_x_pisonatur
new_paredother_x_pisonotiene
new_paredother_x_pisomadera
new_pisocemento_x_techocane
new_pisocemento_x_techootro
new_pisoother_x_techoentrepiso
new_pisoother_x_techocane
new_pisoother_x_techootro
new_pisonatur_x_techozinc
new_pisonatur_x_techoentrepiso
new_pisonatur_x_techocane
new_pisonatur_x_techootro
new_pisonotiene_x_techoentrepiso
new_pisonotiene_x_techocane
new_pisonotiene_x_techootro
new_pisomadera_x_techocane
new_pisomadera_x_techootro
new_paredzocalo_x_techoentrepiso
new_paredzocalo_x_techocane
new_paredzocalo_x_techootro
new_paredpreb_x_techootro
new_pareddes_x_techoentrepiso
new_pareddes_x_techocane
new_pareddes_x_techootro
new_paredmad_x_techocane
new_paredmad_x_techootro
new_paredzinc_x_techoentrepiso
new_paredzinc_x_techocane
new_paredzinc_x_techootro
new_paredfibras_x_techoentrepiso
new_paredfibras_x_techootro
new_paredother_x_techoentrepiso
new_paredother_x_techocane
new_paredother_x_techootro
new_paredblolad_x_pisocemento_x_techocane
new_paredblolad_x_pisocemento_x_techootro
new_paredblolad_x_pisoother_x_techoentrepiso
new_paredblolad_x_pisoother_x_techocane
new_paredblolad_x_pisoother_x_techootro
new_paredblolad_x_pisonatur_x_techozinc
new_paredblolad_x_pisonatur_x_techoentrepiso
new_paredblolad_x_pisonatur_x_techocane
new_paredblolad_x_pisonatur_x_techootro
new_paredblolad_x_pisonotiene_x_techozinc
new_paredblolad_x_pisonotiene_x_techoentrepiso
new_paredblolad_x_pisonotiene_x_techocane
new_paredblolad_x_pisonotiene_x_techootro
new_paredblolad_x_pisomadera_x_techocane
new_paredblolad_x_pisomadera_x_techootro
new_paredzocalo_x_pisomoscer_x_techoentrepiso
new_paredzocalo_x_pisomoscer_x_techocane
new_paredzocalo_x_pisomoscer_x_techootro
new_paredzocalo_x_pisocemento_x_techoentrepiso
new_paredzocalo_x_pisocemento_x_techocane
new_paredzocalo_x_pisocemento_x_techootro
new_paredzocalo_x_pisoother_x_techozinc
new_paredzocalo_x_pisoother_x_techoentrepiso
new_paredzocalo_x_pisoother_x_techocane
new_paredzocalo_x_pisoother_x_techootro
new_paredzocalo_x_pisonatur_x_techozinc
new_paredzocalo_x_pisonatur_x_techoentrepiso
new_paredzocalo_x_pisonatur_x_techocane
new_paredzocalo_x_pisonatur_x_techootro
new_paredzocalo_x_pisonotiene_x_techoentrepiso
new_paredzocalo_x_pisonotiene_x_techocane
new_paredzocalo_x_pisonotiene_x_techootro
new_paredzocalo_x_pisomadera_x_techoentrepiso
new_paredzocalo_x_pisomadera_x_techocane
new_paredzocalo_x_pisomadera_x_techootro
new_paredpreb_x_pisomoscer_x_techootro
new_paredpreb_x_pisocemento_x_techoentrepiso
new_paredpreb_x_pisocemento_x_techocane
new_paredpreb_x_pisocemento_x_techootro
new_paredpreb_x_pisoother_x_techoentrepiso
new_paredpreb_x_pisoother_x_techocane
new_paredpreb_x_pisoother_x_techootro
new_paredpreb_x_pisonatur_x_techozinc
new_paredpreb_x_pisonatur_x_techoentrepiso
new_paredpreb_x_pisonatur_x_techocane
new_paredpreb_x_pisonatur_x_techootro
new_paredpreb_x_pisonotiene_x_techoentrepiso
new_paredpreb_x_pisonotiene_x_techocane
new_paredpreb_x_pisonotiene_x_techootro
new_paredpreb_x_pisomadera_x_techoentrepiso
new_paredpreb_x_pisomadera_x_techocane
new_paredpreb_x_pisomadera_x_techootro
new_pareddes_x_pisomoscer_x_techozinc
new_pareddes_x_pisomoscer_x_techoentrepiso
new_pareddes_x_pisomoscer_x_techocane
new_pareddes_x_pisomoscer_x_techootro
new_pareddes_x_pisocemento_x_techoentrepiso
new_pareddes_x_pisocemento_x_techocane
new_pareddes_x_pisocemento_x_techootro
new_pareddes_x_pisoother_x_techozinc
new_pareddes_x_pisoother_x_techoentrepiso
new_pareddes_x_pisoother_x_techocane
new_pareddes_x_pisoother_x_techootro
new_pareddes_x_pisonatur_x_techozinc
new_pareddes_x_pisonatur_x_techoentrepiso
new_pareddes_x_pisonatur_x_techocane
new_pareddes_x_pisonatur_x_techootro
new_pareddes_x_pisonotiene_x_techoentrepiso
new_pareddes_x_pisonotiene_x_techocane
new_pareddes_x_pisonotiene_x_techootro
new_pareddes_x_pisomadera_x_techozinc
new_pareddes_x_pisomadera_x_techoentrepiso
new_pareddes_x_pisomadera_x_techocane
new_pareddes_x_pisomadera_x_techootro
new_paredmad_x_pisomoscer_x_techocane
new_paredmad_x_pisomoscer_x_techootro
new_paredmad_x_pisocemento_x_techocane
new_paredmad_x_pisocemento_x_techootro
new_paredmad_x_pisoother_x_techozinc
new_paredmad_x_pisoother_x_techoentrepiso
new_paredmad_x_pisoother_x_techocane
new_paredmad_x_pisoother_x_techootro
new_paredmad_x_pisonatur_x_techozinc
new_paredmad_x_pisonatur_x_techoentrepiso
new_paredmad_x_pisonatur_x_techocane
new_paredmad_x_pisonatur_x_techootro
new_paredmad_x_pisonotiene_x_techoentrepiso
new_paredmad_x_pisonotiene_x_techocane
new_paredmad_x_pisonotiene_x_techootro
new_paredmad_x_pisomadera_x_techocane
new_paredmad_x_pisomadera_x_techootro
new_paredzinc_x_pisomoscer_x_techoentrepiso
new_paredzinc_x_pisomoscer_x_techocane
new_paredzinc_x_pisomoscer_x_techootro
new_paredzinc_x_pisocemento_x_techoentrepiso
new_paredzinc_x_pisocemento_x_techocane
new_paredzinc_x_pisocemento_x_techootro
new_paredzinc_x_pisoother_x_techozinc
new_paredzinc_x_pisoother_x_techoentrepiso
new_paredzinc_x_pisoother_x_techocane
new_paredzinc_x_pisoother_x_techootro
new_paredzinc_x_pisonatur_x_techozinc
new_paredzinc_x_pisonatur_x_techoentrepiso
new_paredzinc_x_pisonatur_x_techocane
new_paredzinc_x_pisonatur_x_techootro
new_paredzinc_x_pisonotiene_x_techoentrepiso
new_paredzinc_x_pisonotiene_x_techocane
new_paredzinc_x_pisonotiene_x_techootro
new_paredzinc_x_pisomadera_x_techoentrepiso
new_paredzinc_x_pisomadera_x_techocane
new_paredzinc_x_pisomadera_x_techootro
new_paredfibras_x_pisomoscer_x_techoentrepiso
new_paredfibras_x_pisomoscer_x_techocane
new_paredfibras_x_pisomoscer_x_techootro
new_paredfibras_x_pisocemento_x_techoentrepiso
new_paredfibras_x_pisocemento_x_techocane
new_paredfibras_x_pisocemento_x_techootro
new_paredfibras_x_pisoother_x_techozinc
new_paredfibras_x_pisoother_x_techoentrepiso
new_paredfibras_x_pisoother_x_techocane
new_paredfibras_x_pisoother_x_techootro
new_paredfibras_x_pisonatur_x_techozinc
new_paredfibras_x_pisonatur_x_techoentrepiso
new_paredfibras_x_pisonatur_x_techocane
new_paredfibras_x_pisonatur_x_techootro
new_paredfibras_x_pisonotiene_x_techozinc
new_paredfibras_x_pisonotiene_x_techoentrepiso
new_paredfibras_x_pisonotiene_x_techocane
new_paredfibras_x_pisonotiene_x_techootro
new_paredfibras_x_pisomadera_x_techozinc
new_paredfibras_x_pisomadera_x_techoentrepiso
new_paredfibras_x_pisomadera_x_techocane
new_paredfibras_x_pisomadera_x_techootro
new_paredother_x_pisomoscer_x_techoentrepiso
new_paredother_x_pisomoscer_x_techocane
new_paredother_x_pisomoscer_x_techootro
new_paredother_x_pisocemento_x_techoentrepiso
new_paredother_x_pisocemento_x_techocane
new_paredother_x_pisocemento_x_techootro
new_paredother_x_pisoother_x_techozinc
new_paredother_x_pisoother_x_techoentrepiso
new_paredother_x_pisoother_x_techocane
new_paredother_x_pisoother_x_techootro
new_paredother_x_pisonatur_x_techozinc
new_paredother_x_pisonatur_x_techoentrepiso
new_paredother_x_pisonatur_x_techocane
new_paredother_x_pisonatur_x_techootro
new_paredother_x_pisonotiene_x_techozinc
new_paredother_x_pisonotiene_x_techoentrepiso
new_paredother_x_pisonotiene_x_techocane
new_paredother_x_pisonotiene_x_techootro
new_paredother_x_pisomadera_x_techozinc
new_paredother_x_pisomadera_x_techoentrepiso
new_paredother_x_pisomadera_x_techocane
new_paredother_x_pisomadera_x_techootro
In [75]:
df_train.drop(col_with_only_one_value, axis=1, inplace=True)
df_test.drop(col_with_only_one_value, axis=1, inplace=True)
 

Check that the train and test sets end up with the same feature columns.

In [76]:
col_train = np.array(sorted([col for col in df_train.columns if col != 'Target']))
col_test = np.array(sorted(df_test.columns))
In [77]:
(col_train == col_test).sum() == len(col_train)
Out[77]:
True
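The same check can be phrased as a set difference, which also reveals which columns differ when it fails; a sketch with hypothetical tiny frames:

```python
import pandas as pd

# Hypothetical train/test frames whose feature columns match apart from Target.
train = pd.DataFrame({'a': [1], 'b': [2], 'Target': [1]})
test = pd.DataFrame({'b': [2], 'a': [1]})

# an empty symmetric difference means the feature columns line up
mismatch = (set(train.columns) - {'Target'}).symmetric_difference(test.columns)
```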
 

2.4 Aggregate features

In [78]:
def max_min(x):
    return x.max() - x.min()
In [79]:
agg_train = pd.DataFrame()
agg_test = pd.DataFrame()

for item in tqdm(family_size_features):
    for i, function in enumerate(['mean','std','min','max','sum','count', max_min]):
        group_train = df_train[item].groupby(df_train['idhogar']).agg(function)
        group_test = df_test[item].groupby(df_test['idhogar']).agg(function)
        if i == 6:
            new_col = item + '_new_' + 'max_min'
        else:
            new_col = item + '_new_' + function
            
        agg_train[new_col] = group_train
        agg_test[new_col] = group_test
        
print('new aggregate train set has {} rows, and {} features'.format(agg_train.shape[0], agg_train.shape[1]))
print('new aggregate test set has {} rows, and {} features'.format(agg_test.shape[0], agg_test.shape[1]))
 
100%|██████████| 15/15 [00:09<00:00,  1.63it/s]
 
new aggregate train set has 2988 rows, and 105 features
new aggregate test set has 7352 rows, and 105 features
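As an aside, the seven reducers can also be computed in a single groupby pass: `agg` accepts a list mixing built-in names and user-defined functions, and names each output column after the function. A minimal sketch with toy data (household ids and values are made up):

```python
import pandas as pd

def max_min(x):
    return x.max() - x.min()

toy = pd.DataFrame({'idhogar': ['h1', 'h1', 'h2'],
                    'age': [10, 40, 25]})

# One groupby pass instead of seven; the custom reducer's output column
# is named after the function ('max_min') automatically.
agg = toy.groupby('idhogar')['age'].agg(['mean', 'std', 'min', 'max', 'sum', 'count', max_min])
print(list(agg.columns))  # → ['mean', 'std', 'min', 'max', 'sum', 'count', 'max_min']
print(agg.loc['h1', 'max_min'])  # → 30
```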
 
In [80]:
# declare the list of indicator features to aggregate per household
aggr_list = ['rez_esc', 'dis', 'male', 'female', 
              'estadocivil1', 'estadocivil2', 'estadocivil3', 'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7', 
              'parentesco2', 'parentesco3', 'parentesco4', 'parentesco5', 'parentesco6', 'parentesco7', 'parentesco8', 'parentesco9', 
             'parentesco10', 'parentesco11', 'parentesco12',
              'instlevel1', 'instlevel2', 'instlevel3', 'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 'instlevel9',
             'epared1', 'epared2', 'epared3', 'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3', 'refrig', 'television', 'mobilephone',
             'area1', 'area2', 'v18q', 'edjef']
In [81]:
for item in tqdm(aggr_list):
    for function in ['count','sum']:
        group_train = df_train[item].groupby(df_train['idhogar']).agg(function)
        group_test = df_test[item].groupby(df_test['idhogar']).agg(function)
        new_col = item + '_new1_' + function
        agg_train[new_col] = group_train
        agg_test[new_col] = group_test
print('new aggregate train set has {} rows, and {} features'.format(agg_train.shape[0], agg_train.shape[1]))
print('new aggregate test set has {} rows, and {} features'.format(agg_test.shape[0], agg_test.shape[1]))
    
 
100%|██████████| 47/47 [00:01<00:00, 40.94it/s]
 
new aggregate train set has 2988 rows, and 199 features
new aggregate test set has 7352 rows, and 199 features
 
In [82]:
aggr_list = ['escolari', 'age', 'escolari_age', 'dependency', 'bedrooms', 'overcrowding', 'rooms', 'qmobilephone', 'v18q1']

for item in tqdm(aggr_list):
    for function in ['mean','std','min','max','sum','count', max_min]:
        group_train = df_train[item].groupby(df_train['idhogar']).agg(function)
        group_test = df_test[item].groupby(df_test['idhogar']).agg(function)
        # NOTE: `i` here is left over from the earlier cell (it ended at 6), so
        # this branch is always taken: every reducer writes to the same
        # '<item>_new2_max_min' column and only the last one (max_min)
        # survives, which is why only 9 new columns are added below.
        if i == 6:
            new_col = item + '_new2_' + 'max_min'
        else:
            new_col = item + '_new2_' + function            
        agg_train[new_col] = group_train
        agg_test[new_col] = group_test
        
print('new aggregate train set has {} rows, and {} features'.format(agg_train.shape[0], agg_train.shape[1]))
print('new aggregate test set has {} rows, and {} features'.format(agg_test.shape[0], agg_test.shape[1]))
 
100%|██████████| 9/9 [00:05<00:00,  1.56it/s]
 
new aggregate train set has 2988 rows, and 208 features
new aggregate test set has 7352 rows, and 208 features
 
In [83]:
agg_test = agg_test.reset_index()
agg_train = agg_train.reset_index()

train_agg = pd.merge(df_train, agg_train, on='idhogar')
test = pd.merge(df_test, agg_test, on='idhogar')

# fill all NaN with 0
train_agg.fillna(value=0, inplace=True)
test.fillna(value=0, inplace=True)

print('train shape:', train_agg.shape, 'test shape:', test.shape)
 
train shape: (9557, 733) test shape: (23856, 732)
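`pd.merge` defaults to an inner join, so any household id missing from either side silently drops rows. When that is a concern, `how='left'` keeps every person row and `validate='many_to_one'` raises if the aggregate table unexpectedly contains duplicate ids. A minimal sketch (frame contents are made up):

```python
import pandas as pd

people = pd.DataFrame({'idhogar': ['h1', 'h1', 'h2'],
                       'age': [10, 40, 25]})
agg = pd.DataFrame({'idhogar': ['h1', 'h2'],
                    'age_new_mean': [25.0, 25.0]})

# Left join keeps all person rows; validate guards the key relationship.
merged = pd.merge(people, agg, on='idhogar', how='left', validate='many_to_one')
print(merged.shape)  # → (3, 3)
```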
In [84]:
aggr_list = ['rez_esc', 'dis', 'male', 'female', 
                  'estadocivil1', 'estadocivil2', 'estadocivil3', 'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7', 
                  'parentesco2', 'parentesco3', 'parentesco4', 'parentesco5', 'parentesco6', 'parentesco7', 'parentesco8', 'parentesco9', 'parentesco10', 
                  'parentesco11', 'parentesco12',
                  'instlevel1', 'instlevel2', 'instlevel3', 'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 'instlevel9',
                 'epared1', 'epared2', 'epared3', 'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3', 'refrig', 'television', 'mobilephone',
            'area1', 'area2', 'v18q', 'edjef']
In [85]:
for lugar in ['lugar1', 'lugar2', 'lugar3', 'lugar4', 'lugar5','lugar6']:
    group_train = df_train[[lugar, 'idhogar'] + aggr_list].groupby([lugar,'idhogar']).sum().reset_index()
    group_train.columns = [lugar, 'idhogar'] + ['new3_{}_idhogar_{}'.format(lugar,col) for col in group_train][2:]

    group_test = df_test[[lugar, 'idhogar'] + aggr_list].groupby([lugar,'idhogar']).sum().reset_index()
    group_test.columns = [lugar, 'idhogar'] + ['new3_{}_idhogar_{}'.format(lugar,col) for col in group_test][2:]
    
    train_agg = pd.merge(train_agg, group_train, on =[lugar, 'idhogar'])
    test = pd.merge(test, group_test, on = [lugar, 'idhogar'])

print('train shape:', train_agg.shape, 'test shape:', test.shape)
 
train shape: (9557, 1015) test shape: (23856, 1014)
In [86]:
aggr_list = ['rez_esc', 'dis', 'male', 'female', 
                  'estadocivil1', 'estadocivil2', 'estadocivil3', 'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7', 
                  'parentesco2', 'parentesco3', 'parentesco4', 'parentesco5', 'parentesco6', 'parentesco7', 'parentesco8', 'parentesco9', 'parentesco10', 
                  'parentesco11', 'parentesco12',
                  'instlevel1', 'instlevel2', 'instlevel3', 'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 'instlevel9',
                 'epared1', 'epared2', 'epared3', 'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3', 'refrig', 'television', 'mobilephone',
            'area1', 'area2', 'v18q', 'edjef']

for lugar in ['lugar1', 'lugar2', 'lugar3', 'lugar4', 'lugar5','lugar6']:
    group_train = df_train[[lugar, 'idhogar'] + aggr_list].groupby([lugar,'idhogar']).sum().reset_index()
    group_train.columns = [lugar, 'idhogar'] + ['new4_{}_idhogar_{}'.format(lugar,col) for col in group_train][2:]

    group_test = df_test[[lugar, 'idhogar'] + aggr_list].groupby([lugar,'idhogar']).sum().reset_index()
    group_test.columns = [lugar, 'idhogar'] + ['new4_{}_idhogar_{}'.format(lugar,col) for col in group_test][2:]
    
    train_agg = pd.merge(train_agg, group_train, on =[lugar, 'idhogar'])
    test = pd.merge(test, group_test, on = [lugar, 'idhogar'])

print('train shape:', train_agg.shape, 'test shape:', test.shape)
 
train shape: (9557, 1297) test shape: (23856, 1296)
In [87]:
cols_nums = ['age', 'meaneduc', 'dependency', 
             'hogar_nin', 'hogar_adul', 'hogar_mayor', 'hogar_total',
             'bedrooms', 'overcrowding']

for function in tqdm(['mean','std','min','max','sum','count', max_min]):
    for lugar in ['lugar1', 'lugar2', 'lugar3', 'lugar4', 'lugar5','lugar6']:
        # NOTE: this loop aggregates `aggr_list` again rather than the
        # `cols_nums` defined above, and when `function` is the max_min
        # callable its repr leaks into the new column names (hence the
        # '<function max_min at 0x...>' names in the output further down).
        group_train = df_train[[lugar, 'idhogar'] + aggr_list].groupby([lugar,'idhogar']).agg(function).reset_index()
        group_train.columns = [lugar, 'idhogar'] + ['new5_{}_idhogar_{}_{}'.format(lugar, col, function) for col in group_train][2:]        

        group_test = df_test[[lugar, 'idhogar'] + aggr_list].groupby([lugar,'idhogar']).agg(function).reset_index()
        group_test.columns = [lugar, 'idhogar'] + ['new5_{}_idhogar_{}_{}'.format(lugar, col, function) for col in group_test][2:]        

        train_agg = pd.merge(train_agg, group_train, on =[lugar, 'idhogar'])
        test = pd.merge(test, group_test, on = [lugar, 'idhogar'])

print('train shape:', train_agg.shape, 'test shape:', test.shape)
 
100%|██████████| 7/7 [02:33<00:00, 21.93s/it]
 
train shape: (9557, 3271) test shape: (23856, 3270)
 
 
  • According to the data description, only the heads of household are used in scoring.
  • All household members are included in test + sample submission, but only the heads of household are scored.
In [88]:
train = train_agg.query('parentesco1 == 1').copy()  # .copy() avoids SettingWithCopyWarning in the next cell
In [89]:
train['dependency'].replace(np.inf, 0, inplace=True)
test['dependency'].replace(np.inf, 0, inplace=True)
In [90]:
submission = test[['Id']]

# drop unneeded columns to reduce dimensionality
train.drop(columns=['idhogar','Id', 'agesq', 'hogar_adul', 'SQBescolari', 'SQBage', 
                    'SQBhogar_total', 'SQBedjefe', 'SQBhogar_nin', 'SQBovercrowding', 'SQBdependency', 'SQBmeaned'], inplace=True )

test.drop(columns=['idhogar','Id', 'agesq', 'hogar_adul', 'SQBescolari', 'SQBage', 
                    'SQBhogar_total', 'SQBedjefe', 'SQBhogar_nin', 'SQBovercrowding', 'SQBdependency', 'SQBmeaned'], inplace=True )

correlation = train.corr()
correlation = correlation['Target'].sort_values(ascending=False)
In [91]:
print('final_data size', train.shape, test.shape)
 
final_data size (2973, 3259) (23856, 3258)
In [92]:
print(f'Top 20 positive correlations: \n {correlation.head(20)}')
 
Top 20 positive correlations: 
 Target                            1.000000
new5_lugar3_idhogar_edjef_max     0.334254
new5_lugar4_idhogar_edjef_max     0.334254
new5_lugar6_idhogar_edjef_max     0.334254
new5_lugar2_idhogar_edjef_max     0.334254
new5_lugar1_idhogar_edjef_max     0.334254
new5_lugar5_idhogar_edjef_max     0.334254
new5_lugar6_idhogar_edjef_mean    0.333873
new5_lugar5_idhogar_edjef_mean    0.333873
new5_lugar2_idhogar_edjef_mean    0.333873
new5_lugar4_idhogar_edjef_mean    0.333873
new5_lugar1_idhogar_edjef_mean    0.333873
new5_lugar3_idhogar_edjef_mean    0.333873
edjef                             0.333791
new5_lugar5_idhogar_edjef_min     0.333791
new5_lugar3_idhogar_edjef_min     0.333791
new5_lugar1_idhogar_edjef_min     0.333791
new5_lugar4_idhogar_edjef_min     0.333791
new5_lugar2_idhogar_edjef_min     0.333791
new5_lugar6_idhogar_edjef_min     0.333791
Name: Target, dtype: float64
In [93]:
print(f'Top 20 negative correlations: \n {correlation.tail(20)}')
 
Top 20 negative correlations: 
 new5_lugar5_idhogar_television_<function max_min at 0x7f61ca6fc0d0>    NaN
new5_lugar5_idhogar_mobilephone_<function max_min at 0x7f61ca6fc0d0>   NaN
new5_lugar5_idhogar_area1_<function max_min at 0x7f61ca6fc0d0>         NaN
new5_lugar5_idhogar_area2_<function max_min at 0x7f61ca6fc0d0>         NaN
new5_lugar5_idhogar_v18q_<function max_min at 0x7f61ca6fc0d0>          NaN
new5_lugar6_idhogar_epared1_<function max_min at 0x7f61ca6fc0d0>       NaN
new5_lugar6_idhogar_epared2_<function max_min at 0x7f61ca6fc0d0>       NaN
new5_lugar6_idhogar_epared3_<function max_min at 0x7f61ca6fc0d0>       NaN
new5_lugar6_idhogar_etecho1_<function max_min at 0x7f61ca6fc0d0>       NaN
new5_lugar6_idhogar_etecho2_<function max_min at 0x7f61ca6fc0d0>       NaN
new5_lugar6_idhogar_etecho3_<function max_min at 0x7f61ca6fc0d0>       NaN
new5_lugar6_idhogar_eviv1_<function max_min at 0x7f61ca6fc0d0>         NaN
new5_lugar6_idhogar_eviv2_<function max_min at 0x7f61ca6fc0d0>         NaN
new5_lugar6_idhogar_eviv3_<function max_min at 0x7f61ca6fc0d0>         NaN
new5_lugar6_idhogar_refrig_<function max_min at 0x7f61ca6fc0d0>        NaN
new5_lugar6_idhogar_television_<function max_min at 0x7f61ca6fc0d0>    NaN
new5_lugar6_idhogar_mobilephone_<function max_min at 0x7f61ca6fc0d0>   NaN
new5_lugar6_idhogar_area1_<function max_min at 0x7f61ca6fc0d0>         NaN
new5_lugar6_idhogar_area2_<function max_min at 0x7f61ca6fc0d0>         NaN
new5_lugar6_idhogar_v18q_<function max_min at 0x7f61ca6fc0d0>          NaN
Name: Target, dtype: float64
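The NaN correlations above are expected: every member of a household shares household-level values such as epared/etecho/eviv, so the group-wise max_min columns are constant zero, and Pearson correlation is undefined for a zero-variance column. One way to drop such columns before modeling, sketched on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'Target': [1, 2, 3],
                    'const_col': [0, 0, 0],   # zero variance → NaN correlation
                    'useful_col': [3, 1, 2]})

# nunique() == 1 flags constant (zero-variance) columns.
constant = [c for c in toy.columns if toy[c].nunique() == 1]
toy = toy.drop(columns=constant)
print(list(toy.columns))  # → ['Target', 'useful_col']
```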

 


View the source code on GitHub