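This post covers Costa Rican Household Poverty.
The goal is to predict, for the Inter-American Development Bank, the income qualification of some of the world's poorest households.
The world's poorest families often struggle to prove that they qualify, so in Latin America an algorithm is used to verify income qualification.
One example is the Proxy Means Test (PMT), which considers a family's observable household attributes, such as the material of the walls and ceiling or the assets found in the home.
Various features are provided on that basis, and we use LGBMClassifier to predict income qualification.
This transcription follows 이유한's code.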
Contents
Costa Rican Household Poverty (1)
1. Check datasets
1.1 Read datasets
1.2 Make description df
1.3 Check null data
1.4 Fill missing values
Costa Rican Household Poverty (2)
2. Feature Engineering
2.1 Object features
2.1.1 dependency
2.1.2 edjefe
2.1.3 edjefa
2.1.4 roof and electricity
2.2 Extracting categorical features
2.3 Creating new features from continuous features
2.3.1 Extracting continuous feature columns
2.3.2 Creating new features
2.3.3 Ratio of rent to family features
2.3.4 Ratio of rooms to family features
2.3.5 Ratio of bedrooms to family features
2.3.6 Ratio of tablets to family features
2.3.7 Ratio of phones to family features
2.3.8 Ratio of years behind in school to family features
2.3.9 Rich features
2.4 Aggregation features
Costa Rican Household Poverty (3)
3. Feature Selection Using shap
4. Model Development
4.1 Prediction and feature importance with LGB
4.2 Randomized Search
Costa Rican Household Poverty Level Prediction
2. Feature Engineering
2.1 Object features
features_object = [col for col in df_train.columns if df_train[col].dtype == 'object']
features_object
dependency
df_train['dependency'] = np.sqrt(df_train['SQBdependency'])
df_test['dependency'] = np.sqrt(df_test['SQBdependency'])
edjefe
- Years of education of the male head of household, where available.
- Replace yes with 1 and no with 0.
def replace_edjefe(x):
    if x == 'yes':
        return 1
    elif x == 'no':
        return 0
    else:
        return x
df_train['edjefe'] = df_train['edjefe'].apply(replace_edjefe).astype(float)
df_test['edjefe'] = df_test['edjefe'].apply(replace_edjefe).astype(float)
edjefa
- Years of education of the female head of household, where available.
- Replace yes with 1 and no with 0.
def replace_edjefa(x):
    if x == 'yes':
        return 1
    elif x == 'no':
        return 0
    else:
        return x
df_train['edjefa'] = df_train['edjefa'].apply(replace_edjefa).astype(float)
df_test['edjefa'] = df_test['edjefa'].apply(replace_edjefa).astype(float)
# Create a feature holding the larger of the female and male heads' years of education.
df_train['edjef'] = np.max(df_train[['edjefa','edjefe']], axis=1)
df_test['edjef'] = np.max(df_test[['edjefa','edjefe']], axis=1)
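The yes/no replacement and the row-wise maximum can be checked on a small frame. This is only a sketch: the `toy` DataFrame and its values are made up for illustration, and `Series.replace` is used here as a vectorized stand-in for the `apply` calls above.

```python
import pandas as pd

# Toy data (hypothetical values): 'yes'/'no' mixed with numeric strings
toy = pd.DataFrame({'edjefe': ['yes', 'no', '6'],
                    'edjefa': ['no', '11', 'no']})

for col in ['edjefe', 'edjefa']:
    # Map yes -> 1 and no -> 0 (as strings), then cast the column to float
    toy[col] = toy[col].replace({'yes': '1', 'no': '0'}).astype(float)

# Larger of the two head-of-household education values per row
toy['edjef'] = toy[['edjefa', 'edjefe']].max(axis=1)
print(toy)
```

Calling `.max(axis=1)` on the two columns gives the same row-wise maximum as `np.max(..., axis=1)`.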
roof and electricity
# Initialize
df_train['roof_waste_material'] = np.nan
df_test['roof_waste_material'] = np.nan
df_train['electricity_other'] = np.nan
df_test['electricity_other'] = np.nan
def fill_roof_exception(x):
    # 1 when none of the listed roof materials is flagged
    if (x['techozinc'] == 0) and (x['techoentrepiso'] == 0) and (x['techocane'] == 0) and (x['techootro'] == 0):
        return 1
    else:
        return 0
def fill_no_electricity(x):
    # 1 when none of the listed electricity sources is flagged
    if (x['public'] == 0) and (x['planpri'] == 0) and (x['noelec'] == 0) and (x['coopele'] == 0):
        return 1
    else:
        return 0
df_train['roof_waste_material'] = df_train.apply(lambda x : fill_roof_exception(x), axis=1)
df_test['roof_waste_material'] = df_test.apply(lambda x : fill_roof_exception(x), axis=1)
df_train['electricity_other'] = df_train.apply(lambda x : fill_no_electricity(x), axis=1)
df_test['electricity_other'] = df_test.apply(lambda x : fill_no_electricity(x), axis=1)
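Row-wise `apply` is slow on wide frames, so the same flag can be computed column-wise. The sketch below assumes the roof indicators are 0/1 values, so that a row sum of 0 means every indicator is 0; the `toy` frame is hypothetical.

```python
import pandas as pd

# Toy one-hot roof columns (hypothetical rows)
toy = pd.DataFrame({'techozinc':      [1, 0, 0],
                    'techoentrepiso': [0, 0, 0],
                    'techocane':      [0, 0, 1],
                    'techootro':      [0, 0, 0]})

roof_cols = ['techozinc', 'techoentrepiso', 'techocane', 'techootro']
# 1 when every listed roof indicator is 0, else 0
toy['roof_waste_material'] = (toy[roof_cols].sum(axis=1) == 0).astype(int)
print(toy['roof_waste_material'].tolist())
```

The same pattern would apply to the electricity columns.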
2.2 Extracting categorical features
binary_cat_features = [col for col in df_train.columns if df_train[col].value_counts().shape[0] == 2]
len(binary_cat_features)  # number of binary categorical features
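As a small illustration of the binary-column test, `Series.nunique()` counts distinct non-null values just as `value_counts().shape[0]` does. The `toy` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical frame: two binary columns and one count column
toy = pd.DataFrame({'male':  [0, 1, 1, 0],
                    'rooms': [3, 4, 5, 3],
                    'area1': [1, 1, 0, 1]})

# Columns with exactly two distinct non-null values are treated as binary
binary_cols = [c for c in toy.columns if toy[c].nunique() == 2]
print(binary_cols)
```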
2.3 Creating new features from continuous features
Extracting continuous feature columns
continuous_features = [col for col in df_train.columns if col not in binary_cat_features]  # exclude binary columns (exactly 2 unique values)
continuous_features = [col for col in continuous_features if col not in features_object]
continuous_features = [col for col in continuous_features if col not in ['Id', 'Target','idhogar']]
print('There are {} continuous features'.format(len(continuous_features)))
for col in continuous_features:
    print('{}: {}'.format(col, description_ko.loc[description_ko['varname'] == col, 'description'].values))
- hhsize : household size
- tamhog : size of the household
The two features have the same meaning, so we drop tamhog.
df_train.drop('tamhog', axis=1, inplace=True)
df_test.drop('tamhog', axis=1, inplace=True)
Creating new features
- Squared features
  - There are many squared features. Tree models such as LightGBM do not actually need them. However, this kernel uses LightGBM as a feature filter and an entity-embedding model as the classifier, so we leave them in for now.
- Family features
  - hogar_nin, hogar_adul, hogar_mayor, hogar_total, r4h1, r4h2, r4h3, r4m1, r4m2, r4m3, r4t1, r4t2, r4t3, tamhog, tamviv, rez_esc, escolari
  - Family size features (extract counts and compute ratios)
df_train['adult'] = df_train['hogar_adul'] - df_train['hogar_mayor']  # number of adults under 65
df_train['dependency_count'] = df_train['hogar_nin'] + df_train['hogar_mayor']  # dependents: people aged 65+ plus children aged 0-19
df_train['dependency'] = df_train['dependency_count'] / df_train['adult']  # dependency rate: dependents / adults
df_train['child_percent'] = df_train['hogar_nin'] / df_train['hogar_total']  # share of children
df_train['elder_percent'] = df_train['hogar_mayor'] / df_train['hogar_total']  # share of elderly
df_train['adult_percent'] = df_train['hogar_adul'] / df_train['hogar_total']  # share of adults
df_train['males_younger_12_years_percent'] = df_train['r4h1'] / df_train['hogar_total']  # share of males younger than 12
df_train['males_older_12_years_percent'] = df_train['r4h2'] / df_train['hogar_total']  # share of males aged 12 and older
df_train['males_percent'] = df_train['r4h3'] / df_train['hogar_total']  # share of males in the household
df_train['females_younger_12_years_percent'] = df_train['r4m1'] / df_train['hogar_total']  # share of females younger than 12
df_train['females_older_12_years_percent'] = df_train['r4m2'] / df_train['hogar_total']  # share of females aged 12 and older
df_train['females_percent'] = df_train['r4m3'] / df_train['hogar_total']  # share of females in the household
df_train['persons_younger_12_years_percent'] = df_train['r4t1'] / df_train['hogar_total']  # share of persons younger than 12
df_train['persons_older_12_years_percent'] = df_train['r4t2'] / df_train['hogar_total']  # share of persons aged 12 and older
df_train['persons_percent'] = df_train['r4t3'] / df_train['hogar_total']  # total persons (r4t3) relative to hogar_total
# Same ratios of males, females, and all persons for the test set
df_test['adult'] = df_test['hogar_adul'] - df_test['hogar_mayor']  # number of adults under 65
df_test['dependency_count'] = df_test['hogar_nin'] + df_test['hogar_mayor']  # dependents: people aged 65+ plus children aged 0-19
df_test['dependency'] = df_test['dependency_count'] / df_test['adult']  # dependency rate: dependents / adults
df_test['child_percent'] = df_test['hogar_nin'] / df_test['hogar_total']  # share of children
df_test['elder_percent'] = df_test['hogar_mayor'] / df_test['hogar_total']  # share of elderly
df_test['adult_percent'] = df_test['hogar_adul'] / df_test['hogar_total']  # share of adults
df_test['males_younger_12_years_percent'] = df_test['r4h1'] / df_test['hogar_total']  # share of males younger than 12
df_test['males_older_12_years_percent'] = df_test['r4h2'] / df_test['hogar_total']  # share of males aged 12 and older
df_test['males_percent'] = df_test['r4h3'] / df_test['hogar_total']  # share of males in the household
df_test['females_younger_12_years_percent'] = df_test['r4m1'] / df_test['hogar_total']  # share of females younger than 12
df_test['females_older_12_years_percent'] = df_test['r4m2'] / df_test['hogar_total']  # share of females aged 12 and older
df_test['females_percent'] = df_test['r4m3'] / df_test['hogar_total']  # share of females in the household
df_test['persons_younger_12_years_percent'] = df_test['r4t1'] / df_test['hogar_total']  # share of persons younger than 12
df_test['persons_older_12_years_percent'] = df_test['r4t2'] / df_test['hogar_total']  # share of persons aged 12 and older
df_test['persons_percent'] = df_test['r4t3'] / df_test['hogar_total']  # total persons (r4t3) relative to hogar_total
# Ratios of males, females, and all persons to household size (hhsize)
df_train['males_younger_12_years_in_householde_size'] = df_train['r4h1'] / df_train['hhsize']
df_train['males_older_12_years_in_householde_size'] = df_train['r4h2'] / df_train['hhsize']
df_train['males_in_householde_size'] = df_train['r4h3'] / df_train['hhsize']
df_train['females_younger_12_years_in_householde_size'] = df_train['r4m1'] / df_train['hhsize']
df_train['females_older_12_years_in_householde_size'] = df_train['r4m2'] / df_train['hhsize']
df_train['females_in_householde_size'] = df_train['r4m3'] / df_train['hhsize']
df_train['persons_younger_12_years_in_householde_size'] = df_train['r4t1'] / df_train['hhsize']
df_train['persons_older_12_years_in_householde_size'] = df_train['r4t2'] / df_train['hhsize']
df_train['persons_in_householde_size'] = df_train['r4t3'] / df_train['hhsize']
# Same household-size ratios for the test set
df_test['males_younger_12_years_in_householde_size'] = df_test['r4h1'] / df_test['hhsize']
df_test['males_older_12_years_in_householde_size'] = df_test['r4h2'] / df_test['hhsize']
df_test['males_in_householde_size'] = df_test['r4h3'] / df_test['hhsize']
df_test['females_younger_12_years_in_householde_size'] = df_test['r4m1'] / df_test['hhsize']
df_test['females_older_12_years_in_householde_size'] = df_test['r4m2'] / df_test['hhsize']
df_test['females_in_householde_size'] = df_test['r4m3'] / df_test['hhsize']
df_test['persons_younger_12_years_in_householde_size'] = df_test['r4t1'] / df_test['hhsize']
df_test['persons_older_12_years_in_householde_size'] = df_test['r4t2'] / df_test['hhsize']
df_test['persons_in_householde_size'] = df_test['r4t3'] / df_test['hhsize']
# Average of the bedroom and room overcrowding indicators (hacdor, hacapo)
df_train['overcrowding_room_and_bedroom'] = (df_train['hacdor'] + df_train['hacapo'])/2
df_test['overcrowding_room_and_bedroom'] = (df_test['hacdor'] + df_test['hacapo'])/2
# Years of schooling relative to age
df_train['escolari_age'] = df_train['escolari'] / df_train['age']
df_test['escolari_age'] = df_test['escolari'] / df_test['age']
df_train['age_12_19'] = df_train['hogar_nin'] - df_train['r4t1']
df_test['age_12_19'] = df_test['hogar_nin'] - df_test['r4t1']
# Per-capita ratios based on the number of people living in the household (tamviv)
df_train['phones-per-capita'] = df_train['qmobilephone'] / df_train['tamviv']
df_train['tablets-per-capita'] = df_train['v18q1'] / df_train['tamviv']
df_train['rooms-per-capita'] = df_train['rooms'] / df_train['tamviv']
df_train['rent-per-capita'] = df_train['v2a1'] / df_train['tamviv']
df_test['phones-per-capita'] = df_test['qmobilephone'] / df_test['tamviv']
df_test['tablets-per-capita'] = df_test['v18q1'] / df_test['tamviv']
df_test['rooms-per-capita'] = df_test['rooms'] / df_test['tamviv']
df_test['rent-per-capita'] = df_test['v2a1'] / df_test['tamviv']
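Every ratio above is written twice, once for df_train and once for df_test, which is how the two copies can drift apart. One way to keep them in sync is to wrap the computations in a function and apply it to both frames. The helper name `add_family_ratios` and the toy frames are hypothetical, and only a few of the ratios are reproduced here.

```python
import pandas as pd

def add_family_ratios(df):
    """Return a copy of df with a few household-composition ratios added."""
    out = df.copy()
    out['adult'] = out['hogar_adul'] - out['hogar_mayor']            # adults under 65
    out['dependency_count'] = out['hogar_nin'] + out['hogar_mayor']  # children plus elderly
    out['child_percent'] = out['hogar_nin'] / out['hogar_total']
    out['elder_percent'] = out['hogar_mayor'] / out['hogar_total']
    return out

# Hypothetical stand-ins for df_train and df_test
toy_train = pd.DataFrame({'hogar_nin': [2, 0], 'hogar_adul': [2, 2],
                          'hogar_mayor': [0, 1], 'hogar_total': [4, 2]})
toy_test = pd.DataFrame({'hogar_nin': [1], 'hogar_adul': [3],
                         'hogar_mayor': [1], 'hogar_total': [4]})

# The same function runs on both frames, so the definitions cannot diverge
toy_train, toy_test = [add_family_ratios(d) for d in (toy_train, toy_test)]
print(toy_train)
```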
- We can see that "Total persons in the household" and "# of total individuals in the household" are not always equal.
- This is a little odd, but we leave it for now.
(df_train['hogar_total'] == df_train['r4t3']).sum()
Ratio of rent to family features
Rent per family features
family_size_features = ['adult', 'hogar_adul', 'hogar_mayor', 'hogar_nin', 'hogar_total', 'r4h1',
                        'r4h2', 'r4h3', 'r4m1', 'r4m2', 'r4m3', 'r4t1', 'r4t2', 'r4t3', 'hhsize']
new_feats = []
for col in family_size_features:
    new_col_name = 'new_{}_per_{}'.format('v2a1', col)
    new_feats.append(new_col_name)
    df_train[new_col_name] = df_train['v2a1'] / df_train[col]
    df_test[new_col_name] = df_test['v2a1'] / df_test[col]
- Ratio features can go to infinity. Such values are set to 0.
for col in new_feats:
    # Assign the result back instead of calling inplace methods on a column selection
    df_train[col] = df_train[col].replace([np.inf], np.nan).fillna(0)
    df_test[col] = df_test[col].replace([np.inf], np.nan).fillna(0)
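The divide-then-clean pattern recurs for every ratio group below, so it can be sketched once as a helper. `safe_ratio` is a hypothetical name, not part of the original kernel, and unlike the code above it also maps negative infinity to 0.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise ratio where infinite and undefined results become 0."""
    ratio = numerator / denominator
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(0)

# Hypothetical rows: a normal case, a division by zero, and 0 / 0
toy = pd.DataFrame({'v2a1': [100.0, 50.0, 0.0],
                    'hogar_nin': [2, 0, 0]})
toy['new_v2a1_per_hogar_nin'] = safe_ratio(toy['v2a1'], toy['hogar_nin'])
print(toy)
```

A nonzero value divided by 0 gives `inf` in pandas and 0 / 0 gives `NaN`; both end up as 0 here.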
Ratio of rooms to family features
Room per family features
new_feats = []
for col in family_size_features:
    new_col_name = 'new_{}_per_{}'.format('rooms', col)
    new_feats.append(new_col_name)
    df_train[new_col_name] = df_train['rooms'] / df_train[col]
    df_test[new_col_name] = df_test['rooms'] / df_test[col]
for col in new_feats:
    df_train[col] = df_train[col].replace([np.inf], np.nan).fillna(0)
    df_test[col] = df_test[col].replace([np.inf], np.nan).fillna(0)
Ratio of bedrooms to family features
Bedroom per family features
new_feats = []
for col in family_size_features:
    new_col_name = 'new_{}_per_{}'.format('bedrooms', col)
    new_feats.append(new_col_name)
    df_train[new_col_name] = df_train['bedrooms'] / df_train[col]
    df_test[new_col_name] = df_test['bedrooms'] / df_test[col]
for col in new_feats:
    df_train[col] = df_train[col].replace([np.inf], np.nan).fillna(0)
    df_test[col] = df_test[col].replace([np.inf], np.nan).fillna(0)
# Check the number of features
print(df_train.shape, df_test.shape)
Ratio of tablets to family features
Tablet per family features
new_feats = []
for col in family_size_features:
    new_col_name = 'new_{}_per_{}'.format('v18q1', col)
    new_feats.append(new_col_name)
    df_train[new_col_name] = df_train['v18q1'] / df_train[col]
    df_test[new_col_name] = df_test['v18q1'] / df_test[col]
for col in new_feats:
    df_train[col] = df_train[col].replace([np.inf], np.nan).fillna(0)
    df_test[col] = df_test[col].replace([np.inf], np.nan).fillna(0)
Ratio of phones to family features
Phone per family features
new_feats = []
for col in family_size_features:
    new_col_name = 'new_{}_per_{}'.format('qmobilephone', col)
    new_feats.append(new_col_name)
    df_train[new_col_name] = df_train['qmobilephone'] / df_train[col]
    df_test[new_col_name] = df_test['qmobilephone'] / df_test[col]
for col in new_feats:
    df_train[col] = df_train[col].replace([np.inf], np.nan).fillna(0)
    df_test[col] = df_test[col].replace([np.inf], np.nan).fillna(0)
Ratio of years behind in school to family features
rez_esc (years behind in school) per family features
new_feats = []
for col in family_size_features:
    new_col_name = 'new_{}_per_{}'.format('rez_esc', col)
    new_feats.append(new_col_name)
    df_train[new_col_name] = df_train['rez_esc'] / df_train[col]
    df_test[new_col_name] = df_test['rez_esc'] / df_test[col]
for col in new_feats:
    df_train[col] = df_train[col].replace([np.inf], np.nan).fillna(0)
    df_test[col] = df_test[col].replace([np.inf], np.nan).fillna(0)
df_train['rez_esc_age'] = df_train['rez_esc'] / df_train['age']
df_train['rez_esc_escolari'] = df_train['rez_esc'] / df_train['escolari']
df_test['rez_esc_age'] = df_test['rez_esc'] / df_test['age']
df_test['rez_esc_escolari'] = df_test['rez_esc'] / df_test['escolari']
Rich features
- The author assumes that owning many phones or tablets indicates wealth, and creates wealth-related features.
df_train['tabulet_x_qmobilephone'] = df_train['v18q1'] * df_train['qmobilephone']
df_test['tabulet_x_qmobilephone'] = df_test['v18q1'] * df_test['qmobilephone']
- Wall, roof, and floor conditions are also key factors.
- Now let's multiply these features. Because they are binary categorical features, the product of each combination yields a new categorical feature.
- "epared1"," =1 if walls are bad"
- "epared2"," =1 if walls are regular"
- "epared3"," =1 if walls are good"
- "etecho1"," =1 if roof are bad"
- "etecho2"," =1 if roof are regular"
- "etecho3"," =1 if roof are good"
- "eviv1"," =1 if floor are bad"
- "eviv2"," =1 if floor are regular"
- "eviv3"," =1 if floor are good"
# wall and roof
for col1 in ['epared1','epared2','epared3']:
    for col2 in ['etecho1','etecho2','etecho3']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
# wall and floor
for col1 in ['epared1','epared2','epared3']:
    for col2 in ['eviv1','eviv2','eviv3']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
# roof and floor
for col1 in ['etecho1','etecho2','etecho3']:
    for col2 in ['eviv1','eviv2','eviv3']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
# combine all three
for col1 in ['epared1','epared2','epared3']:
    for col2 in ['etecho1','etecho2','etecho3']:
        for col3 in ['eviv1','eviv2','eviv3']:
            new_col_name = 'new_{}_x_{}_x_{}'.format(col1, col2, col3)
            df_train[new_col_name] = df_train[col1] * df_train[col2] * df_train[col3]
            # the test feature must be built from test columns only
            df_test[new_col_name] = df_test[col1] * df_test[col2] * df_test[col3]
print(df_train.shape, df_test.shape)
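The nested loops over indicator columns can be written more compactly with `itertools.product`. The sketch below uses a hypothetical two-row frame and only two levels per attribute to keep the output small.

```python
from itertools import product

import pandas as pd

# Hypothetical rows: row 0 has bad walls and a bad roof, row 1 has good ones
toy = pd.DataFrame({'epared1': [1, 0], 'epared3': [0, 1],
                    'etecho1': [1, 0], 'etecho3': [0, 1]})

# The product of two 0/1 indicators is 1 only when both conditions hold
for c1, c2 in product(['epared1', 'epared3'], ['etecho1', 'etecho3']):
    toy['new_{}_x_{}'.format(c1, c2)] = toy[c1] * toy[c2]

print(toy.filter(like='new_'))
```

Each new column acts as a one-hot flag for one wall/roof combination.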
- Combine the electricity and cooking-energy features into energy features.
for col1 in ['public', 'planpri', 'noelec', 'coopele']:
    for col2 in ['energcocinar1', 'energcocinar2', 'energcocinar3', 'energcocinar4']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
- Combine the toilet and rubbish-disposal features into other_infra features.
for col1 in ['sanitario1', 'sanitario2', 'sanitario3','sanitario5','sanitario6']:
    for col2 in ['elimbasu1', 'elimbasu2', 'elimbasu3', 'elimbasu4', 'elimbasu5', 'elimbasu6']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
- Combine the toilet and water-provision features into water features.
for col1 in ['abastaguadentro', 'abastaguafuera', 'abastaguano']:
    for col2 in ['sanitario1', 'sanitario2', 'sanitario3','sanitario5','sanitario6']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
print(df_train.shape, df_test.shape)
- Combine the education and area features into education_zone features.
for col1 in ['area1', 'area2']:
    for col2 in ['instlevel1', 'instlevel2', 'instlevel3', 'instlevel4', 'instlevel5', 'instlevel6',
                 'instlevel7','instlevel8','instlevel9']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
- Combine region (lugar) and education.
for col1 in ['lugar1', 'lugar2', 'lugar3', 'lugar4','lugar5', 'lugar6']:
    for col2 in ['instlevel1', 'instlevel2', 'instlevel3', 'instlevel4', 'instlevel5', 'instlevel6',
                 'instlevel7','instlevel8','instlevel9']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
print(df_train.shape, df_test.shape)
- The product of television / mobilephone / computer / tablet / refrigerator becomes the electronics feature.
- The sum of television / mobilephone / computer / tablet / refrigerator becomes the no_appliances feature.
df_train['electronics'] = df_train['computer'] * df_train['mobilephone'] * df_train['television'] * df_train['v18q'] * df_train['refrig']
df_train['no_appliances'] = df_train['refrig'] + df_train['computer'] + df_train['television'] + df_train['mobilephone'] + df_train['v18q']
df_test['electronics'] = df_test['computer'] * df_test['mobilephone'] * df_test['television'] * df_test['v18q'] * df_test['refrig']
df_test['no_appliances'] = df_test['refrig'] + df_test['computer'] + df_test['television'] + df_test['mobilephone'] + df_test['v18q']
- Combine the roof, floor, and wall material features.
for col1 in ['paredblolad', 'paredzocalo', 'paredpreb', 'pareddes','paredmad','paredzinc','paredfibras','paredother']:
    for col2 in ['pisomoscer','pisocemento','pisoother','pisonatur','pisonotiene','pisomadera']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
for col1 in ['pisomoscer','pisocemento','pisoother','pisonatur','pisonotiene','pisomadera']:
    for col2 in ['techozinc','techoentrepiso','techocane','techootro']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
for col1 in ['paredblolad', 'paredzocalo', 'paredpreb', 'pareddes','paredmad','paredzinc','paredfibras','paredother']:
    for col2 in ['techozinc','techoentrepiso','techocane','techootro']:
        new_col_name = 'new_{}_x_{}'.format(col1, col2)
        df_train[new_col_name] = df_train[col1] * df_train[col2]
        df_test[new_col_name] = df_test[col1] * df_test[col2]
for col1 in ['paredblolad', 'paredzocalo', 'paredpreb', 'pareddes','paredmad','paredzinc','paredfibras','paredother']:
    for col2 in ['pisomoscer','pisocemento','pisoother','pisonatur','pisonotiene','pisomadera']:
        for col3 in ['techozinc','techoentrepiso','techocane','techootro']:
            new_col_name = 'new_{}_x_{}_x_{}'.format(col1, col2, col3)
            df_train[new_col_name] = df_train[col1] * df_train[col2] * df_train[col3]
            df_test[new_col_name] = df_test[col1] * df_test[col2] * df_test[col3]
print(df_train.shape, df_test.shape)
Remove features that have only one value
col_with_only_one_value = []
for col in df_train.columns:
    if col == 'Target':
        continue
    if df_train[col].value_counts().shape[0] == 1 or df_test[col].value_counts().shape[0] == 1:
        print(col)
        col_with_only_one_value.append(col)
df_train.drop(col_with_only_one_value, axis=1, inplace=True)
df_test.drop(col_with_only_one_value, axis=1, inplace=True)
Check that the train and test data have the same features.
col_train = np.array(sorted([col for col in df_train.columns if col != 'Target']))
col_test = np.array(sorted(df_test.columns))
(col_train == col_test).sum() == len(col_train)
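The element-wise comparison only answers yes or no, and it is not meaningful when the two arrays differ in length. A set difference also shows which columns are missing on each side. The frames below are hypothetical stand-ins for df_train and df_test.

```python
import pandas as pd

# Hypothetical column layouts that deliberately disagree
toy_train = pd.DataFrame(columns=['a', 'b', 'Target'])
toy_test = pd.DataFrame(columns=['a', 'c'])

train_cols = set(toy_train.columns) - {'Target'}
test_cols = set(toy_test.columns)

only_train = sorted(train_cols - test_cols)  # features missing from test
only_test = sorted(test_cols - train_cols)   # features missing from train
print(only_train, only_test)
```

Both lists are empty when the two frames share exactly the same features.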
2.4 Aggregation features
aggregation features
def max_min(x):
    return x.max() - x.min()
agg_train = pd.DataFrame()
agg_test = pd.DataFrame()
for item in tqdm(family_size_features):
    for i, function in enumerate(['mean','std','min','max','sum','count', max_min]):
        group_train = df_train[item].groupby(df_train['idhogar']).agg(function)
        group_test = df_test[item].groupby(df_test['idhogar']).agg(function)
        if i == 6:
            # index 6 is the max_min callable, which needs an explicit name
            new_col = item + '_new_' + 'max_min'
        else:
            new_col = item + '_new_' + function
        agg_train[new_col] = group_train
        agg_test[new_col] = group_test
print('new aggregate train set has {} rows, and {} features'.format(agg_train.shape[0], agg_train.shape[1]))
print('new aggregate test set has {} rows, and {} features'.format(agg_test.shape[0], agg_test.shape[1]))
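The per-household aggregation can also be done in one `agg` call that takes the whole list of functions. This is a sketch on a hypothetical three-person frame; pandas names the column produced by a callable after the function, here `max_min`.

```python
import pandas as pd

def max_min(x):
    """Range of the values within one household."""
    return x.max() - x.min()

# Hypothetical individuals: household 'a' has two members, 'b' has one
toy = pd.DataFrame({'idhogar': ['a', 'a', 'b'],
                    'age': [30, 10, 50]})

# One row per household, one column per aggregation function
agg = toy.groupby('idhogar')['age'].agg(['mean', 'min', 'max', 'sum', 'count', max_min])
agg.columns = ['age_new_' + c for c in agg.columns]
print(agg)
```

This avoids the index check used above to name the max_min column.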
# Declare the list of features to aggregate
aggr_list = ['rez_esc', 'dis', 'male', 'female',
             'estadocivil1', 'estadocivil2', 'estadocivil3', 'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7',
             'parentesco2', 'parentesco3', 'parentesco4', 'parentesco5', 'parentesco6', 'parentesco7', 'parentesco8', 'parentesco9',
             'parentesco10', 'parentesco11', 'parentesco12',
             'instlevel1', 'instlevel2', 'instlevel3', 'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 'instlevel9',
             'epared1', 'epared2', 'epared3', 'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3', 'refrig', 'television', 'mobilephone',
             'area1', 'area2', 'v18q', 'edjef']
for item in tqdm(aggr_list):
    for function in ['count','sum']:
        group_train = df_train[item].groupby(df_train['idhogar']).agg(function)
        group_test = df_test[item].groupby(df_test['idhogar']).agg(function)
        new_col = item + '_new1_' + function
        agg_train[new_col] = group_train
        agg_test[new_col] = group_test
print('new aggregate train set has {} rows, and {} features'.format(agg_train.shape[0], agg_train.shape[1]))
print('new aggregate test set has {} rows, and {} features'.format(agg_test.shape[0], agg_test.shape[1]))
aggr_list = ['escolari', 'age', 'escolari_age', 'dependency', 'bedrooms', 'overcrowding', 'rooms', 'qmobilephone', 'v18q1']
for item in tqdm(aggr_list):
    # enumerate so that i refers to this loop, not to a value left over from an earlier one
    for i, function in enumerate(['mean','std','min','max','sum','count', max_min]):
        group_train = df_train[item].groupby(df_train['idhogar']).agg(function)
        group_test = df_test[item].groupby(df_test['idhogar']).agg(function)
        if i == 6:
            new_col = item + '_new2_' + 'max_min'
        else:
            new_col = item + '_new2_' + function
        agg_train[new_col] = group_train
        agg_test[new_col] = group_test
print('new aggregate train set has {} rows, and {} features'.format(agg_train.shape[0], agg_train.shape[1]))
print('new aggregate test set has {} rows, and {} features'.format(agg_test.shape[0], agg_test.shape[1]))
agg_test = agg_test.reset_index()
agg_train = agg_train.reset_index()
train_agg = pd.merge(df_train, agg_train, on='idhogar')
test = pd.merge(df_test, agg_test, on='idhogar')
#fill all na as 0
train_agg.fillna(value=0, inplace=True)
test.fillna(value=0, inplace=True)
print('train shape:', train_agg.shape, 'test shape:', test.shape)
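The merge on idhogar copies each household-level aggregate onto every member of that household. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical individuals belonging to two households
people = pd.DataFrame({'Id': [1, 2, 3],
                       'idhogar': ['a', 'a', 'b'],
                       'age': [30, 10, 50]})

# One row per household holding the mean age of its members
house = (people.groupby('idhogar')['age'].mean()
               .rename('age_new_mean')
               .reset_index())

# Every person receives the aggregate of their own household
merged = pd.merge(people, house, on='idhogar')
print(merged)
```

The row count stays the same because each idhogar appears exactly once in the aggregate table.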
aggr_list = ['rez_esc', 'dis', 'male', 'female',
             'estadocivil1', 'estadocivil2', 'estadocivil3', 'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7',
             'parentesco2', 'parentesco3', 'parentesco4', 'parentesco5', 'parentesco6', 'parentesco7', 'parentesco8', 'parentesco9', 'parentesco10',
             'parentesco11', 'parentesco12',
             'instlevel1', 'instlevel2', 'instlevel3', 'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 'instlevel9',
             'epared1', 'epared2', 'epared3', 'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3', 'refrig', 'television', 'mobilephone',
             'area1', 'area2', 'v18q', 'edjef']
for lugar in ['lugar1', 'lugar2', 'lugar3', 'lugar4', 'lugar5','lugar6']:
    group_train = df_train[[lugar, 'idhogar'] + aggr_list].groupby([lugar,'idhogar']).sum().reset_index()
    group_train.columns = [lugar, 'idhogar'] + ['new3_{}_idhogar_{}'.format(lugar,col) for col in group_train][2:]
    group_test = df_test[[lugar, 'idhogar'] + aggr_list].groupby([lugar,'idhogar']).sum().reset_index()
    group_test.columns = [lugar, 'idhogar'] + ['new3_{}_idhogar_{}'.format(lugar,col) for col in group_test][2:]
    train_agg = pd.merge(train_agg, group_train, on=[lugar, 'idhogar'])
    test = pd.merge(test, group_test, on=[lugar, 'idhogar'])
print('train shape:', train_agg.shape, 'test shape:', test.shape)
aggr_list = ['rez_esc', 'dis', 'male', 'female',
             'estadocivil1', 'estadocivil2', 'estadocivil3', 'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7',
             'parentesco2', 'parentesco3', 'parentesco4', 'parentesco5', 'parentesco6', 'parentesco7', 'parentesco8', 'parentesco9', 'parentesco10',
             'parentesco11', 'parentesco12',
             'instlevel1', 'instlevel2', 'instlevel3', 'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 'instlevel9',
             'epared1', 'epared2', 'epared3', 'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3', 'refrig', 'television', 'mobilephone',
             'area1', 'area2', 'v18q', 'edjef']
for lugar in ['lugar1', 'lugar2', 'lugar3', 'lugar4', 'lugar5','lugar6']:
    group_train = df_train[[lugar, 'idhogar'] + aggr_list].groupby([lugar,'idhogar']).sum().reset_index()
    group_train.columns = [lugar, 'idhogar'] + ['new4_{}_idhogar_{}'.format(lugar,col) for col in group_train][2:]
    group_test = df_test[[lugar, 'idhogar'] + aggr_list].groupby([lugar,'idhogar']).sum().reset_index()
    group_test.columns = [lugar, 'idhogar'] + ['new4_{}_idhogar_{}'.format(lugar,col) for col in group_test][2:]
    train_agg = pd.merge(train_agg, group_train, on=[lugar, 'idhogar'])
    test = pd.merge(test, group_test, on=[lugar, 'idhogar'])
print('train shape:', train_agg.shape, 'test shape:', test.shape)
cols_nums = ['age', 'meaneduc', 'dependency',
             'hogar_nin', 'hogar_adul', 'hogar_mayor', 'hogar_total',
             'bedrooms', 'overcrowding']
for function in tqdm(['mean','std','min','max','sum','count', max_min]):
    for lugar in ['lugar1', 'lugar2', 'lugar3', 'lugar4', 'lugar5','lugar6']:
        group_train = df_train[[lugar, 'idhogar'] + aggr_list].groupby([lugar,'idhogar']).agg(function).reset_index()
        group_train.columns = [lugar, 'idhogar'] + ['new5_{}_idhogar_{}_{}'.format(lugar, col, function) for col in group_train][2:]
        group_test = df_test[[lugar, 'idhogar'] + aggr_list].groupby([lugar,'idhogar']).agg(function).reset_index()
        group_test.columns = [lugar, 'idhogar'] + ['new5_{}_idhogar_{}_{}'.format(lugar, col, function) for col in group_test][2:]
        train_agg = pd.merge(train_agg, group_train, on=[lugar, 'idhogar'])
        test = pd.merge(test, group_test, on=[lugar, 'idhogar'])
print('train shape:', train_agg.shape, 'test shape:', test.shape)
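- According to the data description, only heads of household are used in scoring.
- All household members are included in test + sample submission, but only heads of household are scored.
train = train_agg.query('parentesco1==1').copy()  # copy so the edits below do not act on a view
train['dependency'] = train['dependency'].replace(np.inf, 0)
test['dependency'] = test['dependency'].replace(np.inf, 0)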
submission = test[['Id']]
# Drop features that are not needed, to reduce dimensionality.
train.drop(columns=['idhogar','Id', 'agesq', 'hogar_adul', 'SQBescolari', 'SQBage',
                    'SQBhogar_total', 'SQBedjefe', 'SQBhogar_nin', 'SQBovercrowding', 'SQBdependency', 'SQBmeaned'], inplace=True)
test.drop(columns=['idhogar','Id', 'agesq', 'hogar_adul', 'SQBescolari', 'SQBage',
                   'SQBhogar_total', 'SQBedjefe', 'SQBhogar_nin', 'SQBovercrowding', 'SQBdependency', 'SQBmeaned'], inplace=True)
correlation = train.corr()
correlation = correlation['Target'].sort_values(ascending=False)
print('final_data size', train.shape, test.shape)
print(f'Top 20 positively correlated features: \n {correlation.head(20)}')
print(f'Top 20 negatively correlated features: \n {correlation.tail(20)}')
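To show what the ranking produces, here is a toy example with one perfectly positive and one perfectly negative relationship to Target. The data are made up; unlike the code above, Target itself is dropped from the ranking so that its self-correlation of 1 does not take the top spot.

```python
import pandas as pd

# Hypothetical data: rooms rises with Target, overcrowding falls with it
toy = pd.DataFrame({'rooms':        [1, 2, 3, 4],
                    'overcrowding': [4, 3, 2, 1],
                    'Target':       [1, 2, 3, 4]})

# Pearson correlation of every feature with Target, highest first
corr = toy.corr()['Target'].drop('Target').sort_values(ascending=False)
print(corr)
```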