This topic is the Porto Seguro Safe Driver Prediction competition;
the goal is to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year.
This transcription follows Gabriel Preda's kernel.
The material is split across three posts, in the following order:
Porto Seguro Safe Driver Prediction (Gabriel Preda) (1)
1. Setting up the analysis
2. Data description
3. Metadata description
Porto Seguro Safe Driver Prediction (Gabriel Preda) (2)
4. Data analysis and statistics
Porto Seguro Safe Driver Prediction (Gabriel Preda) (3)
5. Preparing the data for the model
6. Preparing the models
7. Running the prediction model
5. Preparing the data for the model
1) Checking for missing values
In [94]:
vars_with_missing = []

# In this dataset a value of -1 marks a missing entry
for feature in trainset.columns:
    missings = trainset[trainset[feature] == -1][feature].count()
    if missings > 0:
        vars_with_missing.append(feature)
        missings_perc = missings / trainset.shape[0]
        print('Variable {} has {} records ({:.2%}) with missing values'.format(feature, missings, missings_perc))

print('In total, there are {} variables with missing values'.format(len(vars_with_missing)))
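The same counts can also be obtained in one vectorized pass; a minimal equivalent sketch (not part of the original kernel):

# Count the -1 placeholders per column in a single pass
missing_counts = (trainset == -1).sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts / trainset.shape[0])  # fraction of missing values per variable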
In [95]:
train_copy = trainset
train_copy = train_copy.replace(-1, np.nan)  # mark the -1 placeholders as NaN so missingno can visualize them
msno.matrix(df=train_copy.iloc[:, 2:39], figsize=(15,10), color=(0.42, 0.1, 0.05))
Out[95]: (missingness matrix plot for columns 2–39 of the training set)
2) Dropping unnecessary columns
Drop the ps_calc_* columns; removing them helps the CV score.
In [96]:
col_to_drop = trainset.columns[trainset.columns.str.startswith('ps_calc_')]
trainset = trainset.drop(col_to_drop, axis=1)
testset = testset.drop(col_to_drop, axis=1)
- Drop the variables with many missing values.
In [97]:
# see 1) Checking for missing values above
vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
trainset.drop(vars_to_drop, inplace=True, axis=1)
testset.drop(vars_to_drop, inplace=True, axis=1)
metadata.loc[(vars_to_drop), 'preserve'] = False
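The preserve flag in the metadata table (set up in part 1, indexed by variable name) can later be used to list the variables still in play; a minimal sketch:

# Variables still marked for use after the drops above
print(metadata[metadata['preserve'] == True].index.tolist())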
3) Replacing variable values (target encoding)
In [98]:
def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None,
                  tst_series=None,
                  target=None,
                  min_samples_leaf=1,
                  smoothing=1,
                  noise_level=0):
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)
    # Target mean and record count per category
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    # Smoothing weight: approaches 1 as the category count exceeds min_samples_leaf
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # Blend each category mean with the global prior according to the weight
    prior = target.mean()
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # Map the encoded values back onto the train series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    ft_trn_series.index = trn_series.index
    # ... and onto the test series; unseen categories fall back to the prior
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    ft_tst_series.index = tst_series.index
    # Multiplicative gaussian noise reduces overfitting to the encoding
    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)
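To see what the smoothed encoding produces, here is a minimal toy example (the series and values below are made up for illustration):

# Toy categorical feature; category 'd' appears only in the test series
trn_toy = pd.Series(['a', 'a', 'a', 'b', 'b', 'c'], name='cat')
tst_toy = pd.Series(['a', 'b', 'c', 'd'], name='cat')
y_toy = pd.Series([1, 0, 1, 0, 0, 1], name='target')

trn_te, tst_te = target_encode(trn_toy, tst_toy, target=y_toy,
                               min_samples_leaf=1, smoothing=1, noise_level=0)
print(tst_te)  # 'd' was never seen in training, so it falls back to the prior (y_toy.mean() = 0.5)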
Replace ps_car_11_cat with its encoded value.
In [99]:
train_encoded, test_encoded = target_encode(trainset['ps_car_11_cat'],
testset['ps_car_11_cat'],
target=trainset.target,
min_samples_leaf=100,
smoothing=10,
noise_level=0.01)
In [100]:
trainset['ps_car_11_cat_te'] = train_encoded
trainset.drop('ps_car_11_cat', axis=1, inplace=True)
metadata.loc['ps_car_11_cat', 'preserve'] = False
testset['ps_car_11_cat_te'] = test_encoded
testset.drop('ps_car_11_cat', axis=1, inplace=True)
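A quick look at the distribution of the new encoded column (a sketch):

print(trainset['ps_car_11_cat_te'].describe())  # smoothed claim-rate estimates per original category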
4) Balancing the target values
In [101]:
desired_apriori = 0.10

idx_0 = trainset[trainset.target == 0].index
idx_1 = trainset[trainset.target == 1].index

nb_0 = len(trainset.loc[idx_0])  # number of records with target == 0
nb_1 = len(trainset.loc[idx_1])  # number of records with target == 1

# Undersample target == 0 so that target == 1 makes up desired_apriori of the data
undersampling_rate = ((1-desired_apriori)*nb_1) / (nb_0*desired_apriori)
undersampled_nb_0 = int(undersampling_rate*nb_0)
print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))

undersampled_idx = shuffle(idx_0, random_state=314, n_samples=undersampled_nb_0)
idx_list = list(undersampled_idx) + list(idx_1)
trainset = trainset.loc[idx_list].reset_index(drop=True)
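The rate comes from solving nb_1 / (nb_1 + r * nb_0) = desired_apriori for r, so that target=1 records make up 10% of the data after sampling. A quick sanity check, reusing the variables above (a sketch):

# The target=1 share after undersampling should match desired_apriori
new_share = nb_1 / (nb_1 + undersampled_nb_0)
print('target=1 share after undersampling: {:.4f}'.format(new_share))  # ~0.10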
5) Replacing -1 with np.nan
In [102]:
trainset = trainset.replace(-1, np.nan)
testset = testset.replace(-1, np.nan)
6) Converting the cat variables into dummy variables
In [104]:
cat_feature = [a for a in trainset.columns if a.endswith('cat')]

for column in cat_feature:
    temp = pd.get_dummies(pd.Series(trainset[column]))
    trainset = pd.concat([trainset, temp], axis=1)
    trainset = trainset.drop([column], axis=1)

for column in cat_feature:
    temp = pd.get_dummies(pd.Series(testset[column]))
    testset = pd.concat([testset, temp], axis=1)
    testset = testset.drop([column], axis=1)
In [106]:
# Set aside the test ids and the train target, then drop the columns not used as features
id_test = testset['id'].values
target_train = trainset['target'].values
trainset = trainset.drop(['target','id'], axis=1)
testset = testset.drop(['id'], axis=1)
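One caveat: applying pd.get_dummies to train and test separately can produce mismatched columns when a category level appears in only one of the sets. A common guard, shown here as a sketch (not part of the original kernel), is to align the test features to the train features:

# Add any train-only dummy columns to the test set (filled with 0)
# and drop any test-only columns, keeping the column order identical
testset = testset.reindex(columns=trainset.columns, fill_value=0)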
In [107]:
# Check the dataset shapes
print("Train dataset (rows, cols):", trainset.values.shape, "\nTest dataset (rows, cols):", testset.values.shape)
6. Preparing the models
1) An ensemble class for cross-validation and stacking
To prepare the ensemble class, the data is split into K folds, the models are trained, and their results are ensembled (stacked).
The class takes four parameters:
- self : the object being initialized
- n_splits : the number of cross-validation splits to use
- stacker : the model used to stack the predictions of the trained base models
- base_models : the list of base models used for training
The training data is split into n_splits folds.
In [139]:
class Ensemble(object):
    def __init__(self, n_splits, stacker, base_models):
        self.n_splits = n_splits
        self.stacker = stacker
        self.base_models = base_models

    def fit_predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T)

        folds = list(StratifiedKFold(n_splits=self.n_splits, shuffle=True, random_state=314).split(X, y))

        # S_train holds out-of-fold predictions; S_test holds fold-averaged test predictions
        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))
        for i, clf in enumerate(self.base_models):
            S_test_i = np.zeros((T.shape[0], self.n_splits))
            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]

                print("Base model %d: fit %s model | fold %d" % (i+1, str(clf).split('(')[0], j+1))
                clf.fit(X_train, y_train)
                cross_score = cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc')
                print("cross_score [roc-auc]: %.5f [gini]: %.5f" % (cross_score.mean(), 2*cross_score.mean()-1))
                y_pred = clf.predict_proba(X_holdout)[:, 1]

                S_train[test_idx, i] = y_pred
                S_test_i[:, j] = clf.predict_proba(T)[:, 1]
            S_test[:, i] = S_test_i.mean(axis=1)

        results = cross_val_score(self.stacker, S_train, y, cv=3, scoring='roc_auc')
        # Calculate gini factor as 2 * AUC - 1
        print("Stacker score [gini]: %.5f" % (2 * results.mean() - 1))

        # Fit the stacker on the out-of-fold predictions and predict on the test stack
        self.stacker.fit(S_train, y)
        res = self.stacker.predict_proba(S_test)[:, 1]
        return res
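As a quick smoke test of the class on synthetic data (a sketch; the models and sizes here are illustrative, and it assumes numpy, StratifiedKFold and cross_val_score are already imported as in the notebook):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=314)
demo_stack = Ensemble(n_splits=3,
                      stacker=LogisticRegression(),
                      base_models=(RandomForestClassifier(n_estimators=50, random_state=314),))
demo_pred = demo_stack.fit_predict(X_demo, y_demo, X_demo[:5])  # probabilities for 5 'test' rows
print(demo_pred)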
2) Parameters for the base models
Prepare three different parameter sets for the LightGBM models and one for the XGBoost model.
In [140]:
# Each of these models will be trained on the train data
# lgb1
lgb_params1 = {}
lgb_params1['learning_rate'] = 0.02
lgb_params1['n_estimators'] = 650
lgb_params1['max_bin'] = 10
lgb_params1['subsample'] = 0.8
lgb_params1['subsample_freq'] = 10
lgb_params1['colsample_bytree'] = 0.8
lgb_params1['min_child_samples'] = 500
lgb_params1['seed'] = 314
lgb_params1['num_threads'] = 4
#lgb 2
lgb_params2 = {}
lgb_params2['n_estimators'] = 1090
lgb_params2['learning_rate'] = 0.02
lgb_params2['colsample_bytree'] = 0.3
lgb_params2['subsample'] = 0.7
lgb_params2['subsample_freq'] = 2
lgb_params2['num_leaves'] = 16
lgb_params2['seed'] = 314
lgb_params2['num_threads'] = 4
#lgb 3
lgb_params3 = {}
lgb_params3['n_estimators'] = 1100
lgb_params3['max_depth'] = 4
lgb_params3['learning_rate'] = 0.02
lgb_params3['seed'] = 314
lgb_params3['num_threads'] = 4
# XGBoost params
xgb_params = {}
xgb_params['objective'] = 'binary:logistic'
xgb_params['learning_rate'] = 0.04
xgb_params['n_estimators'] = 490
xgb_params['max_depth'] = 4
xgb_params['subsample'] = 0.9
xgb_params['colsample_bytree'] = 0.9
xgb_params['min_child_weight'] = 16
xgb_params['num_threads'] = 4
3) Initializing the models with their parameters
In [141]:
lgb_model1 = LGBMClassifier(**lgb_params1)
lgb_model2 = LGBMClassifier(**lgb_params2)
lgb_model3 = LGBMClassifier(**lgb_params3)
xgb_model = XGBClassifier(**xgb_params)
log_model = LogisticRegression()
In [142]:
stack = Ensemble(n_splits=3, stacker=log_model, base_models=(lgb_model1, lgb_model2, lgb_model3, xgb_model))
7. Running the prediction model
In [143]:
y_prediction = stack.fit_predict(trainset, target_train, testset)
In [144]:
submission = pd.DataFrame()
submission['id'] = id_test
submission['target'] = y_prediction
submission.to_csv('stacked.csv', index=False)
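Before uploading, a quick sanity check on the written file (a sketch):

sub_check = pd.read_csv('stacked.csv')
print(sub_check.shape)                  # should have one row per test id
print(sub_check['target'].describe())   # predicted probabilities should lie in [0, 1]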