
[Kaggle][Transcription] Porto Seguro Safe Driver Prediction (Gabriel Preda) (2)

bisi 2020. 9. 8. 11:57

This post covers the Porto Seguro Safe Driver Prediction competition.

The goal is to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year.

 

This transcription follows the code of Gabriel Preda.

 

 

The content is split into three posts, in the following order:

Porto Seguro Safe Driver Prediction (Gabriel Preda) (1)


1. Preparing for the data analysis

2. Data description

3. Metadata description

 

Porto Seguro Safe Driver Prediction (Gabriel Preda) (2)


4. Data analysis and statistics

 

Porto Seguro Safe Driver Prediction (Gabriel Preda) (3)


5. Preparing the data for the model

6. Preparing the model

7. Running the prediction model


4. Data analysis and statistics

1) Target variable

In [76]:
fig, ax = plt.subplots(figsize=(6,6))
x = trainset['target'].value_counts().index.values
y = trainset['target'].value_counts().values

# Bar plot of the target class counts
sns.barplot(ax=ax, x=x, y=y)
plt.ylabel('Number of values', fontsize=12)
plt.xlabel('Target values', fontsize=12)
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()
 
 
 

Only about 3.64% of the records have a target value of 1, which means the data is highly imbalanced. In such cases one can either undersample the target value 0 records or oversample the target value 1 records. Since the target 0 dataset is large, we will undersample target 0.
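The undersampling itself is carried out in a later post; as a minimal sketch, on a hypothetical toy DataFrame (the `desired_ratio` value and the variable names are assumptions, not from the original notebook), it could look like this:

```python
import numpy as np
import pandas as pd

# Toy stand-in for trainset: roughly 3.64% positives (hypothetical data)
rng = np.random.default_rng(0)
trainset = pd.DataFrame({'target': (rng.random(10000) < 0.0364).astype(int)})

desired_ratio = 0.10  # hypothetical share of positives after undersampling

pos = trainset[trainset['target'] == 1]
neg = trainset[trainset['target'] == 0]

# Undersample the majority class (target = 0) to reach the desired ratio
n_neg = int(len(pos) * (1 - desired_ratio) / desired_ratio)
neg_sampled = neg.sample(n=n_neg, random_state=42)

# Recombine and shuffle
balanced = pd.concat([pos, neg_sampled]).sample(frac=1, random_state=42)
print(balanced['target'].mean())  # close to desired_ratio
```

The ratio arithmetic follows from desired_ratio = pos / (pos + n_neg), solved for n_neg.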

2) real features

In [77]:
variable = metadata[(metadata.type=='real') & (metadata.preserve)].index
trainset[variable].describe()
Out[77]:
  ps_reg_01 ps_reg_02 ps_reg_03 ps_car_12 ps_car_13 ps_car_14 ps_car_15 ps_calc_01 ps_calc_02 ps_calc_03
count 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000
mean 0.610991 0.439184 0.551102 0.379945 0.813265 0.276256 3.065899 0.449756 0.449589 0.449849
std 0.287643 0.404264 0.793506 0.058327 0.224588 0.357154 0.731366 0.287198 0.286893 0.287153
min 0.000000 0.000000 -1.000000 -1.000000 0.250619 -1.000000 0.000000 0.000000 0.000000 0.000000
25% 0.400000 0.200000 0.525000 0.316228 0.670867 0.333167 2.828427 0.200000 0.200000 0.200000
50% 0.700000 0.300000 0.720677 0.374166 0.765811 0.368782 3.316625 0.500000 0.400000 0.500000
75% 0.900000 0.600000 1.000000 0.400000 0.906190 0.396485 3.605551 0.700000 0.700000 0.700000
max 0.900000 1.800000 4.037945 1.264911 3.720626 0.636396 3.741657 0.900000 0.900000 0.900000
In [78]:
(pow(trainset['ps_car_12']*10,2)).head()
Out[78]:
0    16.00
1    10.00
2    10.00
3    14.00
4     9.99
Name: ps_car_12, dtype: float64
In [79]:
trainset['ps_car_15'].head()
Out[79]:
0    3.605551
1    2.449490
2    3.316625
3    2.000000
4    2.000000
Name: ps_car_15, dtype: float64
In [80]:
(pow(trainset['ps_car_15'],2)).head()
Out[80]:
0    13.0
1     6.0
2    11.0
3     4.0
4     4.0
Name: ps_car_15, dtype: float64
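The two cells above suggest a structure: `ps_car_12` looks like the square root of a natural number divided by 10, and `ps_car_15` like the square root of a natural number, since squaring (after undoing the /10 scaling) lands on integers. A quick check of that guess, on hypothetical values shaped like the heads shown above:

```python
import numpy as np

# Hypothetical values shaped like the heads of ps_car_12 and ps_car_15 above
ps_car_12 = np.sqrt(np.array([16, 10, 10, 14, 10])) / 10
ps_car_15 = np.sqrt(np.array([13, 6, 11, 4, 4]))

# Undo the scaling and square: the results should be (near-)integers
recovered_12 = (ps_car_12 * 10) ** 2
recovered_15 = ps_car_15 ** 2

print(np.allclose(recovered_12, np.round(recovered_12)))  # True
print(np.allclose(recovered_15, np.round(recovered_15)))  # True
```

In the real data the round trip is only approximate (e.g. the 9.99 in Out[78]) because the stored values are rounded.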
In [81]:
sample = trainset.sample(frac=0.05)
var = ['ps_car_12', 'ps_car_15', 'target']
sample = sample[var]
sns.pairplot(sample, hue='target', palette='Set1', diag_kind='kde')
plt.show()
 
 

3) density of real features

  • Let's plot these with KDE graphs.
    • Kernel Density Estimation (KDE): draws a distribution, like a histogram, as a smoothed curve.
    • In R this corresponds to the density function.
In [82]:
var = metadata[(metadata.type =='real') & (metadata.preserve)].index
i = 0
t1 = trainset.loc[trainset['target'] != 0]
t0 = trainset.loc[trainset['target'] == 0]

sns.set_style('whitegrid')
fig, ax = plt.subplots(3,4,figsize=(16,12))

for feature in var:
    i += 1
    plt.subplot(3,4,i)
    sns.kdeplot(t1[feature], bw=0.5, label="target = 1")
    sns.kdeplot(t0[feature], bw=0.5, label="target = 0")
    plt.ylabel('Density plot', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)

plt.show()
 
 
 
  • Unlike the other variables, for ps_reg_02, ps_car_13 and ps_car_15 the density for target=0 is higher than for target=1.
 

4) Correlation of real features

In [83]:
def corr_heatmap(var):
    correlations = trainset[var].corr()
    
    cmap = sns.diverging_palette(50, 10, as_cmap=True)
    
    fig, ax = plt.subplots(figsize=(10,10))
    sns.heatmap(correlations, cmap=cmap, vmax=1.0, center=0, fmt='.2f',
               square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .75})
    plt.show()
    
var = metadata[(metadata.type == 'real') & (metadata.preserve)].index
corr_heatmap(var)
 
 
  • Strongly correlated pairs

    • ps_reg_01 and ps_reg_02 : 0.47
    • ps_reg_01 and ps_reg_03 : 0.64
    • ps_reg_02 and ps_reg_03 : 0.52
    • ps_car_12 and ps_car_13 : 0.67
    • ps_car_13 and ps_car_15 : 0.53
  • Let's chart a small sample of these variables.
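The pair list above can also be extracted programmatically rather than read off the heatmap. A sketch, demonstrated here on random data (in the notebook one would pass `trainset[var].corr()` instead); the `strong_pairs` helper and the 0.45 threshold are hypothetical:

```python
import numpy as np
import pandas as pd

def strong_pairs(corr, threshold=0.45):
    """Return (feature_a, feature_b, r) for pairs with |r| above the threshold."""
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(r, 2)))
    return pairs

# Demo on synthetic data: x1 and x2 are correlated by construction, x3 is noise
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
df = pd.DataFrame({'x1': a,
                   'x2': a + rng.normal(size=1000),
                   'x3': rng.normal(size=1000)})
print(strong_pairs(df.corr()))  # only the (x1, x2) pair, r around 0.7
```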

In [84]:
sample = trainset.sample(frac=0.05)
var = ['ps_reg_01', 'ps_reg_02', 'ps_reg_03','ps_car_12','ps_car_13','ps_car_15','target']
sample = sample[var]
sns.pairplot(sample, hue='target', palette='Set1', diag_kind='kde')
plt.show()
 
 

5) binary features

In [85]:
v = metadata[(metadata.type =='binary') &(metadata.preserve)].index
trainset[v].describe()
Out[85]:
  target ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
count 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000
mean 0.036448 0.393742 0.257033 0.163921 0.185304 0.000373 0.001692 0.009439 0.000948 0.660823 0.121081 0.153446 0.122427 0.627840 0.554182 0.287182 0.349024 0.153318
std 0.187401 0.488579 0.436998 0.370205 0.388544 0.019309 0.041097 0.096693 0.030768 0.473430 0.326222 0.360417 0.327779 0.483381 0.497056 0.452447 0.476662 0.360295
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000
75% 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
 

Let's look at the distribution of the binary data in the trainset. Blue shows the proportion with target 0, and red shows the proportion with target 1.

In [86]:
import matplotlib.patches as mpatches

bin_col = [col for col in trainset.columns if '_bin' in col]
zero_list = []
one_list = []

for col in bin_col:
    zero_list.append((trainset[col]==0).sum() / trainset.shape[0]*100)
    one_list.append((trainset[col]==1).sum() / trainset.shape[0]*100)

fig, ax = plt.subplots(figsize=(6,6))

# Stacked bar plot: share of zeros (blue) with the share of ones (red) on top
sns.barplot(ax=ax, x=bin_col, y=zero_list, color="blue")
sns.barplot(ax=ax, x=bin_col, y=one_list, bottom=zero_list, color="red")

plt.ylabel('Percentage of zero/one [%]', fontsize=12)
plt.xlabel('Binary features', fontsize=12)

locs, labels = plt.xticks() # use plt.xticks() to get the tick labels for rotation
plt.setp(labels, rotation=90)
plt.tick_params(axis='both', which='major', labelsize=12)

# sns.barplot returns an Axes, not bar artists, so use proxy patches for the legend
plt.legend(handles=[mpatches.Patch(color='blue', label='Zero'),
                    mpatches.Patch(color='red', label='One')])
plt.show()
 
 
 
  • Variables with a small share of ones: ps_ind_10_bin, ps_ind_11_bin, ps_ind_12_bin, ps_ind_13_bin
  • Variables with a large share of ones: ps_ind_16_bin, ps_calc_16_bin (over 60%).
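The same reading can be checked numerically: for a 0/1 column the mean equals the share of ones, so sorting the means of the `_bin` columns reproduces the chart's ordering. A sketch on hypothetical stand-in data (the column probabilities mimic the shares listed above):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the binary columns of trainset
rng = np.random.default_rng(0)
n = 10000
trainset = pd.DataFrame({
    'ps_ind_10_bin': (rng.random(n) < 0.0004).astype(int),
    'ps_ind_16_bin': (rng.random(n) < 0.66).astype(int),
    'ps_calc_16_bin': (rng.random(n) < 0.63).astype(int),
})

# For a 0/1 column, the mean equals the share of ones
bin_cols = [c for c in trainset.columns if '_bin' in c]
one_share = trainset[bin_cols].mean().sort_values()
print(one_share)  # smallest to largest share of ones
```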
In [87]:
var = [col for col in trainset.columns if '_bin' in col]

i = 0
t1 = trainset.loc[trainset['target'] != 0]
t0 = trainset.loc[trainset['target'] == 0]

sns.set_style('whitegrid')
fig, ax = plt.subplots(6,3,figsize=(12,24))

for feature in var:
    i += 1
    plt.subplot(6,3,i)
    sns.kdeplot(t1[feature], bw=0.5, label="target = 1")
    sns.kdeplot(t0[feature], bw=0.5, label="target = 0")
    plt.ylabel('Density plot', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)

plt.show()
 
 
 

6) categorical features

In [88]:
var = metadata[(metadata.type == 'categorical') & (metadata.preserve)].index

for feature in var :
    fig, ax = plt.subplots(figsize=(6,6))
    
    cat_perc = trainset[[feature, 'target']].groupby([feature], as_index=False).mean()
    cat_perc.sort_values(by='target', ascending=False, inplace=True)
    
    sns.barplot(ax=ax, x= feature, y ='target', data=cat_perc, order=cat_perc[feature])
    plt.ylabel('Percent of target with value 1 [%]', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)
    plt.show()
    

Categorical features can also be shown with density plots.

In [89]:
var = metadata[(metadata.type =='categorical') & (metadata.preserve)].index

i = 0
t1 = trainset.loc[trainset['target'] != 0]
t0 = trainset.loc[trainset['target'] == 0]

sns.set_style('whitegrid')
fig, ax = plt.subplots(4,4,figsize=(16,16))

for feature in var:
    i += 1
    plt.subplot(4,4,i)
    sns.kdeplot(t1[feature], bw=0.5, label="target = 1")
    sns.kdeplot(t0[feature], bw=0.5, label="target = 0")
    plt.ylabel('Density plot', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()
 
 
 

7) Distribution balance between trainset and testset

 

First, let's look at the registration variables.

In [90]:
var = metadata[(metadata.category =='registration') & (metadata.preserve)].index

sns.set_style('whitegrid')
fig, ax = plt.subplots(1,3,figsize=(12,4))
i=0
for feature in var:
    i += 1
    plt.subplot(1,3,i)
    sns.kdeplot(trainset[feature], bw=0.5, label="train")
    sns.kdeplot(testset[feature], bw=0.5, label="test")
    plt.ylabel('Distribution', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()
 
 
 

All the registration variables are distributed evenly between trainset and testset. Next, let's look at the car variables.

In [91]:
var = metadata[(metadata.category =='car') & (metadata.preserve)].index

sns.set_style('whitegrid')
fig, ax = plt.subplots(4,4,figsize=(20,16))
i=0
for feature in var:
    i += 1
    plt.subplot(4,4,i)
    sns.kdeplot(trainset[feature], bw=0.5, label="train")
    sns.kdeplot(testset[feature], bw=0.5, label="test")
    plt.ylabel('Distribution', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()
 
 
 

The car variables also look similarly distributed between trainset and testset. Next, let's look at the ind (individual) variables.

In [92]:
var = metadata[(metadata.category =='individual') & (metadata.preserve)].index

sns.set_style('whitegrid')
fig, ax = plt.subplots(5,4,figsize=(20,16))
i=0
for feature in var:
    i += 1
    plt.subplot(5,4,i)
    sns.kdeplot(trainset[feature], bw=0.5, label="train")
    sns.kdeplot(testset[feature], bw=0.5, label="test")
    plt.ylabel('Distribution', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()
 
 
 

The individual variables are also evenly distributed. Finally, let's look at the calc variables.

In [93]:
var = metadata[(metadata.category =='calculated') & (metadata.preserve)].index

sns.set_style('whitegrid')
fig, ax = plt.subplots(5,4,figsize=(20,16))
i=0
for feature in var:
    i += 1
    plt.subplot(5,4,i)
    sns.kdeplot(trainset[feature], bw=0.5, label="train")
    sns.kdeplot(testset[feature], bw=0.5, label="test")
    plt.ylabel('Distribution', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()
 
 

 


Go to the source code on GitHub