
[Kaggle][Transcription] Porto Seguro Safe Driver Prediction (Gabriel Preda) (2)

bisi 2020. 9. 8. 11:57

This post covers the Porto Seguro Safe Driver Prediction competition.

The goal is to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year.

 

This transcription follows the code of Gabriel Preda.

 

 

The content is split into three posts, in the following order:

Porto Seguro Safe Driver Prediction (Gabriel Preda) (1)


1. Preparing for the data analysis

2. Data description

3. Metadata description

 

Porto Seguro Safe Driver Prediction (Gabriel Preda) (2)


4. Data analysis and statistics

 

Porto Seguro Safe Driver Prediction (Gabriel Preda) (3)


5. Preparing the data for the model

6. Preparing the model

7. Running the prediction model


4. Data analysis and statistics

1) Target variable

In [76]:
fig, ax = plt.subplots(figsize=(6,6))
x = trainset['target'].value_counts().index.values
y = trainset['target'].value_counts().values

# Bar plot of the target class counts
sns.barplot(ax=ax, x=x, y=y)
plt.ylabel('Number of values', fontsize=12)
plt.xlabel('Target values', fontsize=12)
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()
 
 
 

Only about 3.64% of the records have a target value of 1, which means the data is highly imbalanced. In such cases one can either undersample the target value 0 records or oversample the target value 1 records. Since the target 0 dataset is large, we will undersample target 0.
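The undersampling itself is carried out in a later post; as a minimal sketch, on a hypothetical toy DataFrame (the `desired_ratio` value and the variable names are assumptions, not from the original notebook), it could look like this:

```python
import numpy as np
import pandas as pd

# Toy stand-in for trainset: roughly 3.64% positives (hypothetical data)
rng = np.random.default_rng(0)
trainset = pd.DataFrame({'target': (rng.random(10000) < 0.0364).astype(int)})

desired_ratio = 0.10  # hypothetical share of positives after undersampling

pos = trainset[trainset['target'] == 1]
neg = trainset[trainset['target'] == 0]

# Undersample the majority class (target = 0) to reach the desired ratio
n_neg = int(len(pos) * (1 - desired_ratio) / desired_ratio)
neg_sampled = neg.sample(n=n_neg, random_state=42)

# Recombine and shuffle
balanced = pd.concat([pos, neg_sampled]).sample(frac=1, random_state=42)
print(balanced['target'].mean())  # close to desired_ratio
```

The ratio arithmetic follows from desired_ratio = pos / (pos + n_neg), solved for n_neg.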

2) real features

In [77]:
variable = metadata[(metadata.type=='real') & (metadata.preserve)].index
trainset[variable].describe()
Out[77]:
  ps_reg_01 ps_reg_02 ps_reg_03 ps_car_12 ps_car_13 ps_car_14 ps_car_15 ps_calc_01 ps_calc_02 ps_calc_03
count 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000
mean 0.610991 0.439184 0.551102 0.379945 0.813265 0.276256 3.065899 0.449756 0.449589 0.449849
std 0.287643 0.404264 0.793506 0.058327 0.224588 0.357154 0.731366 0.287198 0.286893 0.287153
min 0.000000 0.000000 -1.000000 -1.000000 0.250619 -1.000000 0.000000 0.000000 0.000000 0.000000
25% 0.400000 0.200000 0.525000 0.316228 0.670867 0.333167 2.828427 0.200000 0.200000 0.200000
50% 0.700000 0.300000 0.720677 0.374166 0.765811 0.368782 3.316625 0.500000 0.400000 0.500000
75% 0.900000 0.600000 1.000000 0.400000 0.906190 0.396485 3.605551 0.700000 0.700000 0.700000
max 0.900000 1.800000 4.037945 1.264911 3.720626 0.636396 3.741657 0.900000 0.900000 0.900000
In [78]:
(pow(trainset['ps_car_12']*10,2)).head()
Out[78]:
0    16.00
1    10.00
2    10.00
3    14.00
4     9.99
Name: ps_car_12, dtype: float64
In [79]:
trainset['ps_car_15'].head()
Out[79]:
0    3.605551
1    2.449490
2    3.316625
3    2.000000
4    2.000000
Name: ps_car_15, dtype: float64
In [80]:
(pow(trainset['ps_car_15'],2)).head()
Out[80]:
0    13.0
1     6.0
2    11.0
3     4.0
4     4.0
Name: ps_car_15, dtype: float64
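The two cells above suggest a structure: `ps_car_12` looks like the square root of a natural number divided by 10, and `ps_car_15` like the square root of a natural number, since squaring (after undoing the /10 scaling) lands on integers. A quick check of that guess, on hypothetical values shaped like the heads shown above:

```python
import numpy as np

# Hypothetical values shaped like the heads of ps_car_12 and ps_car_15 above
ps_car_12 = np.sqrt(np.array([16, 10, 10, 14, 10])) / 10
ps_car_15 = np.sqrt(np.array([13, 6, 11, 4, 4]))

# Undo the scaling and square: the results should be (near-)integers
recovered_12 = (ps_car_12 * 10) ** 2
recovered_15 = ps_car_15 ** 2

print(np.allclose(recovered_12, np.round(recovered_12)))  # True
print(np.allclose(recovered_15, np.round(recovered_15)))  # True
```

In the real data the round trip is only approximate (e.g. the 9.99 in Out[78]) because the stored values are rounded.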
In [81]:
sample = trainset.sample(frac=0.05)
var = ['ps_car_12', 'ps_car_15', 'target']
sample = sample[var]
sns.pairplot(sample, hue='target', palette='Set1', diag_kind='kde')
plt.show()
 
 

3) density of real features

  • Let's plot these with KDE graphs.
    • Kernel Density Estimation (KDE): draws a distribution, like a histogram, as a smoothed curve.
    • In R this corresponds to the density function.
In [82]:
var = metadata[(metadata.type =='real') & (metadata.preserve)].index
i = 0
t1 = trainset.loc[trainset['target'] != 0]
t0 = trainset.loc[trainset['target'] == 0]

sns.set_style('whitegrid')
fig, ax = plt.subplots(3,4,figsize=(16,12))

for feature in var:
    i += 1
    plt.subplot(3,4,i)
    sns.kdeplot(t1[feature], bw=0.5, label="target = 1")
    sns.kdeplot(t0[feature], bw=0.5, label="target = 0")
    plt.ylabel('Density plot', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)

plt.show()
 
 
 
  • Unlike the other variables, for ps_reg_02, ps_car_13 and ps_car_15 the density for target=0 is higher than for target=1.
 

4) Correlation of real features

In [83]:
def corr_heatmap(var):
    correlations = trainset[var].corr()
    
    cmap = sns.diverging_palette(50, 10, as_cmap=True)
    
    fig, ax = plt.subplots(figsize=(10,10))
    sns.heatmap(correlations, cmap=cmap, vmax=1.0, center=0, fmt='.2f',
               square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .75})
    plt.show()
    
var = metadata[(metadata.type == 'real') & (metadata.preserve)].index
corr_heatmap(var)
 
 
  • Strongly correlated pairs

    • ps_reg_01 and ps_reg_02 : 0.47
    • ps_reg_01 and ps_reg_03 : 0.64
    • ps_reg_02 and ps_reg_03 : 0.52
    • ps_car_12 and ps_car_13 : 0.67
    • ps_car_13 and ps_car_15 : 0.53
  • Let's chart a small sample of these variables.
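The pair list above can also be extracted programmatically rather than read off the heatmap. A sketch, demonstrated here on random data (in the notebook one would pass `trainset[var].corr()` instead); the `strong_pairs` helper and the 0.45 threshold are hypothetical:

```python
import numpy as np
import pandas as pd

def strong_pairs(corr, threshold=0.45):
    """Return (feature_a, feature_b, r) for pairs with |r| above the threshold."""
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(r, 2)))
    return pairs

# Demo on synthetic data: x1 and x2 are correlated by construction, x3 is noise
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
df = pd.DataFrame({'x1': a,
                   'x2': a + rng.normal(size=1000),
                   'x3': rng.normal(size=1000)})
print(strong_pairs(df.corr()))  # only the (x1, x2) pair, r around 0.7
```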

In [84]:
sample = trainset.sample(frac=0.05)
var = ['ps_reg_01', 'ps_reg_02', 'ps_reg_03','ps_car_12','ps_car_13','ps_car_15','target']
sample = sample[var]
sns.pairplot(sample, hue='target', palette='Set1', diag_kind='kde')
plt.show()
 
 

5) binary features

In [85]:
v = metadata[(metadata.type =='binary') &(metadata.preserve)].index
trainset[v].describe()
Out[85]:
  target ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
count 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000
mean 0.036448 0.393742 0.257033 0.163921 0.185304 0.000373 0.001692 0.009439 0.000948 0.660823 0.121081 0.153446 0.122427 0.627840 0.554182 0.287182 0.349024 0.153318
std 0.187401 0.488579 0.436998 0.370205 0.388544 0.019309 0.041097 0.096693 0.030768 0.473430 0.326222 0.360417 0.327779 0.483381 0.497056 0.452447 0.476662 0.360295
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000
75% 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
 

Let's look at the distribution of the binary data in the trainset. Blue shows the proportion with target 0, and red shows the proportion with target 1.

In [86]:
import matplotlib.patches as mpatches

bin_col = [col for col in trainset.columns if '_bin' in col]
zero_list = []
one_list = []

for col in bin_col:
    zero_list.append((trainset[col]==0).sum() / trainset.shape[0]*100)
    one_list.append((trainset[col]==1).sum() / trainset.shape[0]*100)

fig, ax = plt.subplots(figsize=(6,6))

# Stacked bar plot: share of zeros (blue) with the share of ones (red) on top
sns.barplot(ax=ax, x=bin_col, y=zero_list, color="blue")
sns.barplot(ax=ax, x=bin_col, y=one_list, bottom=zero_list, color="red")

plt.ylabel('Percentage of zero/one [%]', fontsize=12)
plt.xlabel('Binary features', fontsize=12)

locs, labels = plt.xticks() # use plt.xticks() to get the tick labels for rotation
plt.setp(labels, rotation=90)
plt.tick_params(axis='both', which='major', labelsize=12)

# sns.barplot returns an Axes, not bar artists, so use proxy patches for the legend
plt.legend(handles=[mpatches.Patch(color='blue', label='Zero'),
                    mpatches.Patch(color='red', label='One')])
plt.show()
 
 
 
  • Variables with a small share of ones: ps_ind_10_bin, ps_ind_11_bin, ps_ind_12_bin, ps_ind_13_bin
  • Variables with a large share of ones: ps_ind_16_bin, ps_calc_16_bin (over 60%).
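The same reading can be checked numerically: for a 0/1 column the mean equals the share of ones, so sorting the means of the `_bin` columns reproduces the chart's ordering. A sketch on hypothetical stand-in data (the column probabilities mimic the shares listed above):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the binary columns of trainset
rng = np.random.default_rng(0)
n = 10000
trainset = pd.DataFrame({
    'ps_ind_10_bin': (rng.random(n) < 0.0004).astype(int),
    'ps_ind_16_bin': (rng.random(n) < 0.66).astype(int),
    'ps_calc_16_bin': (rng.random(n) < 0.63).astype(int),
})

# For a 0/1 column, the mean equals the share of ones
bin_cols = [c for c in trainset.columns if '_bin' in c]
one_share = trainset[bin_cols].mean().sort_values()
print(one_share)  # smallest to largest share of ones
```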
In [87]:
var = [col for col in trainset.columns if '_bin' in col]

i = 0
t1 = trainset.loc[trainset['target'] != 0]
t0 = trainset.loc[trainset['target'] == 0]

sns.set_style('whitegrid')
fig, ax = plt.subplots(6,3,figsize=(12,24))

for feature in var:
    i += 1
    plt.subplot(6,3,i)
    sns.kdeplot(t1[feature], bw=0.5, label="target = 1")
    sns.kdeplot(t0[feature], bw=0.5, label="target = 0")
    plt.ylabel('Density plot', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)

plt.show()
 
 
 

6) categorical features

In [88]:
var = metadata[(metadata.type == 'categorical') & (metadata.preserve)].index

for feature in var :
    fig, ax = plt.subplots(figsize=(6,6))
    
    cat_perc = trainset[[feature, 'target']].groupby([feature], as_index=False).mean()
    cat_perc.sort_values(by='target', ascending=False, inplace=True)
    
    sns.barplot(ax=ax, x= feature, y ='target', data=cat_perc, order=cat_perc[feature])
    plt.ylabel('Percent of target with value 1 [%]', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)
    plt.show()
    

Categorical features can also be shown with density plots.

In [89]:
var = metadata[(metadata.type =='categorical') & (metadata.preserve)].index

i = 0
t1 = trainset.loc[trainset['target'] != 0]
t0 = trainset.loc[trainset['target'] == 0]

sns.set_style('whitegrid')
fig, ax = plt.subplots(4,4,figsize=(16,16))

for feature in var:
    i += 1
    plt.subplot(4,4,i)
    sns.kdeplot(t1[feature], bw=0.5, label="target = 1")
    sns.kdeplot(t0[feature], bw=0.5, label="target = 0")
    plt.ylabel('Density plot', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()
 
 
 

7) Distribution balance between trainset and testset

 

First, let's look at the registration variables.

In [90]:
var = metadata[(metadata.category =='registration') & (metadata.preserve)].index

sns.set_style('whitegrid')
fig, ax = plt.subplots(1,3,figsize=(12,4))
i=0
for feature in var:
    i += 1
    plt.subplot(1,3,i)
    sns.kdeplot(trainset[feature], bw=0.5, label="train")
    sns.kdeplot(testset[feature], bw=0.5, label="test")
    plt.ylabel('Distribution', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()
 
 
 

All the registration variables are distributed evenly between trainset and testset. Next, let's look at the car variables.

In [91]:
var = metadata[(metadata.category =='car') & (metadata.preserve)].index

sns.set_style('whitegrid')
fig, ax = plt.subplots(4,4,figsize=(20,16))
i=0
for feature in var:
    i += 1
    plt.subplot(4,4,i)
    sns.kdeplot(trainset[feature], bw=0.5, label="train")
    sns.kdeplot(testset[feature], bw=0.5, label="test")
    plt.ylabel('Distribution', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()
 
 
 

The car variables also look similarly distributed between trainset and testset. Next, let's look at the ind (individual) variables.

In [92]:
var = metadata[(metadata.category =='individual') & (metadata.preserve)].index

sns.set_style('whitegrid')
fig, ax = plt.subplots(5,4,figsize=(20,16))
i=0
for feature in var:
    i += 1
    plt.subplot(5,4,i)
    sns.kdeplot(trainset[feature], bw=0.5, label="train")
    sns.kdeplot(testset[feature], bw=0.5, label="test")
    plt.ylabel('Distribution', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()
 
 
 

The individual variables are also evenly distributed. Finally, let's look at the calc variables.

In [93]:
var = metadata[(metadata.category =='calculated') & (metadata.preserve)].index

sns.set_style('whitegrid')
fig, ax = plt.subplots(5,4,figsize=(20,16))
i=0
for feature in var:
    i += 1
    plt.subplot(5,4,i)
    sns.kdeplot(trainset[feature], bw=0.5, label="train")
    sns.kdeplot(testset[feature], bw=0.5, label="test")
    plt.ylabel('Distribution', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()
 
 

 


Go to the source code on GitHub