import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn')
sns.set(font_scale=2.5)

import missingno as msno
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

1. Dataset 확인

df_train = pd.read_csv('../data/train.csv')
df_test = pd.read_csv('../data/test.csv')

df_train.head()

각 feature에 대한 통계치¶

df_train.describe()

df_test.describe()

1.1 null data check

for col in df_train.columns:
    msg = 'column: {:>10} \t Percent of NaN value: {:.2f}%'.format(col, 100* (df_train[col].isnull().sum()/df_train[col].shape[0]))
    print(msg)

column: PassengerId 	 Percent of NaN value: 0.00%
column:   Survived 	 Percent of NaN value: 0.00%
column:     Pclass 	 Percent of NaN value: 0.00%
column:       Name 	 Percent of NaN value: 0.00%
column:        Sex 	 Percent of NaN value: 0.00%
column:        Age 	 Percent of NaN value: 19.87%
column:      SibSp 	 Percent of NaN value: 0.00%
column:      Parch 	 Percent of NaN value: 0.00%
column:     Ticket 	 Percent of NaN value: 0.00%
column:       Fare 	 Percent of NaN value: 0.00%
column:      Cabin 	 Percent of NaN value: 77.10%
column:   Embarked 	 Percent of NaN value: 0.22%

for col in df_test.columns:
    msg = 'column: {:>10} \t Percent of NaN value: {:.2f}%'.format(col, 100* (df_test[col].isnull().sum() / df_test[col].shape[0]))
    print(msg)

column: PassengerId 	 Percent of NaN value: 0.00%
column:     Pclass 	 Percent of NaN value: 0.00%
column:       Name 	 Percent of NaN value: 0.00%
column:        Sex 	 Percent of NaN value: 0.00%
column:        Age 	 Percent of NaN value: 20.57%
column:      SibSp 	 Percent of NaN value: 0.00%
column:      Parch 	 Percent of NaN value: 0.00%
column:     Ticket 	 Percent of NaN value: 0.00%
column:       Fare 	 Percent of NaN value: 0.24%
column:      Cabin 	 Percent of NaN value: 78.23%
column:   Embarked 	 Percent of NaN value: 0.00%

Train, Test Set 에서 Age(둘다 약 20%), Cabin(둘다 약 80%), Ebarked(Train만 0.22%) null data가 존재함
아래는 MANO 라이브러리를 사용하여 null data를 확인한 결과 이다.

msno.matrix(df=df_train.iloc[:,:], figsize=(8,8), color=(0.8, 0.5, 0.2))

<matplotlib.axes._subplots.AxesSubplot at 0x23d89d00>

msno.bar(df=df_train.iloc[:,:], figsize=(8,8), color=(0.8, 0.5, 0.2))

<matplotlib.axes._subplots.AxesSubplot at 0x23da8340>

msno.bar(df=df_test.iloc[:,:], figsize=(8,8), color=(0.8, 0.5, 0.2))

<matplotlib.axes._subplots.AxesSubplot at 0x23d28fd0>

1.2 Target label 확인

target label 이 어떤 distribution을 가지고 있는지 확인이 필요함.
binary classification 문제의 1과 0의 분포에 따라 모델의 평가 방법이 달라 질 수 있음.

f, ax = plt.subplots(1, 2, figsize=(18,8))

df_train['Survived'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Pie plot - Survived')
ax[0].set_ylabel('')

sns.countplot('Survived', data=df_train, ax=ax[1])
ax[1].set_title('Count plot - Survived')

plt.show()

사망률이 더 크다
생존률은 38.4% 이다
target label의 분포가 제법 균일(balanced)하다.
만약 불균일할 경우, 예를 들어 100중 1이 99, 0이 1개인 경우에는 만약 모델이 모든것을 1이라 해도 정확도가 99%이기 때문에 다시 고려해봐야한다.

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

	PassengerId	Pclass	Age	SibSp	Parch	Fare
count	418.000000	418.000000	332.000000	418.000000	418.000000	417.000000
mean	1100.500000	2.265550	30.272590	0.447368	0.392344	35.627188
std	120.810458	0.841838	14.181209	0.896760	0.981429	55.907576
min	892.000000	1.000000	0.170000	0.000000	0.000000	0.000000
25%	996.250000	1.000000	21.000000	0.000000	0.000000	7.895800
50%	1100.500000	3.000000	27.000000	0.000000	0.000000	14.454200
75%	1204.750000	3.000000	39.000000	1.000000	0.000000	31.500000
max	1309.000000	3.000000	76.000000	8.000000	9.000000	512.329200

[kaggle] Titanic: Machine Learning from Disaster (2) (0)	2020.09.01
[kaggle] House Prices: Advanced Regression Techniques - 상관관계, 정규 분포 (0)	2020.04.30
[kaggle] House Prices: Advanced Regression Techniques (3) 그래프 (0)	2020.04.28

춤추는 개발자

[kaggle] Titanic: Machine Learning from Disaster (1)

1. Dataset 확인

각 feature에 대한 통계치¶

1.1 null data check

1.2 Target label 확인

'Competition > Kaggle' 카테고리의 다른 글

'Competition/Kaggle'의 다른글

티스토리툴바

[kaggle] Titanic: Machine Learning from Disaster (1)

1. Dataset 확인

각 feature에 대한 통계치¶

1.1 null data check

1.2 Target label 확인

'Competition > Kaggle' 카테고리의 다른 글

'Competition/Kaggle'의 다른글

관련글

티스토리툴바