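This post analyzes the dataset from the Kaggle competition House Prices: Advanced Regression Techniques.
Before predicting house prices with regression, I explored the data in four steps: data shape, categorical data encoding, graphs, and correlation and normal distribution. This post covers the first step.
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
Data shape
Loading the data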
In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from tqdm import tqdm_notebook
import matplotlib as mpl
from scipy import stats
from sklearn.preprocessing import LabelEncoder
# Render plots inline in the notebook
%matplotlib inline
# Use the ggplot style so the grid makes value ranges easier to read
plt.style.use('ggplot')
# Keep the minus sign from rendering as a broken glyph in plots
mpl.rcParams['axes.unicode_minus'] = False
In [2]:
house_train = pd.read_csv('../data/house_train.csv', encoding='utf-8')
house_train.head()
Out[2]:
In [3]:
house_train.shape
Out[3]:
In [4]:
house_train.columns
Out[4]:
In [5]:
# Check the dtype of each column
house_train.info()
In [6]:
# Summary statistics
house_train.describe()
Out[6]:
In [7]:
# Count missing values per column
house_train.isnull().sum()
Out[7]:
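The raw output of `isnull().sum()` lists every column, most of which have no missing values. As a minimal sketch, the helper below keeps only the columns that do have gaps and adds the missing ratio. The function name `missing_summary` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and ratio of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_ratio": counts / len(df),
    })
    # Drop columns without missing values, then show the worst ones first
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)

# Toy frame standing in for house_train (values are made up)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})
summary = missing_summary(toy)
print(summary)
```

In the notebook this would be called as `missing_summary(house_train)`.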
In [8]:
# Visualize missing values
import missingno as msno
# Missing values across all columns
msno.matrix(house_train, figsize=(12,5))
Out[8]:
In [9]:
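# Independent variables (the target, SalePrice, is kept with the numeric columns)
# Split the columns into numeric and categorical variables
# Two names in the data description differ from the actual train.csv columns and are corrected here ('Kitchen'->'KitchenAbvGr', 'Bedroom'->'BedroomAbvGr')
# data type: DataFrame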
numerical_col_df = house_train[['Id','SalePrice','LotFrontage','LotArea','OverallQual','OverallCond','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal']]
category_col_df = house_train[['Id','MSSubClass','MSZoning','Street','Alley','LotShape','LandContour','Utilities','LotConfig','LandSlope','Neighborhood','Condition1','Condition2','BldgType','HouseStyle','YearBuilt','YearRemodAdd','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','Heating','HeatingQC','CentralAir','Electrical','KitchenQual','Functional','FireplaceQu','GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond','PavedDrive','PoolQC','Fence','MiscFeature','MoSold','YrSold','SaleType','SaleCondition']]
category_date_col_df = house_train[['Id','YearBuilt','YearRemodAdd','GarageYrBlt','MoSold','YrSold']]
In [10]:
# Check missing values by data type
# Numeric data: LotFrontage, MasVnrArea
msno.matrix(numerical_col_df, figsize=(12,5))
Out[10]:
In [11]:
# Categorical data
# Alley, MasVnrType, the Bsmt* columns, Fireplace-related columns, and so on
msno.matrix(category_col_df, figsize=(20,5))
Out[11]:
In [12]:
# Column names per group, data type: list
numerical_col_list = ["SalePrice","LotFrontage","LotArea","OverallQual","OverallCond","MasVnrArea","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","1stFlrSF","2ndFlrSF","LowQualFinSF","GrLivArea","BsmtFullBath","BsmtHalfBath","FullBath","HalfBath","BedroomAbvGr","KitchenAbvGr","TotRmsAbvGrd","Fireplaces","GarageCars","GarageArea","WoodDeckSF","OpenPorchSF","EnclosedPorch","3SsnPorch","ScreenPorch","PoolArea","MiscVal"]
category_col_list = ["MSSubClass","MSZoning","Street","Alley","LotShape","LandContour","Utilities","LotConfig","LandSlope","Neighborhood","Condition1","Condition2","BldgType","HouseStyle","RoofStyle","RoofMatl","Exterior1st","Exterior2nd","MasVnrType","ExterQual","ExterCond","Foundation","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","Heating","HeatingQC","CentralAir","Electrical","KitchenQual","Functional","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond","PavedDrive","PoolQC","Fence","MiscFeature","SaleType","SaleCondition"]
category_date_col_list = ["YearBuilt","YearRemodAdd","GarageYrBlt","MoSold","YrSold"]
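Hand-written column lists like these are easy to get out of sync with the data. The following sketch shows one way to check that every column falls into exactly one group. The helper `check_column_groups` and the shortened toy lists are assumptions for illustration only.

```python
def check_column_groups(all_columns, groups, ignore=("Id",)):
    """Return (missing, duplicated).

    missing:    columns that appear in no group (ignored columns excluded)
    duplicated: mapping of column -> group names, for columns in several groups
    """
    seen = {}
    for group_name, cols in groups.items():
        for col in cols:
            seen.setdefault(col, []).append(group_name)
    missing = [c for c in all_columns if c not in seen and c not in ignore]
    duplicated = {c: names for c, names in seen.items() if len(names) > 1}
    return missing, duplicated

# Toy example with a handful of the real column names
toy_columns = ["Id", "SalePrice", "LotArea", "MSZoning", "YearBuilt", "YrSold"]
toy_groups = {
    "numerical": ["SalePrice", "LotArea"],
    "category": ["MSZoning", "YearBuilt"],
    "date": ["YearBuilt"],  # deliberately overlaps with "category"
}
missing, duplicated = check_column_groups(toy_columns, toy_groups)
print("not in any group:", missing)
print("in several groups:", duplicated)
```

With the notebook's variables, the call would pass `house_train.columns` together with a dict built from `numerical_col_list`, `category_col_list`, and `category_date_col_list`.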
In [13]:
category_col_df.dtypes
Out[13]:
In [14]:
## Handle missing values
## Numeric: fill with 0 for now (mean imputation is an option for later)
numerical_col_df = numerical_col_df.fillna({'LotFrontage': 0})
## Categorical: replace with the string 'missing'
category_col_df = category_col_df.fillna('missing')
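The comment above mentions mean imputation as a later alternative to filling with 0. A minimal sketch of the difference on a toy column follows; the values are made up, and the same pattern would apply to `LotFrontage` in `numerical_col_df`.

```python
import numpy as np
import pandas as pd

# Toy column standing in for LotFrontage (values are made up)
toy = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, np.nan, 70.0]})

# Option 1: fill with 0, as in the cell above
zero_filled = toy.fillna({"LotFrontage": 0})

# Option 2: fill with the column mean (NaN values are skipped when averaging)
mean_value = toy["LotFrontage"].mean()
mean_filled = toy.fillna({"LotFrontage": mean_value})

# Zero filling pulls the column mean down; mean filling preserves it
print(zero_filled["LotFrontage"].mean())
print(mean_filled["LotFrontage"].mean())
```

Filling with 0 treats a missing frontage as a real measurement of zero, which shifts the distribution. Mean filling keeps the average but shrinks the variance, so neither choice is neutral.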
Related posts in this series
[kaggle] House Prices: Advanced Regression Techniques (1) Data shape
[kaggle] House Prices: Advanced Regression Techniques (2) Categorical data encoding
[kaggle] House Prices: Advanced Regression Techniques (3) Graphs
[kaggle] House Prices: Advanced Regression Techniques (4) Correlation and normal distribution