
[kaggle] House Prices: Advanced Regression Techniques - Correlation and Normal Distribution

bisi 2020. 4. 30. 10:39

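I analyzed the House Prices: Advanced Regression Techniques dataset from the Kaggle competition.

Before predicting house prices with regression, I explored the data in four steps (see the related posts at the bottom). This post covers the correlation and distribution step.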

 

Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

 



Correlation and Normal Distribution

 

HouseData_Cleansing_Final_HB-Copy4
In [22]:
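# Correlation between the area features and the sale price.
# house_train is assumed to be the training DataFrame loaded in the earlier posts of this series.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

area_cols = ["LotArea","MasVnrArea","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF",
             "1stFlrSF","2ndFlrSF","LowQualFinSF","GrLivArea","GarageArea","WoodDeckSF",
             "OpenPorchSF","EnclosedPorch","3SsnPorch","ScreenPorch","PoolArea","SalePrice"]
corrMatt_area = house_train[area_cols].corr()
print(corrMatt_area)

# Boolean mask that hides the cells strictly above the diagonal in the heatmap
mask = np.triu(np.ones_like(corrMatt_area, dtype=bool), k=1)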
                LotArea  MasVnrArea  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  \
LotArea        1.000000    0.104160    0.214103    0.111170  -0.002618   
MasVnrArea     0.104160    1.000000    0.264736   -0.072319   0.114442   
BsmtFinSF1     0.214103    0.264736    1.000000   -0.050117  -0.495251   
BsmtFinSF2     0.111170   -0.072319   -0.050117    1.000000  -0.209294   
BsmtUnfSF     -0.002618    0.114442   -0.495251   -0.209294   1.000000   
TotalBsmtSF    0.260833    0.363936    0.522396    0.104810   0.415360   
1stFlrSF       0.299475    0.344501    0.445863    0.097117   0.317987   
2ndFlrSF       0.050986    0.174561   -0.137079   -0.099260   0.004469   
LowQualFinSF   0.004779   -0.069071   -0.064503    0.014807   0.028167   
GrLivArea      0.263116    0.390857    0.208171   -0.009640   0.240257   
GarageArea     0.180403    0.373066    0.296970   -0.018227   0.183303   
WoodDeckSF     0.171698    0.159718    0.204306    0.067898  -0.005316   
OpenPorchSF    0.084774    0.125703    0.111761    0.003093   0.129005   
EnclosedPorch -0.018340   -0.110204   -0.102303    0.036543  -0.002538   
3SsnPorch      0.020423    0.018796    0.026451   -0.029993   0.020764   
ScreenPorch    0.043160    0.061466    0.062021    0.088871  -0.012579   
PoolArea       0.077672    0.011723    0.140491    0.041709  -0.035092   
SalePrice      0.263843    0.477493    0.386420   -0.011378   0.214479   

               TotalBsmtSF  1stFlrSF  2ndFlrSF  LowQualFinSF  GrLivArea  \
LotArea           0.260833  0.299475  0.050986      0.004779   0.263116   
MasVnrArea        0.363936  0.344501  0.174561     -0.069071   0.390857   
BsmtFinSF1        0.522396  0.445863 -0.137079     -0.064503   0.208171   
BsmtFinSF2        0.104810  0.097117 -0.099260      0.014807  -0.009640   
BsmtUnfSF         0.415360  0.317987  0.004469      0.028167   0.240257   
TotalBsmtSF       1.000000  0.819530 -0.174512     -0.033245   0.454868   
1stFlrSF          0.819530  1.000000 -0.202646     -0.014241   0.566024   
2ndFlrSF         -0.174512 -0.202646  1.000000      0.063353   0.687501   
LowQualFinSF     -0.033245 -0.014241  0.063353      1.000000   0.134683   
GrLivArea         0.454868  0.566024  0.687501      0.134683   1.000000   
GarageArea        0.486665  0.489782  0.138347     -0.067601   0.468997   
WoodDeckSF        0.232019  0.235459  0.092165     -0.025444   0.247433   
OpenPorchSF       0.247264  0.211671  0.208026      0.018251   0.330224   
EnclosedPorch    -0.095478 -0.065292  0.061989      0.061081   0.009113   
3SsnPorch         0.037384  0.056104 -0.024358     -0.004296   0.020643   
ScreenPorch       0.084489  0.088758  0.040606      0.026799   0.101510   
PoolArea          0.126053  0.131525  0.081487      0.062157   0.170205   
SalePrice         0.613581  0.605852  0.319334     -0.025606   0.708624   

               GarageArea  WoodDeckSF  OpenPorchSF  EnclosedPorch  3SsnPorch  \
LotArea          0.180403    0.171698     0.084774      -0.018340   0.020423   
MasVnrArea       0.373066    0.159718     0.125703      -0.110204   0.018796   
BsmtFinSF1       0.296970    0.204306     0.111761      -0.102303   0.026451   
BsmtFinSF2      -0.018227    0.067898     0.003093       0.036543  -0.029993   
BsmtUnfSF        0.183303   -0.005316     0.129005      -0.002538   0.020764   
TotalBsmtSF      0.486665    0.232019     0.247264      -0.095478   0.037384   
1stFlrSF         0.489782    0.235459     0.211671      -0.065292   0.056104   
2ndFlrSF         0.138347    0.092165     0.208026       0.061989  -0.024358   
LowQualFinSF    -0.067601   -0.025444     0.018251       0.061081  -0.004296   
GrLivArea        0.468997    0.247433     0.330224       0.009113   0.020643   
GarageArea       1.000000    0.224666     0.241435      -0.121777   0.035087   
WoodDeckSF       0.224666    1.000000     0.058661      -0.125989  -0.032771   
OpenPorchSF      0.241435    0.058661     1.000000      -0.093079  -0.005842   
EnclosedPorch   -0.121777   -0.125989    -0.093079       1.000000  -0.037305   
3SsnPorch        0.035087   -0.032771    -0.005842      -0.037305   1.000000   
ScreenPorch      0.051412   -0.074181     0.074304      -0.082864  -0.031436   
PoolArea         0.061047    0.073378     0.060762       0.054203  -0.007992   
SalePrice        0.623431    0.324413     0.315856      -0.128578   0.044584   

               ScreenPorch  PoolArea  SalePrice  
LotArea           0.043160  0.077672   0.263843  
MasVnrArea        0.061466  0.011723   0.477493  
BsmtFinSF1        0.062021  0.140491   0.386420  
BsmtFinSF2        0.088871  0.041709  -0.011378  
BsmtUnfSF        -0.012579 -0.035092   0.214479  
TotalBsmtSF       0.084489  0.126053   0.613581  
1stFlrSF          0.088758  0.131525   0.605852  
2ndFlrSF          0.040606  0.081487   0.319334  
LowQualFinSF      0.026799  0.062157  -0.025606  
GrLivArea         0.101510  0.170205   0.708624  
GarageArea        0.051412  0.061047   0.623431  
WoodDeckSF       -0.074181  0.073378   0.324413  
OpenPorchSF       0.074304  0.060762   0.315856  
EnclosedPorch    -0.082864  0.054203  -0.128578  
3SsnPorch        -0.031436 -0.007992   0.044584  
ScreenPorch       1.000000  0.051307   0.111447  
PoolArea          0.051307  1.000000   0.092404  
SalePrice         0.111447  0.092404   1.000000  
In [23]:
# Heatmap of the correlations between the area features and the sale price
fig, ax = plt.subplots()
fig.set_size_inches(20,10)
sns.heatmap(corrMatt_area, mask=mask, vmax=.8, square=True, annot=True, ax=ax)
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x18997b82f60>
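  • First-floor square feet (1stFlrSF) and total basement square feet (TotalBsmtSF) have the highest correlation between any two features, at 0.82.
  • The feature most strongly correlated with price is the above-grade living area in square feet (GrLivArea), at 0.71.
  • Several other area columns are also clearly correlated with price (GarageArea 0.62, TotalBsmtSF 0.61, 1stFlrSF 0.61), while the rest, such as BsmtFinSF2, LowQualFinSF, 3SsnPorch and PoolArea, show almost no linear relationship with price (|r| < 0.1).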
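Reading the SalePrice column of a large correlation matrix by eye is error-prone, so it can help to sort the features by the absolute value of their correlation with the target. The following is a minimal sketch on synthetic data, not the Kaggle data; the helper name rank_by_target_corr and the generated values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (made-up values, not the Kaggle dataset).
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 400, n)
df = pd.DataFrame({
    "GrLivArea": area,
    "PoolArea": rng.normal(0, 1, n),                      # unrelated noise column
    "SalePrice": 100 * area + rng.normal(0, 20000, n),    # price driven by area
})

def rank_by_target_corr(frame, target="SalePrice"):
    """Return each numeric feature's Pearson correlation with the target,
    ordered from the strongest to the weakest absolute correlation."""
    corr = frame.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

ranking = rank_by_target_corr(df)
print(ranking)
```

Sorting by absolute value keeps strongly negative correlations near the top instead of pushing them to the end of the list.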
In [25]:
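# Correlation between the house-composition features and the sale price
# corrMatt_area = house_train[["Bedroom","Kitchen","TotRmsAbvGrd","Functional","Fireplaces","GarageCond","GarageType","GarageYrBlt","BldgType","HouseStyle","OverallQual","OverallCond","HeatingQC","Neighborhood","Condition1","Condition2","RoofMatl","SalePrice"]]
corrMatt_area = house_train[["TotRmsAbvGrd","Functional","Fireplaces","GarageCond","GarageType","GarageYrBlt","BldgType","HouseStyle","OverallQual","OverallCond","HeatingQC","Neighborhood","Condition1","Condition2","RoofMatl","SalePrice"]]
# Only numeric columns can be correlated; text columns such as Functional or GarageType are
# excluded explicitly (recent pandas versions raise an error instead of dropping them silently).
corrMatt_area = corrMatt_area.select_dtypes(include="number").corr()
print(corrMatt_area)

# Boolean mask that hides the cells strictly above the diagonal in the heatmap
mask = np.triu(np.ones_like(corrMatt_area, dtype=bool), k=1)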
              TotRmsAbvGrd  Fireplaces  GarageYrBlt  OverallQual  OverallCond  \
TotRmsAbvGrd      1.000000    0.326114     0.148112     0.427452    -0.057583   
Fireplaces        0.326114    1.000000     0.046822     0.396765    -0.023820   
GarageYrBlt       0.148112    0.046822     1.000000     0.547766    -0.324297   
OverallQual       0.427452    0.396765     0.547766     1.000000    -0.091932   
OverallCond      -0.057583   -0.023820    -0.324297    -0.091932     1.000000   
SalePrice         0.533723    0.466929     0.486362     0.790982    -0.077856   

              SalePrice  
TotRmsAbvGrd   0.533723  
Fireplaces     0.466929  
GarageYrBlt    0.486362  
OverallQual    0.790982  
OverallCond   -0.077856  
SalePrice      1.000000  
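The printed matrix contains only 6 of the 16 selected columns because the remaining ones hold text labels, which a Pearson correlation cannot use. The sketch below uses a small made-up frame to show which columns get excluded and one possible way to bring an ordered label back in as a number. The mapping quality_order and the column name HeatingQC_num are assumptions for this example, not something used in the notebook.

```python
import pandas as pd

# Tiny made-up frame that mimics the mix of numeric and text columns.
toy = pd.DataFrame({
    "OverallQual": [5, 7, 6, 8, 4],
    "HeatingQC": ["TA", "Ex", "Gd", "Ex", "Fa"],   # text labels
    "SalePrice": [150000, 250000, 200000, 310000, 120000],
})

# Columns that a numeric-only correlation leaves out.
numeric = toy.select_dtypes(include="number")
dropped = [c for c in toy.columns if c not in numeric.columns]
print("excluded from corr():", dropped)

# Hypothetical ordinal mapping (an assumption for this sketch): worse -> better.
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
toy["HeatingQC_num"] = toy["HeatingQC"].map(quality_order)

corr = toy.select_dtypes(include="number").corr()
print(corr["SalePrice"])
```

An ordinal mapping like this only makes sense for labels with a natural order; unordered categories such as a neighborhood name would need a different encoding.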
In [26]:
# Heatmap of the correlations between the house-composition features and the sale price
fig, ax = plt.subplots()
fig.set_size_inches(20,10)
sns.heatmap(corrMatt_area, mask=mask, vmax=.8, square=True, annot=True, ax=ax)
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x18996e5a320>
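  • Overall material and finish quality (OverallQual) has the highest correlation with price, at 0.79.
  • TotRmsAbvGrd (0.53), GarageYrBlt (0.49) and Fireplaces (0.47) are moderately correlated with price, while OverallCond (-0.08) shows almost no linear relationship.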
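Both heatmaps pass a mask to sns.heatmap, which hides every cell where the mask is truthy. As a rough sketch on a made-up 4x4 matrix, the code below compares an explicit boolean mask with a mask built by copying the correlation values and zeroing the lower triangle. The second variant relies on the remaining values being nonzero, so a correlation of exactly 0 would stay visible.

```python
import numpy as np

# Toy symmetric matrix standing in for a correlation matrix (made-up values).
corr = np.array([
    [1.0, 0.8, 0.0, 0.3],
    [0.8, 1.0, 0.5, 0.1],
    [0.0, 0.5, 1.0, 0.2],
    [0.3, 0.1, 0.2, 1.0],
])

# Explicit boolean mask: True strictly above the diagonal, False elsewhere.
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)

# Float-based variant: copy the values, then zero the lower triangle and diagonal.
# A cell is hidden only if the value left in the upper triangle is nonzero.
float_mask = corr.copy()
float_mask[np.tril_indices_from(float_mask)] = 0.0
float_mask_as_bool = float_mask.astype(bool)

print(mask.astype(int))
print(float_mask_as_bool.astype(int))
```

The two masks differ only at position (0, 2), where the toy correlation is exactly 0.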
In [27]:
# Drop rows whose SalePrice lies more than 3 standard deviations from the mean
house_trainWithoutOutliers = house_train[np.abs(house_train["SalePrice"] - house_train["SalePrice"].mean()) <= (3*house_train["SalePrice"].std())]

print(house_train.shape)
print(house_trainWithoutOutliers.shape)
(1460, 81)
(1438, 81)
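The filter above keeps a row only when its SalePrice is within 3 standard deviations of the mean, which removed 22 of the 1,460 rows. The same rule can be wrapped in a small reusable function. This is a sketch with made-up prices; the function name drop_sigma_outliers and its parameters are assumptions for illustration.

```python
import pandas as pd

def drop_sigma_outliers(frame, column, n_sigma=3.0):
    """Keep rows whose value in `column` is within n_sigma sample
    standard deviations of the column mean."""
    values = frame[column]
    keep = (values - values.mean()).abs() <= n_sigma * values.std()
    return frame[keep]

# Made-up prices: 60 ordinary houses and one extreme sale.
prices = pd.DataFrame({"SalePrice": [100_000] * 30 + [105_000] * 30 + [2_000_000]})
filtered = drop_sigma_outliers(prices, "SalePrice")
print(prices.shape, filtered.shape)
```

Note that the mean and standard deviation are themselves pulled upward by the extreme values, so this rule is conservative on strongly skewed data.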
In [28]:
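# Examine the SalePrice distribution against a normal distribution:
# top row = raw prices, bottom row = log1p-transformed prices with outliers removed
figure, axes = plt.subplots(ncols=2, nrows=2)
figure.set_size_inches(12, 10)

# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True draws a comparable plot
sns.histplot(house_train["SalePrice"], kde=True, stat="density", ax=axes[0][0])
stats.probplot(house_train["SalePrice"], dist='norm', fit=True, plot=axes[0][1])
# Use log1p in both bottom panels so the histogram and the Q-Q plot show the same transform
sns.histplot(np.log1p(house_trainWithoutOutliers["SalePrice"]), kde=True, stat="density", ax=axes[1][0])
stats.probplot(np.log1p(house_trainWithoutOutliers["SalePrice"]), dist='norm', fit=True, plot=axes[1][1])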
Out[28]:
((array([-3.30088288, -3.04336783, -2.90014258, ...,  2.90014258,
          3.04336783,  3.30088288]),
  array([10.46027076, 10.47197813, 10.54273278, ..., 12.92391488,
         12.92999391, 12.93675402])),
 (0.376514802311084, 12.007111579203219, 0.9958381515984676))
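  • Comparing the SalePrice charts before and after removing the outliers, the version with outliers removed and the log transform applied is closer to a normal distribution (the Q-Q fit in Out[28] has r ≈ 0.996).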
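The effect of the log transform can also be put into a number by comparing skewness before and after. The sketch below uses synthetic log-normal values as a stand-in for prices, not the Kaggle data, and the distribution parameters are assumptions chosen only to produce a right-skewed sample.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "prices" (log-normal); not the Kaggle data.
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

# Skewness near 0 indicates a roughly symmetric distribution.
skew_raw = stats.skew(prices)
skew_log = stats.skew(np.log1p(prices))
print(f"skew raw: {skew_raw:.3f}, skew log1p: {skew_log:.3f}")
```

A skewness much closer to 0 after the transform is consistent with what the histograms and Q-Q plots show visually.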
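Note
  • A Q-Q (Quantile-Quantile) plot is a simple visual tool that compares the distribution of the sample data with a normal distribution to check whether the sample follows a normal distribution.
  • The stats subpackage of SciPy provides the probplot() function for computing and drawing a Q-Q plot.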
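When probplot() is called without a plot argument, it only returns the numbers behind the Q-Q plot: the theoretical and ordered sample quantiles, plus the slope, intercept and correlation coefficient r of the fitted line. A minimal sketch on synthetic samples (the sample parameters are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_sample = rng.normal(loc=12.0, scale=0.4, size=500)   # close to normal
skewed_sample = rng.exponential(scale=1.0, size=500)        # clearly not normal

# osm: theoretical quantiles, osr: ordered sample values.
(osm, osr), (slope, intercept, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"normal sample: slope={slope:.3f}, intercept={intercept:.3f}, r={r_normal:.4f}")
print(f"skewed sample: r={r_skewed:.4f}")
```

For a sample that is close to normal, the slope and intercept approximate its standard deviation and mean, and r is close to 1; a skewed sample gives a visibly lower r.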

 


Related posts

 

[kaggle] House Prices: Advanced Regression Techniques (1) Data structure
[kaggle] House Prices: Advanced Regression Techniques (2) Encoding categorical data
[kaggle] House Prices: Advanced Regression Techniques (3) Graphs
[kaggle] House Prices: Advanced Regression Techniques (4) Correlation and normal distribution