Zillow's Home Value Prediction

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()

%matplotlib inline

pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999
pd.options.display.float_format = lambda x: f'{x:.3f}'

1. train data

# train file exploration
train_df = pd.read_csv("../data/zillow-prize-1/train_2016_v2.csv", parse_dates=["transactiondate"])
train_df.shape

(90275, 3)

train_df.head()

Logerror

Let's look at the "logerror" variable, the target of this competition. This is the first step of the analysis.

plt.figure(figsize=(8,6))
plt.scatter(range(train_df.shape[0]), np.sort(train_df.logerror.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('logerror', fontsize=12)
plt.show()

Outliers are visible at both ends. Let's cap them at the 1st and 99th percentiles and then look at a histogram.

# Use the 99th percentile as the upper limit
ulimit = np.percentile(train_df.logerror.values, 99)
# Use the 1st percentile as the lower limit
llimit = np.percentile(train_df.logerror.values, 1)

# Replace logerror values above ulimit with ulimit (.loc avoids chained assignment)
train_df.loc[train_df['logerror'] > ulimit, 'logerror'] = ulimit
# Replace logerror values below llimit with llimit
train_df.loc[train_df['logerror'] < llimit, 'logerror'] = llimit

plt.figure(figsize=(12,8))
sns.histplot(train_df.logerror.values, bins=50, kde=False)
plt.xlabel('logerror', fontsize=12)
plt.show()

After capping, the distribution looks approximately normal.
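
The percentile capping described above can also be written with Series.clip. This is a minimal sketch on synthetic data, not the Zillow file; the series `s` and the injected extreme values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_df.logerror (hypothetical data)
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=0.0, scale=0.1, size=1000))
s.iloc[0] = 5.0    # inject an extreme high value
s.iloc[1] = -5.0   # inject an extreme low value

# 1st and 99th percentiles of the raw values
llimit, ulimit = np.percentile(s.values, [1, 99])

# clip caps everything outside [llimit, ulimit] in one step
capped = s.clip(lower=llimit, upper=ulimit)
```

Note that capping keeps every row: the extreme values are moved to the limits rather than dropped, so the two outermost histogram bins end up slightly taller.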

Transaction Date

Next, the date variable. Let's check the number of transactions in each month.

train_df['transaction_month'] = train_df['transactiondate'].dt.month
cnt_srs = train_df['transaction_month'].value_counts()
plt.figure(figsize=(12,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[3])
plt.xticks(rotation='vertical')
plt.xlabel('Month of transaction', fontsize=12)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.show()

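The chart reflects how the train data was built: it contains every transaction before October 15, 2016 but only some of the transactions after that date. That is why the bars for the last three months (October to December) are shorter.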
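
To check the cutoff numerically rather than from the chart, one option is to count rows per month and on each side of October 15. A small sketch with made-up dates; `toy_df` and its values are hypothetical stand-ins for train_df.

```python
import pandas as pd

# Hypothetical transaction dates standing in for train_df['transactiondate']
dates = pd.to_datetime([
    "2016-01-05", "2016-01-20", "2016-03-02",
    "2016-10-10", "2016-10-20", "2016-12-01",
])
toy_df = pd.DataFrame({"transactiondate": dates})

# Rows per calendar month, ordered by month number
month_counts = toy_df["transactiondate"].dt.month.value_counts().sort_index()

# Rows before and after the cutoff date mentioned in the text
cutoff = pd.Timestamp("2016-10-15")
n_before = (toy_df["transactiondate"] < cutoff).sum()
n_after = (toy_df["transactiondate"] >= cutoff).sum()
```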

train_df['parcelid'].value_counts().reset_index()

ParcelId

# Count how many parcels appear once, twice, ... (works regardless of how reset_index names the columns)
train_df['parcelid'].value_counts().value_counts()

1    90026
2      123
3        1
Name: parcelid, dtype: int64

Most parcelid values appear only once.
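
The "count of counts" idea can be tried on a toy frame. The ids below are invented for illustration; `duplicated(keep=False)` is one way to pull out the rows of parcels that were sold more than once.

```python
import pandas as pd

# Hypothetical ids standing in for train_df['parcelid']
toy_df = pd.DataFrame({"parcelid": [101, 102, 102, 103, 104, 104, 104]})

# How many times each parcelid occurs
per_parcel = toy_df["parcelid"].value_counts()

# How many parcels occur once, twice, three times
count_of_counts = per_parcel.value_counts().sort_index()

# All rows belonging to parcels that appear more than once
repeated_rows = toy_df[toy_df["parcelid"].duplicated(keep=False)]
```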

2. properties data

# property features for 2016
prop_df = pd.read_csv("../data/zillow-prize-1/properties_2016.csv")
prop_df.shape

c:\users\hanbit\appdata\local\programs\python\python37\lib\site-packages\IPython\core\interactiveshell.py:3146: DtypeWarning: Columns (22,32,34,49,55) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

(2985217, 58)
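
The DtypeWarning above suggests two remedies: pass low_memory=False, or declare dtypes. By position in the missing-value listing, columns 22, 32, 34, 49 and 55 appear to be hashottuborspa, propertycountylandusecode, propertyzoningdesc, fireplaceflag and taxdelinquencyflag. A hedged sketch of both options on a tiny in-memory CSV (the CSV contents are invented):

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for properties_2016.csv (hypothetical values)
csv_text = (
    "parcelid,hashottuborspa,propertycountylandusecode\n"
    "1,True,0100\n"
    "2,,010D\n"
    "3,,1200\n"
)

# Option 1: parse the whole file at once so each column gets a single inferred dtype
df_a = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: declare the ambiguous columns explicitly
df_b = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"hashottuborspa": object, "propertycountylandusecode": str},
)
```

Declaring the land-use code as a string also keeps leading zeros such as "0100" intact.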

prop_df.head()

There are many NaN values. Let's first look at the distribution of missing values.

missing value

missing_df = prop_df.isnull().sum(axis=0).reset_index()

missing_df

missing_df.columns= ['column_name', 'missing_count']
missing_df = missing_df[missing_df['missing_count']>0]
missing_df = missing_df.sort_values(by='missing_count')

ind = np.arange(missing_df.shape[0])
width=0.9
fig, ax = plt.subplots(figsize=(15,20))
rects = ax.barh(ind, missing_df.missing_count.values, color='r')
ax.set_yticks(ind )
ax.set_yticklabels(missing_df.column_name.values, rotation= 'horizontal', fontsize=15)
ax.set_xlabel("Count of missing values", fontsize=15)
ax.set_title("Number of missing values in each column", fontsize=20)
plt.show()

For variables with many missing values, we should assess whether they are still meaningful and consider dropping them.
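
One way to make the drop decision concrete is to look at the missing ratio instead of the raw count. A minimal sketch on a toy frame; the 50% threshold is an arbitrary assumption, not a value from this post.

```python
import numpy as np
import pandas as pd

# Toy frame reusing a few column names from prop_df (values are invented)
toy_df = pd.DataFrame({
    "parcelid": [1, 2, 3, 4],
    "basementsqft": [np.nan, np.nan, np.nan, 500.0],
    "yearbuilt": [1948.0, 1947.0, np.nan, 1990.0],
})

# Fraction of missing values per column
missing_ratio = toy_df.isnull().mean()

# Hypothetical cutoff: columns with more than 50% missing become drop candidates
threshold = 0.5
drop_candidates = missing_ratio[missing_ratio > threshold].index.tolist()

reduced_df = toy_df.drop(columns=drop_candidates)
```

A high missing ratio alone does not prove a column is useless (a missing poolcnt may simply mean "no pool"), so the candidates still deserve a manual check.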

longitude, latitude

Now let's look at how the data is distributed by latitude and longitude.

# jointplot creates its own figure, so a separate plt.figure() call is not needed
g = sns.jointplot(x=prop_df.latitude.values, y=prop_df.longitude.values, height=10)
# label the joint axes through the returned JointGrid
g.set_axis_labels('Latitude', 'Longitude', fontsize=12)
plt.show()

As the map above suggests, the data provides a full list of real estate properties in three counties (Los Angeles, Orange and Ventura, California) for 2016.
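
The raw latitude and longitude values look like degrees multiplied by 1e6 (for example 34144442 would be 34.144442, which fits the Los Angeles area). That scale factor is an assumption inferred from the values, so treat this conversion as a sketch:

```python
import pandas as pd

# Two coordinate pairs copied from the prop_df.head() output
geo = pd.DataFrame({
    "latitude": [34144442.0, 33989359.0],
    "longitude": [-118654084.0, -118394633.0],
})

# Assumption: the raw values are degrees multiplied by 1e6
SCALE = 1e6
geo["lat_deg"] = geo["latitude"] / SCALE
geo["lon_deg"] = geo["longitude"] / SCALE
```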

We have 90,275 rows in train and 2,985,217 rows in the properties file.

Output of train_df.head():

	parcelid	logerror	transactiondate
0	11016594	0.028	2016-01-01
1	14366692	-0.168	2016-01-01
2	12098116	-0.004	2016-01-01
3	12643413	0.022	2016-01-02
4	14432541	-0.005	2016-01-02

Output of train_df['parcelid'].value_counts().reset_index():

	index	parcelid
0	11842707	3
1	12613442	2
2	12032773	2
3	11729067	2
4	11845988	2
...	...	...
90145	11199862	1
90146	11726199	1
90147	12096888	1
90148	11181433	1
90149	12438686	1

Output of prop_df.head():

	parcelid	airconditioningtypeid	architecturalstyletypeid	basementsqft	buildingclasstypeid	buildingqualitytypeid	calculatedbathnbr	decktypeid	finishedfloor1squarefeet	calculatedfinishedsquarefeet	finishedsquarefeet12	finishedsquarefeet13	finishedsquarefeet15	finishedsquarefeet50	finishedsquarefeet6	fips	fireplacecnt	fullbathcnt	garagecarcnt	garagetotalsqft	hashottuborspa	heatingorsystemtypeid	latitude	longitude	lotsizesquarefeet	poolcnt	poolsizesum	pooltypeid10	pooltypeid2	pooltypeid7	propertycountylandusecode	propertylandusetypeid	propertyzoningdesc	rawcensustractandblock	regionidcity	regionidcounty	regionidneighborhood	regionidzip	storytypeid	threequarterbathnbr	typeconstructiontypeid	unitcnt	yardbuildingsqft17	yardbuildingsqft26	yearbuilt	numberofstories	fireplaceflag	structuretaxvaluedollarcnt	taxvaluedollarcnt	assessmentyear	landtaxvaluedollarcnt	taxamount	taxdelinquencyflag	taxdelinquencyyear	censustractandblock
0	10754147	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	6037.000	nan	nan	nan	nan	NaN	nan	34144442.000	-118654084.000	85768.000	nan	nan	nan	nan	nan	010D	269.000	NaN	60378002.041	37688.000	3101.000	nan	96337.000	nan	nan	nan	nan	nan	nan	nan	nan	NaN	nan	9.000	2015.000	9.000	nan	NaN	nan	nan
1	10759547	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	6037.000	nan	nan	nan	nan	NaN	nan	34140430.000	-118625364.000	4083.000	nan	nan	nan	nan	nan	0109	261.000	LCA11*	60378001.011	37688.000	3101.000	nan	96337.000	nan	nan	nan	nan	nan	nan	nan	nan	NaN	nan	27516.000	2015.000	27516.000	nan	NaN	nan	nan
2	10843547	nan	nan	nan	nan	nan	nan	nan	nan	73026.000	nan	nan	73026.000	nan	nan	6037.000	nan	nan	nan	nan	NaN	nan	33989359.000	-118394633.000	63085.000	nan	nan	nan	nan	nan	1200	47.000	LAC2	60377030.012	51617.000	3101.000	nan	96095.000	nan	nan	nan	2.000	nan	nan	nan	nan	NaN	650756.000	1413387.000	2015.000	762631.000	20800.370	NaN	nan	nan
3	10859147	nan	nan	nan	3.000	7.000	nan	nan	nan	5068.000	nan	nan	5068.000	nan	nan	6037.000	nan	nan	nan	nan	NaN	nan	34148863.000	-118437206.000	7521.000	nan	nan	nan	nan	nan	1200	47.000	LAC2	60371412.023	12447.000	3101.000	27080.000	96424.000	nan	nan	nan	nan	nan	nan	1948.000	1.000	NaN	571346.000	1156834.000	2015.000	585488.000	14557.570	NaN	nan	nan
4	10879947	nan	nan	nan	4.000	nan	nan	nan	nan	1776.000	nan	nan	1776.000	nan	nan	6037.000	nan	nan	nan	nan	NaN	nan	34194168.000	-118385816.000	8512.000	nan	nan	nan	nan	nan	1210	31.000	LAM1	60371232.052	12447.000	3101.000	46795.000	96450.000	nan	nan	nan	1.000	nan	nan	1947.000	nan	NaN	193796.000	433491.000	2015.000	239695.000	5725.170	NaN	nan	nan

Output of missing_df (null count per column, before the columns are renamed):

	index	0
0	parcelid	0
1	airconditioningtypeid	2173698
2	architecturalstyletypeid	2979156
3	basementsqft	2983589
4	bathroomcnt	11462
5	bedroomcnt	11450
6	buildingclasstypeid	2972588
7	buildingqualitytypeid	1046729
8	calculatedbathnbr	128912
9	decktypeid	2968121
10	finishedfloor1squarefeet	2782500
11	calculatedfinishedsquarefeet	55565
12	finishedsquarefeet12	276033
13	finishedsquarefeet13	2977545
14	finishedsquarefeet15	2794419
15	finishedsquarefeet50	2782500
16	finishedsquarefeet6	2963216
17	fips	11437
18	fireplacecnt	2672580
19	fullbathcnt	128912
20	garagecarcnt	2101950
21	garagetotalsqft	2101950
22	hashottuborspa	2916203
23	heatingorsystemtypeid	1178816
24	latitude	11437
25	longitude	11437
26	lotsizesquarefeet	276099
27	poolcnt	2467683
28	poolsizesum	2957257
29	pooltypeid10	2948278
30	pooltypeid2	2953142
31	pooltypeid7	2499758
32	propertycountylandusecode	12277
33	propertylandusetypeid	11437
34	propertyzoningdesc	1006588
35	rawcensustractandblock	11437
36	regionidcity	62845
37	regionidcounty	11437
38	regionidneighborhood	1828815
39	regionidzip	13980
40	roomcnt	11475
41	storytypeid	2983593
42	threequarterbathnbr	2673586
43	typeconstructiontypeid	2978470
44	unitcnt	1007727
45	yardbuildingsqft17	2904862
46	yardbuildingsqft26	2982570
47	yearbuilt	59928
48	numberofstories	2303148
49	fireplaceflag	2980054
50	structuretaxvaluedollarcnt	54982
51	taxvaluedollarcnt	42550
52	assessmentyear	11439
53	landtaxvaluedollarcnt	67733
54	taxamount	31250
55	taxdelinquencyflag	2928755
56	taxdelinquencyyear	2928753
57	censustractandblock	75126


춤추는 개발자

[kaggle][transcription] Zillow Prize: Zillow’s Home Value Prediction (1)
