[kaggle] House Prices: Advanced Regression Techniques (2) Categorical Data Encoding
bisi
2020. 4. 27. 13:36
I analyzed the House Prices: Advanced Regression Techniques dataset hosted on Kaggle.
Before predicting house prices with regression, I split the data exploration into four steps; this post covers the one listed below.
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
Categorical data encoding
Categorical Data Encoding (Label Encoding)
In [15]:
from sklearn.preprocessing import LabelEncoder

# 'MSSubClass' is excluded: it is a categorical column stored as numbers
category_col_list = ["MSZoning","Street","Alley","LotShape","LandContour","Utilities","LotConfig","LandSlope","Neighborhood","Condition1","Condition2","BldgType","HouseStyle","RoofStyle","RoofMatl","Exterior1st","Exterior2nd","MasVnrType","ExterQual","ExterCond","Foundation","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","Heating","HeatingQC","CentralAir","Electrical","KitchenQual","Functional","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond","PavedDrive","PoolQC","Fence","MiscFeature","SaleType","SaleCondition"]
category_col_c_df = category_col_df.copy()
# category_col_c_df.fillna('missing', inplace=True)

def cate_to_num(data, col_name, categories):
    # Fit a LabelEncoder on the column's categories and replace the column with integer codes
    le = LabelEncoder()
    le.fit(categories)
    data[col_name] = le.transform(data[col_name])
    return data

for category_col in category_col_list:
    # Still undecided on how to handle missing values; for now they are treated as strings.
    col_value_list = category_col_c_df[category_col].unique().tolist()
    print("list : ", col_value_list)
    category_col_c_df = cate_to_num(category_col_c_df, category_col, col_value_list)

print("data : ", category_col_c_df)
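cate_to_num fits a LabelEncoder on each column's unique values. LabelEncoder keeps its classes in sorted order, so the integer assigned to a label is its position in that sorted list, not its position in the list returned by unique(). The following is a minimal sketch on a toy column, assuming missing values are replaced with the string 'missing' as in the commented-out fillna line; toy_df and alley_mapping are hypothetical names that do not appear in the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for one categorical column (labels borrowed from 'Alley').
toy_df = pd.DataFrame({"Alley": [np.nan, "Grvl", "Pave", np.nan]})

# Turn missing values into an explicit category before encoding.
toy_df["Alley"] = toy_df["Alley"].fillna("missing")

le = LabelEncoder()
toy_df["Alley_code"] = le.fit_transform(toy_df["Alley"])

# classes_ is sorted, so each integer code is the label's index in this array.
alley_mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(alley_mapping)
print(toy_df)
```

Keeping a mapping like this per column makes it possible to translate the integer codes back into the original labels later, for example with le.inverse_transform.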
Output of In [15]:
list : ['RL', 'RM', 'C (all)', 'FV', 'RH']
list : ['Pave', 'Grvl']
list : ['missing', 'Grvl', 'Pave']
list : ['Reg', 'IR1', 'IR2', 'IR3']
list : ['Lvl', 'Bnk', 'Low', 'HLS']
list : ['AllPub', 'NoSeWa']
list : ['Inside', 'FR2', 'Corner', 'CulDSac', 'FR3']
list : ['Gtl', 'Mod', 'Sev']
list : ['CollgCr', 'Veenker', 'Crawfor', 'NoRidge', 'Mitchel', 'Somerst', 'NWAmes', 'OldTown', 'BrkSide', 'Sawyer', 'NridgHt', 'NAmes', 'SawyerW', 'IDOTRR', 'MeadowV', 'Edwards', 'Timber', 'Gilbert', 'StoneBr', 'ClearCr', 'NPkVill', 'Blmngtn', 'BrDale', 'SWISU', 'Blueste']
list : ['Norm', 'Feedr', 'PosN', 'Artery', 'RRAe', 'RRNn', 'RRAn', 'PosA', 'RRNe']
list : ['Norm', 'Artery', 'RRNn', 'Feedr', 'PosN', 'PosA', 'RRAn', 'RRAe']
list : ['1Fam', '2fmCon', 'Duplex', 'TwnhsE', 'Twnhs']
list : ['2Story', '1Story', '1.5Fin', '1.5Unf', 'SFoyer', 'SLvl', '2.5Unf', '2.5Fin']
list : ['Gable', 'Hip', 'Gambrel', 'Mansard', 'Flat', 'Shed']
list : ['CompShg', 'WdShngl', 'Metal', 'WdShake', 'Membran', 'Tar&Grv', 'Roll', 'ClyTile']
list : ['VinylSd', 'MetalSd', 'Wd Sdng', 'HdBoard', 'BrkFace', 'WdShing', 'CemntBd', 'Plywood', 'AsbShng', 'Stucco', 'BrkComm', 'AsphShn', 'Stone', 'ImStucc', 'CBlock']
list : ['VinylSd', 'MetalSd', 'Wd Shng', 'HdBoard', 'Plywood', 'Wd Sdng', 'CmentBd', 'BrkFace', 'Stucco', 'AsbShng', 'Brk Cmn', 'ImStucc', 'AsphShn', 'Stone', 'Other', 'CBlock']
list : ['BrkFace', 'None', 'Stone', 'BrkCmn', 'missing']
list : ['Gd', 'TA', 'Ex', 'Fa']
list : ['TA', 'Gd', 'Fa', 'Po', 'Ex']
list : ['PConc', 'CBlock', 'BrkTil', 'Wood', 'Slab', 'Stone']
list : ['Gd', 'TA', 'Ex', 'missing', 'Fa']
list : ['TA', 'Gd', 'missing', 'Fa', 'Po']
list : ['No', 'Gd', 'Mn', 'Av', 'missing']
list : ['GLQ', 'ALQ', 'Unf', 'Rec', 'BLQ', 'missing', 'LwQ']
list : ['Unf', 'BLQ', 'missing', 'ALQ', 'Rec', 'LwQ', 'GLQ']
list : ['GasA', 'GasW', 'Grav', 'Wall', 'OthW', 'Floor']
list : ['Ex', 'Gd', 'TA', 'Fa', 'Po']
list : ['Y', 'N']
list : ['SBrkr', 'FuseF', 'FuseA', 'FuseP', 'Mix', 'missing']
list : ['Gd', 'TA', 'Ex', 'Fa']
list : ['Typ', 'Min1', 'Maj1', 'Min2', 'Mod', 'Maj2', 'Sev']
list : ['missing', 'TA', 'Gd', 'Fa', 'Ex', 'Po']
list : ['Attchd', 'Detchd', 'BuiltIn', 'CarPort', 'missing', 'Basment', '2Types']
list : ['RFn', 'Unf', 'Fin', 'missing']
list : ['TA', 'Fa', 'Gd', 'missing', 'Ex', 'Po']
list : ['TA', 'Fa', 'missing', 'Gd', 'Po', 'Ex']
list : ['Y', 'N', 'P']
list : ['missing', 'Ex', 'Fa', 'Gd']
list : ['missing', 'MnPrv', 'GdWo', 'GdPrv', 'MnWw']
list : ['missing', 'Shed', 'Gar2', 'Othr', 'TenC']
list : ['WD', 'New', 'COD', 'ConLD', 'ConLI', 'CWD', 'ConLw', 'Con', 'Oth']
list : ['Normal', 'Abnorml', 'Partial', 'AdjLand', 'Alloca', 'Family']
data :
        Id  MSSubClass  MSZoning  Street  Alley  LotShape  LandContour  \
0        1          60         3       1      2         3            3
1        2          20         3       1      2         3            3
2        3          60         3       1      2         0            3
3        4          70         3       1      2         0            3
4        5          60         3       1      2         0            3
...    ...         ...       ...     ...    ...       ...          ...
1455  1456          60         3       1      2         3            3
1456  1457          20         3       1      2         3            3
1457  1458          70         3       1      2         3            3
1458  1459          20         3       1      2         3            3
1459  1460          20         3       1      2         3            3

      Utilities  LotConfig  LandSlope  ...  GarageQual  GarageCond  \
0             0          4          0  ...           4           4
1             0          2          0  ...           4           4
2             0          4          0  ...           4           4
3             0          0          0  ...           4           4
4             0          2          0  ...           4           4
...         ...        ...        ...  ...         ...         ...
1455          0          4          0  ...           4           4
1456          0          4          0  ...           4           4
1457          0          4          0  ...           4           4
1458          0          4          0  ...           4           4
1459          0          4          0  ...           4           4

      PavedDrive  PoolQC  Fence  MiscFeature  MoSold  YrSold  SaleType  \
0              2       3      4            4       2    2008         8
1              2       3      4            4       5    2007         8
2              2       3      4            4       9    2008         8
3              2       3      4            4       2    2006         8
4              2       3      4            4      12    2008         8
...          ...     ...    ...          ...     ...     ...       ...
1455           2       3      4            4       8    2007         8
1456           2       3      2            4       2    2010         8
1457           2       3      0            2       5    2010         8
1458           2       3      4            4       4    2010         8
1459           2       3      4            4       6    2008         8

      SaleCondition
0                 4
1                 4
2                 4
3                 0
4                 4
...             ...
1455              4
1456              4
1457              4
1458              4
1459              4

[1460 rows x 50 columns]
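The integer codes printed above carry an ordering (for MSZoning, 'RL' becomes 3 and 'RM' becomes 4) that nominal categories do not actually have, which can mislead a linear regression model. One-hot encoding is a common alternative that gives each category its own 0/1 column. Below is a minimal sketch using pandas.get_dummies on a toy frame; toy_df, one_hot_df, and the sample rows are assumptions made for illustration and are not taken from the original notebook.

```python
import pandas as pd

# Toy frame with two nominal columns (labels borrowed from the dataset).
toy_df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "MSZoning": ["RL", "RM", "RL", "FV"],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# Each listed column is replaced by one indicator column per category.
one_hot_df = pd.get_dummies(toy_df, columns=["MSZoning", "Street"], dtype=int)

print(one_hot_df.columns.tolist())
print(one_hot_df)
```

The trade-off is width: a column such as Neighborhood, with 25 distinct values, would expand into 25 indicator columns.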
Related posts in this series
[kaggle] House Prices: Advanced Regression Techniques (1) Data Format
[kaggle] House Prices: Advanced Regression Techniques (2) Categorical Data Encoding
[kaggle] House Prices: Advanced Regression Techniques (3) Graphs
[kaggle] House Prices: Advanced Regression Techniques (4) Correlation, Normal Distribution