
[kaggle] House Prices: Advanced Regression Techniques (2) Categorical Data Encoding

bisi 2020. 4. 27. 13:36

I analyzed the House Prices: Advanced Regression Techniques dataset hosted on Kaggle.

Before predicting house prices with regression, I explored the data in the four steps listed at the end of this post.

 

Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

 



Categorical Data Encoding

 


Categorical Data Encoding (label encoding with LabelEncoder, not one-hot encoding)
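
Label encoding and one-hot encoding are easy to mix up, so here is a minimal sketch of the difference. The toy DataFrame is an assumption made for illustration; only the MSZoning values come from the dataset. Label encoding keeps a single column of integer codes, while one-hot encoding creates one 0/1 indicator column per category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with a few real MSZoning values; the DataFrame itself is illustrative.
toy = pd.DataFrame({"MSZoning": ["RL", "RM", "FV", "RL"]})

# Label encoding: a single integer column; classes are numbered in sorted order.
label_encoded = LabelEncoder().fit_transform(toy["MSZoning"])

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(toy["MSZoning"], prefix="MSZoning", dtype=int)

print(label_encoded.tolist())
print(one_hot.columns.tolist())
```

The cell that follows uses the first approach (LabelEncoder), so each categorical column stays a single column of integers.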

In [15]:
from sklearn.preprocessing import LabelEncoder

# 'MSSubClass' is excluded: it is a categorical feature stored as numbers.
category_col_list = ["MSZoning","Street","Alley","LotShape","LandContour","Utilities","LotConfig","LandSlope","Neighborhood","Condition1","Condition2","BldgType","HouseStyle","RoofStyle","RoofMatl","Exterior1st","Exterior2nd","MasVnrType","ExterQual","ExterCond","Foundation","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","Heating","HeatingQC","CentralAir","Electrical","KitchenQual","Functional","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond","PavedDrive","PoolQC","Fence","MiscFeature","SaleType","SaleCondition"]

category_col_c_df = category_col_df.copy()
# category_col_c_df.fillna('missing', inplace=True)

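def cate_to_num(data, col_name, categories):
    # Replace the string values in data[col_name] with integer labels.
    # The parameter is named 'categories' to avoid shadowing the built-in 'list'.
    le = LabelEncoder()
    le.fit(categories)
    data[col_name] = le.transform(data[col_name])
    return data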

for category_col in category_col_list:
    # Still deciding how to handle missing values; for now they are treated as the string 'missing'.
    col_value_list = category_col_c_df[category_col].unique().tolist()
    print("list : ", col_value_list)
    category_col_c_df = cate_to_num(category_col_c_df, category_col, col_value_list)


print("data : ",category_col_c_df)
list :  ['RL', 'RM', 'C (all)', 'FV', 'RH']
list :  ['Pave', 'Grvl']
list :  ['missing', 'Grvl', 'Pave']
list :  ['Reg', 'IR1', 'IR2', 'IR3']
list :  ['Lvl', 'Bnk', 'Low', 'HLS']
list :  ['AllPub', 'NoSeWa']
list :  ['Inside', 'FR2', 'Corner', 'CulDSac', 'FR3']
list :  ['Gtl', 'Mod', 'Sev']
list :  ['CollgCr', 'Veenker', 'Crawfor', 'NoRidge', 'Mitchel', 'Somerst', 'NWAmes', 'OldTown', 'BrkSide', 'Sawyer', 'NridgHt', 'NAmes', 'SawyerW', 'IDOTRR', 'MeadowV', 'Edwards', 'Timber', 'Gilbert', 'StoneBr', 'ClearCr', 'NPkVill', 'Blmngtn', 'BrDale', 'SWISU', 'Blueste']
list :  ['Norm', 'Feedr', 'PosN', 'Artery', 'RRAe', 'RRNn', 'RRAn', 'PosA', 'RRNe']
list :  ['Norm', 'Artery', 'RRNn', 'Feedr', 'PosN', 'PosA', 'RRAn', 'RRAe']
list :  ['1Fam', '2fmCon', 'Duplex', 'TwnhsE', 'Twnhs']
list :  ['2Story', '1Story', '1.5Fin', '1.5Unf', 'SFoyer', 'SLvl', '2.5Unf', '2.5Fin']
list :  ['Gable', 'Hip', 'Gambrel', 'Mansard', 'Flat', 'Shed']
list :  ['CompShg', 'WdShngl', 'Metal', 'WdShake', 'Membran', 'Tar&Grv', 'Roll', 'ClyTile']
list :  ['VinylSd', 'MetalSd', 'Wd Sdng', 'HdBoard', 'BrkFace', 'WdShing', 'CemntBd', 'Plywood', 'AsbShng', 'Stucco', 'BrkComm', 'AsphShn', 'Stone', 'ImStucc', 'CBlock']
list :  ['VinylSd', 'MetalSd', 'Wd Shng', 'HdBoard', 'Plywood', 'Wd Sdng', 'CmentBd', 'BrkFace', 'Stucco', 'AsbShng', 'Brk Cmn', 'ImStucc', 'AsphShn', 'Stone', 'Other', 'CBlock']
list :  ['BrkFace', 'None', 'Stone', 'BrkCmn', 'missing']
list :  ['Gd', 'TA', 'Ex', 'Fa']
list :  ['TA', 'Gd', 'Fa', 'Po', 'Ex']
list :  ['PConc', 'CBlock', 'BrkTil', 'Wood', 'Slab', 'Stone']
list :  ['Gd', 'TA', 'Ex', 'missing', 'Fa']
list :  ['TA', 'Gd', 'missing', 'Fa', 'Po']
list :  ['No', 'Gd', 'Mn', 'Av', 'missing']
list :  ['GLQ', 'ALQ', 'Unf', 'Rec', 'BLQ', 'missing', 'LwQ']
list :  ['Unf', 'BLQ', 'missing', 'ALQ', 'Rec', 'LwQ', 'GLQ']
list :  ['GasA', 'GasW', 'Grav', 'Wall', 'OthW', 'Floor']
list :  ['Ex', 'Gd', 'TA', 'Fa', 'Po']
list :  ['Y', 'N']
list :  ['SBrkr', 'FuseF', 'FuseA', 'FuseP', 'Mix', 'missing']
list :  ['Gd', 'TA', 'Ex', 'Fa']
list :  ['Typ', 'Min1', 'Maj1', 'Min2', 'Mod', 'Maj2', 'Sev']
list :  ['missing', 'TA', 'Gd', 'Fa', 'Ex', 'Po']
list :  ['Attchd', 'Detchd', 'BuiltIn', 'CarPort', 'missing', 'Basment', '2Types']
list :  ['RFn', 'Unf', 'Fin', 'missing']
list :  ['TA', 'Fa', 'Gd', 'missing', 'Ex', 'Po']
list :  ['TA', 'Fa', 'missing', 'Gd', 'Po', 'Ex']
list :  ['Y', 'N', 'P']
list :  ['missing', 'Ex', 'Fa', 'Gd']
list :  ['missing', 'MnPrv', 'GdWo', 'GdPrv', 'MnWw']
list :  ['missing', 'Shed', 'Gar2', 'Othr', 'TenC']
list :  ['WD', 'New', 'COD', 'ConLD', 'ConLI', 'CWD', 'ConLw', 'Con', 'Oth']
list :  ['Normal', 'Abnorml', 'Partial', 'AdjLand', 'Alloca', 'Family']
data :          Id  MSSubClass  MSZoning  Street  Alley  LotShape  LandContour  \
0        1          60         3       1      2         3            3   
1        2          20         3       1      2         3            3   
2        3          60         3       1      2         0            3   
3        4          70         3       1      2         0            3   
4        5          60         3       1      2         0            3   
5        6          50         3       1      2         0            3   
6        7          20         3       1      2         3            3   
7        8          60         3       1      2         0            3   
8        9          50         4       1      2         3            3   
9       10         190         3       1      2         3            3   
10      11          20         3       1      2         3            3   
11      12          60         3       1      2         0            3   
12      13          20         3       1      2         1            3   
13      14          20         3       1      2         0            3   
14      15          20         3       1      2         0            3   
15      16          45         4       1      2         3            3   
16      17          20         3       1      2         0            3   
17      18          90         3       1      2         3            3   
18      19          20         3       1      2         3            3   
19      20          20         3       1      2         3            3   
20      21          60         3       1      2         0            3   
21      22          45         4       1      0         3            0   
22      23          20         3       1      2         3            3   
23      24         120         4       1      2         3            3   
24      25          20         3       1      2         0            3   
25      26          20         3       1      2         3            3   
26      27          20         3       1      2         3            3   
27      28          20         3       1      2         3            3   
28      29          20         3       1      2         0            3   
29      30          30         4       1      2         0            3   
...    ...         ...       ...     ...    ...       ...          ...   
1430  1431          60         3       1      2         2            3   
1431  1432         120         3       1      2         0            3   
1432  1433          30         3       1      0         3            3   
1433  1434          60         3       1      2         0            3   
1434  1435          20         3       1      2         3            2   
1435  1436          20         3       1      2         3            3   
1436  1437          20         3       1      2         3            3   
1437  1438          20         3       1      2         3            3   
1438  1439          20         4       1      2         3            3   
1439  1440          60         3       1      2         3            3   
1440  1441          70         3       1      2         0            0   
1441  1442         120         4       1      2         3            3   
1442  1443          60         1       1      2         3            3   
1443  1444          30         3       1      2         3            3   
1444  1445          20         3       1      2         3            3   
1445  1446          85         3       1      2         3            3   
1446  1447          20         3       1      2         0            3   
1447  1448          60         3       1      2         3            3   
1448  1449          50         3       1      2         3            3   
1449  1450         180         4       1      2         3            3   
1450  1451          90         3       1      2         3            3   
1451  1452          20         3       1      2         3            3   
1452  1453         180         4       1      2         3            3   
1453  1454          20         3       1      2         3            3   
1454  1455          20         1       1      1         3            3   
1455  1456          60         3       1      2         3            3   
1456  1457          20         3       1      2         3            3   
1457  1458          70         3       1      2         3            3   
1458  1459          20         3       1      2         3            3   
1459  1460          20         3       1      2         3            3   

      Utilities  LotConfig  LandSlope  ...  GarageQual  GarageCond  \
0             0          4          0  ...           4           4   
1             0          2          0  ...           4           4   
2             0          4          0  ...           4           4   
3             0          0          0  ...           4           4   
4             0          2          0  ...           4           4   
5             0          4          0  ...           4           4   
6             0          4          0  ...           4           4   
7             0          0          0  ...           4           4   
8             0          4          0  ...           1           4   
9             0          0          0  ...           2           4   
10            0          4          0  ...           4           4   
11            0          4          0  ...           4           4   
12            0          4          0  ...           4           4   
13            0          4          0  ...           4           4   
14            0          0          0  ...           4           4   
15            0          0          0  ...           4           4   
16            0          1          0  ...           4           4   
17            0          4          0  ...           4           4   
18            0          4          0  ...           4           4   
19            0          4          0  ...           4           4   
20            0          0          0  ...           4           4   
21            0          4          0  ...           4           4   
22            0          4          0  ...           4           4   
23            0          4          0  ...           4           4   
24            0          4          0  ...           4           4   
25            0          0          0  ...           4           4   
26            0          0          0  ...           4           4   
27            0          4          0  ...           4           4   
28            0          1          0  ...           4           4   
29            0          4          0  ...           1           4   
...         ...        ...        ...  ...         ...         ...   
1430          0          4          0  ...           4           4   
1431          0          4          0  ...           4           4   
1432          0          4          0  ...           1           1   
1433          0          4          0  ...           4           4   
1434          0          4          1  ...           4           4   
1435          0          4          0  ...           4           4   
1436          0          2          0  ...           4           4   
1437          0          2          0  ...           4           4   
1438          0          4          0  ...           4           4   
1439          0          4          0  ...           4           4   
1440          0          4          1  ...           4           4   
1441          0          4          0  ...           4           4   
1442          0          4          0  ...           4           4   
1443          0          4          0  ...           1           3   
1444          0          2          0  ...           4           4   
1445          0          4          0  ...           4           4   
1446          0          1          0  ...           4           4   
1447          0          4          0  ...           4           4   
1448          0          4          0  ...           1           4   
1449          0          4          0  ...           5           5   
1450          0          2          0  ...           5           5   
1451          0          4          0  ...           4           4   
1452          0          4          0  ...           4           4   
1453          0          4          0  ...           5           5   
1454          0          4          0  ...           4           4   
1455          0          4          0  ...           4           4   
1456          0          4          0  ...           4           4   
1457          0          4          0  ...           4           4   
1458          0          4          0  ...           4           4   
1459          0          4          0  ...           4           4   

      PavedDrive  PoolQC  Fence  MiscFeature  MoSold  YrSold  SaleType  \
0              2       3      4            4       2    2008         8   
1              2       3      4            4       5    2007         8   
2              2       3      4            4       9    2008         8   
3              2       3      4            4       2    2006         8   
4              2       3      4            4      12    2008         8   
5              2       3      2            2      10    2009         8   
6              2       3      4            4       8    2007         8   
7              2       3      4            2      11    2009         8   
8              2       3      4            4       4    2008         8   
9              2       3      4            4       1    2008         8   
10             2       3      4            4       2    2008         8   
11             2       3      4            4       7    2006         6   
12             2       3      4            4       9    2008         8   
13             2       3      4            4       8    2007         6   
14             2       3      1            4       5    2008         8   
15             2       3      0            4       7    2007         8   
16             2       3      4            2       3    2010         8   
17             2       3      4            2      10    2006         8   
18             2       3      4            4       6    2008         8   
19             2       3      2            4       5    2009         0   
20             2       3      4            4      11    2006         6   
21             0       3      0            4       6    2007         8   
22             2       3      4            4       9    2008         8   
23             2       3      4            4       6    2007         8   
24             2       3      2            4       5    2010         8   
25             2       3      4            4       7    2009         8   
26             2       3      4            4       5    2010         8   
27             2       3      4            4       5    2010         8   
28             2       3      4            4      12    2006         8   
29             2       3      4            4       5    2008         8   
...          ...     ...    ...          ...     ...     ...       ...   
1430           2       3      4            4       7    2006         8   
1431           2       3      4            4      10    2009         8   
1432           2       3      4            4       8    2007         8   
1433           2       3      4            4       5    2008         8   
1434           1       3      4            4       5    2006         8   
1435           2       3      0            4       7    2008         0   
1436           2       3      1            4       5    2007         8   
1437           2       3      4            4      11    2008         6   
1438           2       3      2            4       4    2010         8   
1439           2       3      4            4      11    2007         8   
1440           2       3      4            4       9    2008         8   
1441           2       3      4            4       5    2008         8   
1442           2       3      4            4       4    2009         8   
1443           1       3      4            4       5    2009         8   
1444           2       3      4            4      11    2007         8   
1445           2       3      4            4       5    2007         8   
1446           1       3      4            4       4    2010         8   
1447           2       3      4            4      12    2007         8   
1448           2       3      1            4       5    2007         8   
1449           2       3      4            4       8    2006         8   
1450           2       3      4            4       9    2009         8   
1451           2       3      4            4       5    2009         6   
1452           2       3      4            4       5    2006         8   
1453           2       3      4            4       7    2006         8   
1454           2       3      4            4      10    2009         8   
1455           2       3      4            4       8    2007         8   
1456           2       3      2            4       2    2010         8   
1457           2       3      0            2       5    2010         8   
1458           2       3      4            4       4    2010         8   
1459           2       3      4            4       6    2008         8   

      SaleCondition  
0                 4  
1                 4  
2                 4  
3                 0  
4                 4  
5                 4  
6                 4  
7                 4  
8                 0  
9                 4  
10                4  
11                5  
12                4  
13                5  
14                4  
15                4  
16                4  
17                4  
18                4  
19                0  
20                5  
21                4  
22                4  
23                4  
24                4  
25                4  
26                4  
27                4  
28                4  
29                4  
...             ...  
1430              4  
1431              4  
1432              4  
1433              4  
1434              4  
1435              0  
1436              4  
1437              5  
1438              4  
1439              4  
1440              4  
1441              4  
1442              4  
1443              4  
1444              4  
1445              4  
1446              4  
1447              4  
1448              4  
1449              0  
1450              4  
1451              5  
1452              4  
1453              0  
1454              4  
1455              4  
1456              4  
1457              4  
1458              4  
1459              4  

[1460 rows x 50 columns]
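
To read the integer codes in the printed DataFrame, it helps to know that LabelEncoder numbers the classes in sorted order, not in the order they first appear. The following is a minimal sketch using the five MSZoning values printed above; the variable names are chosen for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# The five MSZoning values printed in the output above.
mszoning_values = ['RL', 'RM', 'C (all)', 'FV', 'RH']

le = LabelEncoder()
le.fit(mszoning_values)

# classes_ holds the categories in sorted order; the position is the integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes.
decoded = le.inverse_transform([3, 4, 1]).tolist()
print(decoded)
```

This is why the first row, whose MSZoning is 'RL', shows 3 in the output. Note that the codes follow alphabetical order, so for quality columns such as ExterQual ('Ex', 'Fa', 'Gd', 'TA') the integers do not reflect the quality ranking.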

 


Related posts

 

[kaggle] House Prices: Advanced Regression Techniques (1) Data Structure
[kaggle] House Prices: Advanced Regression Techniques (2) Categorical Data Encoding
[kaggle] House Prices: Advanced Regression Techniques (3) Graphs
[kaggle] House Prices: Advanced Regression Techniques (4) Correlation, Normal Distribution