Porto Seguro Safe Driver Prediction

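Porto Seguro is an insurance company.
The company wants to manage insurance costs by predicting auto insurance claims.
Inaccurate predictions raise the cost of insurance for good drivers and lower the price for bad ones.
Goal: build a model that predicts the probability that a driver will file an auto insurance claim in the next year.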

# Loading package

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Imputer was deprecated in scikit-learn 0.20 and removed in 0.22
# from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier

pd.set_option('display.max_columns',100)

1. Checking the data

Loading data

train = pd.read_csv('../data/porto-seguro-safe-driver-prediction/train.csv')
test = pd.read_csv('../data/porto-seguro-safe-driver-prediction/test.csv')

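Data description
- The task is to predict the probability that an auto insurance policyholder files a claim.
- Features that belong to similar groupings are tagged in the feature name (e.g. ind, reg, car, calc).
- Feature names carry the postfix bin for binary features and cat for categorical features.
- Features without these designations are either continuous or ordinal.
- A value of -1 indicates that the feature was missing from the observation.
- The target column indicates whether or not a claim was filed for that policyholder.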
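The naming convention above can be turned into a small lookup. The following is a minimal sketch on a hypothetical toy frame (only the column names mirror the dataset; the values and the helper name `feature_kind` are made up for illustration).

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "target": [0, 1, 0],
    "ps_ind_06_bin": [0, 1, 1],
    "ps_car_01_cat": [7, -1, 10],
    "ps_reg_03": [0.72, -1.0, 0.58],
})

def feature_kind(name):
    """Classify a column by the naming convention described above."""
    if name in ("id", "target"):
        return name
    if name.endswith("_bin"):
        return "binary"
    if name.endswith("_cat"):
        return "categorical"
    return "continuous_or_ordinal"

kinds = {col: feature_kind(col) for col in toy.columns}
missing_counts = (toy == -1).sum()  # -1 marks a missing value

print(kinds)
print(missing_counts)
```

Checking the postfix with `endswith` is slightly stricter than a substring test, which is a design choice of this sketch and not something the data description requires.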

train.head()

train.tail()

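Points to note from the data description above, summarized again
- binary variables
- categorical variables whose category values are integers
- other variables with int or float values
- -1 representing missing values
- the target variable and an ID variable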

# Check the number of rows and columns
train.shape

(595212, 59)

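# Drop duplicate rows, if any, and check the shape again
# (drop_duplicates returns a new DataFrame, so the result is assigned back)
train = train.drop_duplicates()
train.shape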

(595212, 59)

# The test data has one fewer column because the target variable is omitted. That is the part we have to predict.
test.shape

(892816, 58)

# Later we can create dummy variables for the 14 categorical variables. The `bin` variables are already binary, so they need no dummification.
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 595212 entries, 0 to 595211
Data columns (total 59 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   id              595212 non-null  int64  
 1   target          595212 non-null  int64  
 2   ps_ind_01       595212 non-null  int64  
 3   ps_ind_02_cat   595212 non-null  int64  
 4   ps_ind_03       595212 non-null  int64  
 5   ps_ind_04_cat   595212 non-null  int64  
 6   ps_ind_05_cat   595212 non-null  int64  
 7   ps_ind_06_bin   595212 non-null  int64  
 8   ps_ind_07_bin   595212 non-null  int64  
 9   ps_ind_08_bin   595212 non-null  int64  
 10  ps_ind_09_bin   595212 non-null  int64  
 11  ps_ind_10_bin   595212 non-null  int64  
 12  ps_ind_11_bin   595212 non-null  int64  
 13  ps_ind_12_bin   595212 non-null  int64  
 14  ps_ind_13_bin   595212 non-null  int64  
 15  ps_ind_14       595212 non-null  int64  
 16  ps_ind_15       595212 non-null  int64  
 17  ps_ind_16_bin   595212 non-null  int64  
 18  ps_ind_17_bin   595212 non-null  int64  
 19  ps_ind_18_bin   595212 non-null  int64  
 20  ps_reg_01       595212 non-null  float64
 21  ps_reg_02       595212 non-null  float64
 22  ps_reg_03       595212 non-null  float64
 23  ps_car_01_cat   595212 non-null  int64  
 24  ps_car_02_cat   595212 non-null  int64  
 25  ps_car_03_cat   595212 non-null  int64  
 26  ps_car_04_cat   595212 non-null  int64  
 27  ps_car_05_cat   595212 non-null  int64  
 28  ps_car_06_cat   595212 non-null  int64  
 29  ps_car_07_cat   595212 non-null  int64  
 30  ps_car_08_cat   595212 non-null  int64  
 31  ps_car_09_cat   595212 non-null  int64  
 32  ps_car_10_cat   595212 non-null  int64  
 33  ps_car_11_cat   595212 non-null  int64  
 34  ps_car_11       595212 non-null  int64  
 35  ps_car_12       595212 non-null  float64
 36  ps_car_13       595212 non-null  float64
 37  ps_car_14       595212 non-null  float64
 38  ps_car_15       595212 non-null  float64
 39  ps_calc_01      595212 non-null  float64
 40  ps_calc_02      595212 non-null  float64
 41  ps_calc_03      595212 non-null  float64
 42  ps_calc_04      595212 non-null  int64  
 43  ps_calc_05      595212 non-null  int64  
 44  ps_calc_06      595212 non-null  int64  
 45  ps_calc_07      595212 non-null  int64  
 46  ps_calc_08      595212 non-null  int64  
 47  ps_calc_09      595212 non-null  int64  
 48  ps_calc_10      595212 non-null  int64  
 49  ps_calc_11      595212 non-null  int64  
 50  ps_calc_12      595212 non-null  int64  
 51  ps_calc_13      595212 non-null  int64  
 52  ps_calc_14      595212 non-null  int64  
 53  ps_calc_15_bin  595212 non-null  int64  
 54  ps_calc_16_bin  595212 non-null  int64  
 55  ps_calc_17_bin  595212 non-null  int64  
 56  ps_calc_18_bin  595212 non-null  int64  
 57  ps_calc_19_bin  595212 non-null  int64  
 58  ps_calc_20_bin  595212 non-null  int64  
dtypes: float64(10), int64(49)
memory usage: 267.9 MB

2. Metadata

To make data management easier, we store meta-information about the features in a DataFrame.
This is useful when we want to select specific variables for analysis, visualization, modeling, and so on.
Concretely, we store:
- role : input, id, target
- level : nominal, interval, ordinal, binary
- keep : True or False
- dtype : int, float, str

data = []
for f in train.columns:
    # Define the role
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'
        
    # Define the level    
    if 'bin' in f or f == 'target':
        level = 'binary' # binary variable
    elif 'cat' in f or f == 'id':
        level = 'nominal' # nominal scale: categories with no order or numeric meaning
    elif train[f].dtype == 'float64':
        level = 'interval' # interval scale, e.g. temperature: 0 degrees does not mean "no heat". cf. ratio scale, e.g. height or weight, where 0 means nothing at all
    elif train[f].dtype == 'int64':
        level = 'ordinal' # ordinal scale: ranks, orders, or scales where the number is meaningful

    # Exclude the id column
    keep = True
    if f == 'id':
        keep = False

    # Define the data type
    dtype = train[f].dtype

    # Build a dict holding the variable's metadata
    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep, 
        'dtype': dtype
    }
    level = ''
    data.append(f_dict)

meta = pd.DataFrame(data, columns=['varname', 'role', 'level','keep','dtype'])
meta.set_index('varname', inplace=True)

meta

Example: extract all nominal variables that have not been dropped

meta[(meta.level == 'nominal') & (meta.keep)].index

Index(['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat',
       'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat', 'ps_car_05_cat',
       'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat',
       'ps_car_10_cat', 'ps_car_11_cat'],
      dtype='object', name='varname')
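Selections like the one above tend to repeat, so they can be wrapped in a small helper. This is a minimal sketch on a hypothetical miniature of the meta table; the function name `select_vars` is an assumption for illustration and does not appear in the original kernel.

```python
import pandas as pd

# Hypothetical miniature of the meta table built above.
meta_toy = pd.DataFrame(
    {
        "role": ["id", "target", "input", "input", "input"],
        "level": ["nominal", "binary", "nominal", "interval", "binary"],
        "keep": [False, True, True, True, True],
    },
    index=pd.Index(
        ["id", "target", "ps_car_01_cat", "ps_reg_03", "ps_ind_06_bin"],
        name="varname",
    ),
)

def select_vars(meta, level, role="input"):
    """Return the names of kept variables with the given level and role."""
    mask = (meta["level"] == level) & (meta["role"] == role) & meta["keep"]
    return meta[mask].index.tolist()

print(select_vars(meta_toy, "nominal"))  # id is excluded because keep is False
print(select_vars(meta_toy, "binary"))   # target is excluded because its role is not input
```

Filtering on `role` as well as `level` keeps the target out of feature lists, which the plain `level`/`keep` filter does not do for binary variables.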

# Number of variables per role and level

pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()

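3. Descriptive Statistics

We can apply the describe method to the DataFrame.
Computing the mean, std, ... of categorical variables and the id variable makes little sense, so we explore those visually later.
Thanks to the meta table, we can easily select the variables for which to compute descriptive statistics.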

# Interval variables
v = meta[(meta.level == 'interval') & (meta.keep)].index # interval: float columns; keep: excludes the id column
train[v].describe()

# Ordinal variables
v = meta[(meta.level == 'ordinal') & (meta.keep)].index 
train[v].describe()

# Binary variable
v = meta[(meta.level == 'binary') & (meta.keep)].index 
train[v].describe()

The a priori share of target = 1 in the train data is 3.645%, which is strongly imbalanced.
From the means we can conclude that most of these variables are 0 in most cases.
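Because the target is coded 0/1, its mean is the share of positive records. A minimal sketch, assuming a hypothetical toy target series (the counts here are made up and are not the competition data):

```python
import pandas as pd

# Hypothetical target column: 2 claims out of 50 policies.
target = pd.Series([1] * 2 + [0] * 48, name="target")

prior = target.mean()                         # share of target == 1
shares = target.value_counts(normalize=True)  # share of each class

print(f"a priori probability of a claim: {prior:.2%}")
print(shares)
```

On the real data the same two lines applied to `train.target` should give the 3.645% figure quoted above.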

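4. Handling imbalanced classes

As mentioned above, the proportion of records with target=1 is far smaller than that with target=0. This can lead to a model that has high accuracy but adds no value in practice. Two possible strategies for dealing with this problem are:

- oversampling records with target=1
- undersampling records with target=0

There are of course many more strategies, and MachineLearningMastery.com gives a nice overview. Since we have a rather large training set, we can go for undersampling.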

desired_apriori = 0.10

# Get the indices per target value
idx_0 = train[train.target == 0].index
idx_1 = train[train.target == 1].index

# Get the original number of records per target value
nb_0 = len(train.loc[idx_0])
nb_1 = len(train.loc[idx_1])

# Compute the undersampling rate and the resulting number of records with target=0
undersampling_rate = ((1-desired_apriori)*nb_1)/(nb_0*desired_apriori)
undersampled_nb_0 = int(undersampling_rate*nb_0)
print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))


# Randomly select records with target=0 to get the desired a priori
undersampled_idx = shuffle(idx_0, random_state = 37, n_samples=undersampled_nb_0)

# Build a list with the remaining indices
idx_list = list(undersampled_idx) + list(idx_1)

# Return the undersampled data frame
train = train.loc[idx_list].reset_index(drop=True)

Rate to undersample records with target=0: 0.34043569687437886
Number of records with target=0 after undersampling: 195246
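The rate follows from requiring nb_1 / (nb_1 + rate * nb_0) = desired_apriori and solving for rate. The sketch below redoes that arithmetic in isolation. The class counts are an assumption: they are inferred from the printed output above (195246 = 9 * 21694) and the 595212 rows of the train set, not read from the data here, and the function name `undersampling_plan` is made up for illustration.

```python
def undersampling_plan(nb_0, nb_1, desired_apriori):
    """Return (rate, kept_nb_0) so that target=1 makes up desired_apriori."""
    rate = ((1 - desired_apriori) * nb_1) / (nb_0 * desired_apriori)
    kept_nb_0 = int(rate * nb_0)
    return rate, kept_nb_0

# Assumed class counts, inferred from the output above (not read from the data).
nb_1 = 21694
nb_0 = 595212 - nb_1

rate, kept_nb_0 = undersampling_plan(nb_0, nb_1, desired_apriori=0.10)
new_prior = nb_1 / (nb_1 + kept_nb_0)

print(rate, kept_nb_0, new_prior)
```

Under these assumed counts the rate should agree with the value printed above, and the share of target=1 after undersampling lands at roughly 10%.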

5. Data quality checks

Checking missing values
- Missing values are represented as -1.

vars_with_missing=[]

for f in train.columns:
    missings = train[train[f] == -1][f].count()
    if missings > 0:
        vars_with_missing.append(f)
        missings_perc = missings/train.shape[0]
        
        print('Variable {} has {} records ({:.2%}) with missing values'.format(f, missings, missings_perc))
print('\n In total, there are  {} variables with missing values'.format(len(vars_with_missing)))

Variable ps_ind_02_cat has 103 records (0.05%) with missing values
Variable ps_ind_04_cat has 51 records (0.02%) with missing values
Variable ps_ind_05_cat has 2256 records (1.04%) with missing values
Variable ps_reg_03 has 38580 records (17.78%) with missing values
Variable ps_car_01_cat has 62 records (0.03%) with missing values
Variable ps_car_02_cat has 2 records (0.00%) with missing values
Variable ps_car_03_cat has 148367 records (68.39%) with missing values
Variable ps_car_05_cat has 96026 records (44.26%) with missing values
Variable ps_car_07_cat has 4431 records (2.04%) with missing values
Variable ps_car_09_cat has 230 records (0.11%) with missing values
Variable ps_car_11 has 1 records (0.00%) with missing values
Variable ps_car_14 has 15726 records (7.25%) with missing values

 In total, there are  12 variables with missing values
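The same report can be produced without an explicit loop. A minimal sketch on a hypothetical toy frame (the values are made up; only the -1 convention comes from the dataset):

```python
import pandas as pd

# Hypothetical toy data; -1 marks a missing value as in the competition files.
toy = pd.DataFrame({
    "ps_ind_02_cat": [1, -1, 2, 1],
    "ps_reg_03": [0.5, -1.0, -1.0, 0.9],
    "ps_car_15": [3.0, 2.8, 3.6, 0.0],
})

missing = (toy == -1).sum()      # count of -1 per column
missing = missing[missing > 0]   # keep only columns that have any
report = pd.DataFrame({
    "missing": missing,
    "share": missing / len(toy),
})

print(report)
```

Comparing the whole frame against -1 at once is usually faster on wide data than filtering column by column, though both give the same counts.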

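ps_car_03_cat (68.39%) and ps_car_05_cat (44.26%) have a high proportion of missing values. They cannot be imputed reliably, so we remove these variables.
For the other categorical variables with missing values, we mostly leave -1 as it is; the code below imputes only ps_car_02_cat, which has just 2 missing records, with the mode.
Continuous variables with missing values (ps_reg_03, ps_car_14) are imputed with the mean, and the ordinal variable (ps_car_11) with the mode.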
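The key point of this strategy is that the -1 placeholders must not take part in the mean or mode themselves. The following is a minimal sketch of the same idea in plain pandas on hypothetical toy columns; it is not the SimpleImputer call used in this post, and the helper name `impute` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns; -1 marks a missing value.
toy = pd.DataFrame({
    "ps_reg_03": [0.5, -1.0, 0.7, 0.9],  # continuous -> mean
    "ps_car_11": [3, 2, -1, 3],          # ordinal    -> mode
})

def impute(series, strategy):
    """Replace -1 with the mean or mode of the observed (non -1) values."""
    observed = series.replace(-1, np.nan)
    if strategy == "mean":
        fill = observed.mean()
    elif strategy == "most_frequent":
        fill = observed.mode().iloc[0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return observed.fillna(fill)

toy["ps_reg_03"] = impute(toy["ps_reg_03"], "mean")
toy["ps_car_11"] = impute(toy["ps_car_11"], "most_frequent")

print(toy)
```

Converting -1 to NaN first means the fill value is computed from the observed values only, which is also what `missing_values=-1` asks SimpleImputer to do.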

# Drop the two variables with too many missing values
vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
train.drop(vars_to_drop, inplace=True, axis=1)
meta.loc[(vars_to_drop), 'keep'] = False

mean_imp = SimpleImputer(missing_values=-1, strategy='mean')
mode_imp = SimpleImputer(missing_values=-1, strategy='most_frequent')
train['ps_reg_03'] = mean_imp.fit_transform(train[['ps_reg_03']]).ravel()
train['ps_car_11'] = mode_imp.fit_transform(train[['ps_car_11']]).ravel()  # ordinal, so use the mode
train['ps_car_14'] = mean_imp.fit_transform(train[['ps_car_14']]).ravel()
train['ps_car_02_cat'] = mode_imp.fit_transform(train[['ps_car_02_cat']]).ravel()

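Checking the cardinality of the categorical variables
- Cardinality refers to the number of distinct values in a variable.
- Since we will create dummy variables from the categorical variables later, we need to check whether any variable has many distinct values.
- Such variables should be handled differently, because they would produce a large number of dummy variables.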
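To see why cardinality matters, note that one-hot encoding creates one column per distinct value. A minimal sketch on hypothetical toy columns (the values are made up; only the column names mirror the dataset):

```python
import pandas as pd

# Hypothetical nominal columns with low and higher cardinality.
toy = pd.DataFrame({
    "ps_car_08_cat": [0, 1, 1, 0, 1, 0],
    "ps_car_11_cat": [104, 7, 55, 23, 88, 7],
})

cardinality = toy.nunique()  # number of distinct values per column
dummies = pd.get_dummies(toy, columns=list(toy.columns))

print(cardinality)
print(dummies.shape[1], "dummy columns in total")
```

The number of dummy columns equals the sum of the cardinalities, so a single variable with around a hundred levels would add around a hundred sparse columns.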

v = meta[(meta.level == 'nominal') & (meta.keep)].index
for f in v:
    dist_values = train[f].value_counts().shape[0]
    print('Variable {} has {} distinct values'.format(f, dist_values))

Variable ps_ind_02_cat has 5 distinct values
Variable ps_ind_04_cat has 3 distinct values
Variable ps_ind_05_cat has 8 distinct values
Variable ps_car_01_cat has 13 distinct values
Variable ps_car_02_cat has 2 distinct values
Variable ps_car_04_cat has 10 distinct values
Variable ps_car_06_cat has 18 distinct values
Variable ps_car_07_cat has 3 distinct values
Variable ps_car_08_cat has 2 distinct values
Variable ps_car_09_cat has 6 distinct values
Variable ps_car_10_cat has 3 distinct values
Variable ps_car_11_cat has 104 distinct values

ps_car_11_cat has 104 distinct values, which is a lot, but still reasonable.

# The code below target-encodes a high-cardinality variable such as ps_car_11_cat, producing a new numeric feature from it.
# Script by https://www.kaggle.com/ogrellier
# Code: https://www.kaggle.com/ogrellier/python-target-encoding-for-categorical-features
def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None, 
                  tst_series=None, 
                  target=None, 
                  min_samples_leaf=1, 
                  smoothing=1,
                  noise_level=0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior  
    """ 
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)
    # Compute target mean 
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    # Compute smoothing
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # Apply average function to all target data
    prior = target.mean()
    # The bigger the count the less full_avg is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # Apply averages to trn and tst series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_trn_series.index = trn_series.index 
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)

train_encoded, test_encoded = target_encode(train['ps_car_11_cat'], 
                                            test['ps_car_11_cat'],
                                            target=train.target,
                                            min_samples_leaf=100,
                                            smoothing=10,
                                            noise_level=0.01
                                           )

train['ps_car_11_cat_te'] = train_encoded
train.drop('ps_car_11_cat', axis=1, inplace=True)
meta.loc['ps_car_11_cat_te', 'keep'] = False # also record the encoded column in meta (this adds a new row with only keep set)
test['ps_car_11_cat_te'] = test_encoded
test.drop('ps_car_11_cat', axis=1, inplace=True)

# The column was removed from the data, so remove it from meta as well
meta.drop(['ps_car_11_cat'], inplace=True)
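The smoothing inside target_encode blends each category mean with the overall prior, using a sigmoid of the category size as the weight. The sketch below isolates that formula; the prior, category mean, and category sizes are hypothetical numbers chosen for illustration, and the two helper names are not part of the original script.

```python
import numpy as np

def smoothing_weight(count, min_samples_leaf, smoothing):
    """Sigmoid weight given to the category mean (the rest goes to the prior)."""
    return 1 / (1 + np.exp(-(count - min_samples_leaf) / smoothing))

def smoothed_mean(cat_mean, count, prior, min_samples_leaf=100, smoothing=10):
    """Blend of prior and category mean, as in target_encode above."""
    w = smoothing_weight(count, min_samples_leaf, smoothing)
    return prior * (1 - w) + cat_mean * w

prior = 0.10                  # hypothetical overall target mean
for count in (10, 100, 200):  # hypothetical category sizes
    w = smoothing_weight(count, 100, 10)
    enc = smoothed_mean(cat_mean=0.30, count=count, prior=prior)
    print(count, round(float(w), 4), round(float(enc), 4))
```

With min_samples_leaf=100, a category with exactly 100 records gets weight 0.5, much smaller categories fall back almost entirely to the prior, and much larger ones keep almost their own mean.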

Output of train[v].describe() for the interval variables (section 3):
	ps_reg_01	ps_reg_02	ps_reg_03	ps_car_12	ps_car_13	ps_car_14	ps_car_15	ps_calc_01	ps_calc_02	ps_calc_03
count	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000
mean	0.610991	0.439184	0.551102	0.379945	0.813265	0.276256	3.065899	0.449756	0.449589	0.449849
std	0.287643	0.404264	0.793506	0.058327	0.224588	0.357154	0.731366	0.287198	0.286893	0.287153
min	0.000000	0.000000	-1.000000	-1.000000	0.250619	-1.000000	0.000000	0.000000	0.000000	0.000000
25%	0.400000	0.200000	0.525000	0.316228	0.670867	0.333167	2.828427	0.200000	0.200000	0.200000
50%	0.700000	0.300000	0.720677	0.374166	0.765811	0.368782	3.316625	0.500000	0.400000	0.500000
75%	0.900000	0.600000	1.000000	0.400000	0.906190	0.396485	3.605551	0.700000	0.700000	0.700000
max	0.900000	1.800000	4.037945	1.264911	3.720626	0.636396	3.741657	0.900000	0.900000	0.900000

Output of train[v].describe() for the ordinal variables (section 3):
	ps_ind_01	ps_ind_03	ps_ind_14	ps_ind_15	ps_car_11	ps_calc_04	ps_calc_05	ps_calc_06	ps_calc_07	ps_calc_08	ps_calc_09	ps_calc_10	ps_calc_11	ps_calc_12	ps_calc_13	ps_calc_14
count	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000
mean	1.900378	4.423318	0.012451	7.299922	2.346072	2.372081	1.885886	7.689445	3.005823	9.225904	2.339034	8.433590	5.441382	1.441918	2.872288	7.539026
std	1.983789	2.699902	0.127545	3.546042	0.832548	1.117219	1.134927	1.334312	1.414564	1.459672	1.246949	2.904597	2.332871	1.202963	1.694887	2.746652
min	0.000000	0.000000	0.000000	0.000000	-1.000000	0.000000	0.000000	0.000000	0.000000	2.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	2.000000	0.000000	5.000000	2.000000	2.000000	1.000000	7.000000	2.000000	8.000000	1.000000	6.000000	4.000000	1.000000	2.000000	6.000000
50%	1.000000	4.000000	0.000000	7.000000	3.000000	2.000000	2.000000	8.000000	3.000000	9.000000	2.000000	8.000000	5.000000	1.000000	3.000000	7.000000
75%	3.000000	6.000000	0.000000	10.000000	3.000000	3.000000	3.000000	9.000000	4.000000	10.000000	3.000000	10.000000	7.000000	2.000000	4.000000	9.000000
max	7.000000	11.000000	4.000000	13.000000	3.000000	5.000000	6.000000	10.000000	9.000000	12.000000	7.000000	25.000000	19.000000	10.000000	13.000000	23.000000

Output of train[v].describe() for the binary variables (section 3):
	target	ps_ind_06_bin	ps_ind_07_bin	ps_ind_08_bin	ps_ind_09_bin	ps_ind_10_bin	ps_ind_11_bin	ps_ind_12_bin	ps_ind_13_bin	ps_ind_16_bin	ps_ind_17_bin	ps_ind_18_bin	ps_calc_15_bin	ps_calc_16_bin	ps_calc_17_bin	ps_calc_18_bin	ps_calc_19_bin	ps_calc_20_bin
count	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000	595212.000000
mean	0.036448	0.393742	0.257033	0.163921	0.185304	0.000373	0.001692	0.009439	0.000948	0.660823	0.121081	0.153446	0.122427	0.627840	0.554182	0.287182	0.349024	0.153318
std	0.187401	0.488579	0.436998	0.370205	0.388544	0.019309	0.041097	0.096693	0.030768	0.473430	0.326222	0.360417	0.327779	0.483381	0.497056	0.452447	0.476662	0.360295
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	1.000000	1.000000	0.000000	0.000000	0.000000
75%	0.000000	1.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	1.000000	1.000000	1.000000	1.000000	0.000000
max	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000


춤추는 개발자

[kaggle] Porto Seguro safe prediction (Bert Carremans) (1)

Output of the count per role and level (section 2):
	role	level	count
0	id	nominal	1
1	input	binary	17
2	input	interval	10
3	input	nominal	14
4	input	ordinal	16
5	target	binary	1
