4. Correlation Matrix

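The correlation matrix shows which features heavily influence whether a specific transaction is fraudulent. To see which features have a strong positive or negative correlation with fraud, however, it is important to use the correct subsample. Without the subsample, the correlation matrix is distorted by the severe class imbalance of the original DataFrame.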

F, (ax1, ax2) = plt.subplots(2,1, figsize=(24,20))

corr = df.corr()
sns.heatmap(corr, cmap='coolwarm_r', annot_kws={'size':20}, ax=ax1)
ax1.set_title("Imbalanced Correlation Matrix \n (don't use for reference)", fontsize=14)

sub_sample_corr = new_df.corr()
sns.heatmap(sub_sample_corr, cmap='coolwarm_r', annot_kws={'size':20},ax=ax2 )
ax2.set_title('SubSample Correlation Matrix \n (use for reference)', fontsize=14)
plt.show()

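Correlation analysis for 'Class'

- Negative correlation: V17, V14, V12 and V10. The lower these values are, the more likely the transaction is fraudulent.
- Positive correlation: V2, V4, V11 and V19. The higher these values are, the more likely the transaction is fraudulent.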
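The reading of the heatmap above can be reproduced numerically by correlating each feature with the label. The following is a minimal sketch on synthetic data, not on the post's dataset: the feature names `low_when_fraud`, `high_when_fraud` and `noise` are hypothetical stand-ins for columns such as V14 or V4.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = np.repeat([0, 1], n // 2)  # balanced labels, like the subsample

# Hypothetical features: one drops for class 1, one rises, one is unrelated.
features = {
    "low_when_fraud": rng.normal(0, 1, n) - 3 * y,
    "high_when_fraud": rng.normal(0, 1, n) + 3 * y,
    "noise": rng.normal(0, 1, n),
}

# Pearson correlation of each feature with the binary label.
corr_with_class = {
    name: np.corrcoef(column, y)[0, 1] for name, column in features.items()
}

# Print from most negative to most positive.
for name, r in sorted(corr_with_class.items(), key=lambda item: item[1]):
    print(f"{name:>16}: {r:+.2f}")
```

On the balanced subsample itself, the same ranking should be obtainable with `new_df.corr()['Class'].sort_values()`.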

f, axes = plt.subplots(ncols=4, figsize = (20,4))
sns.boxplot(x="Class", y="V17", data=new_df, palette=colors, ax=axes[0])
axes[0].set_title('V17 vs Class Negative Correlation')

sns.boxplot(x="Class", y="V14", data=new_df, palette=colors, ax=axes[1])
axes[1].set_title('V14 vs Class Negative Correlation')

sns.boxplot(x="Class", y="V12", data=new_df, palette=colors, ax=axes[2])
axes[2].set_title('V12 vs Class Negative Correlation')

sns.boxplot(x="Class", y="V10", data=new_df, palette=colors, ax=axes[3])
axes[3].set_title('V10 vs Class Negative Correlation')

plt.show()

# Boxplots of the positively correlated features listed above
f, axes = plt.subplots(ncols=4, figsize = (20,4))
sns.boxplot(x="Class", y="V11", data=new_df, palette=colors, ax=axes[0])
axes[0].set_title('V11 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V4", data=new_df, palette=colors, ax=axes[1])
axes[1].set_title('V4 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V2", data=new_df, palette=colors, ax=axes[2])
axes[2].set_title('V2 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V19", data=new_df, palette=colors, ax=axes[3])
axes[3].set_title('V19 vs Class Positive Correlation')

plt.show()

4.1 Anomaly Detection

The main goal of anomaly detection here is to remove "extreme outliers" from the features that are highly correlated with the class. This has a positive effect on model accuracy.

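Interquartile range method
- Interquartile range (IQR): computed as the difference between the 75th and 25th percentiles. The goal is to set thresholds beyond the 75th and 25th percentiles; any instance that falls outside these thresholds is deleted.
- Boxplots: besides making the 25th and 75th percentiles (both ends of the box) easy to see, they also make outliers (points beyond the lower and upper extremes) easy to spot.
Outlier removal tradeoff: take care over how far out the removal threshold is placed. The threshold is the interquartile range multiplied by a number (e.g. 1.5). The higher the multiplier (e.g. 3), the fewer outliers are detected; the lower it is, the more outliers are detected.
- The tradeoff: the lower the threshold, the more outliers are removed, but we want to focus on "extreme outliers" rather than on outliers in general. Removing too many points risks information loss, which lowers model accuracy. Try different thresholds and check how they affect the accuracy of the classification models.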
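The effect of the multiplier can be seen on a toy array. This is a small sketch only: the helper names `iqr_bounds` and `count_outliers` and the sample values are made up for illustration and are not part of the original notebook.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) cut-offs at q25 - k*IQR and q75 + k*IQR."""
    q25, q75 = np.percentile(values, [25, 75])
    cut_off = (q75 - q25) * k
    return q25 - cut_off, q75 + cut_off

def count_outliers(values, k):
    """Count the points that fall outside the IQR-based bounds."""
    values = np.asarray(values, dtype=float)
    lower, upper = iqr_bounds(values, k)
    return int(np.sum((values < lower) | (values > upper)))

# Toy data: the integers 1..20 plus one mild and one extreme high value.
sample = np.array(list(range(1, 21)) + [40, 80], dtype=float)

mild = count_outliers(sample, k=1.5)     # lower multiplier flags more points
extreme = count_outliers(sample, k=3.0)  # higher multiplier flags only extremes
print("k=1.5:", mild, "| k=3.0:", extreme)
```

With the larger multiplier only the most extreme point is flagged, which is the behaviour the tradeoff above describes.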

from scipy.stats import norm

f, (ax1, ax2, ax3) = plt.subplots(1,3,figsize = (20,6))

v14_fraud_dist = new_df['V14'].loc[new_df['Class'] ==1 ].values
sns.distplot(v14_fraud_dist, ax=ax1, fit=norm, color='#FB8861')
ax1.set_title('V14 Distribution \n (Fraud Transactions)', fontsize=14)

v12_fraud_dist = new_df['V12'].loc[new_df['Class'] ==1 ].values
sns.distplot(v12_fraud_dist, ax=ax2, fit=norm, color='#56F9BB')
ax2.set_title('V12 Distribution \n (Fraud Transactions)', fontsize=14)

v10_fraud_dist = new_df['V10'].loc[new_df['Class'] ==1 ].values
sns.distplot(v10_fraud_dist, ax=ax3, fit=norm, color='#C5B2F9')
ax3.set_title('V10 Distribution \n (Fraud Transactions)', fontsize=14)


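Summary
- Visualize distributions: start by visualizing the distribution of the features from which outliers will be removed. Of V14, V12 and V10, only V14 looks Gaussian.
- Determine the threshold: after choosing the number to multiply the IQR by (the lower it is, the more outliers are removed), compute the lower threshold as q25 - cut-off and the upper threshold as q75 + cut-off.
- Conditional drop: finally, build a conditional drop that removes any instance lying beyond the threshold on either side.
- Boxplot representation: use boxplots to confirm that the number of "extreme outliers" has been reduced considerably.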

# Remove V14 outliers (the feature with the strongest negative correlation with Class)

v14_fraud = new_df['V14'].loc[new_df['Class'] == 1].values
q25, q75 = np.percentile(v14_fraud, 25), np.percentile(v14_fraud, 75)
print('Quartile 25 : {} | Quartile 75: {}'.format(q25, q75))
v14_iqr = q75 - q25
print('iqr: {}'.format(v14_iqr))

v14_cut_off = v14_iqr * 1.5
v14_lower, v14_upper  = q25 - v14_cut_off, q75 + v14_cut_off
print('Cut off: {}'.format(v14_cut_off))
print('V14 Lower: {}'.format(v14_lower))

outliers = [x for x in v14_fraud if x < v14_lower or x> v14_upper]
print('Feature V14 Outliers for Fraud Cases: {}'.format(len(outliers)))

print('V14 outliers:{}'.format(outliers))

new_df = new_df.drop(new_df[(new_df['V14'] > v14_upper) | (new_df['V14'] < v14_lower)].index)
print('----'*44)

Quartile 25 : -9.692722964972385 | Quartile 75: -4.282820849486866
iqr: 5.409902115485519
Cut off: 8.114853173228278
V14 Lower: -17.807576138200663
Feature V14 Outliers for Fraud Cases: 4
V14 outliers:[-18.8220867423816, -19.2143254902614, -18.049997689859396, -18.4937733551053]
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Remove V12 outliers

V12_fraud = new_df['V12'].loc[new_df['Class'] == 1].values
q25, q75 = np.percentile(V12_fraud, 25), np.percentile(V12_fraud, 75)
print('Quartile 25 : {} | Quartile 75: {}'.format(q25, q75))
V12_iqr = q75 - q25
print('iqr: {}'.format(V12_iqr))

V12_cut_off = V12_iqr * 1.5
V12_lower, V12_upper  = q25 - V12_cut_off, q75 + V12_cut_off
print('Cut off: {}'.format(V12_cut_off))
print('V12 Lower: {}'.format(V12_lower))

outliers = [x for x in V12_fraud if x < V12_lower or x> V12_upper]
print('Feature V12 Outliers for Fraud Cases: {}'.format(len(outliers)))

print('V12 outliers:{}'.format(outliers))

new_df = new_df.drop(new_df[(new_df['V12'] > V12_upper) | (new_df['V12'] < V12_lower)].index)
print('Number of Instances after outliers removal: {}'.format(len(new_df)))
print('----'*44)

Quartile 25 : -8.67303320439115 | Quartile 75: -2.893030568676315
iqr: 5.780002635714835
Cut off: 8.670003953572252
V12 Lower: -17.3430371579634
Feature V12 Outliers for Fraud Cases: 4
V12 outliers:[-18.683714633344298, -18.047596570821604, -18.553697009645802, -18.4311310279993]
Number of Instances after outliers removal: 975
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Remove V10 outliers

V10_fraud = new_df['V10'].loc[new_df['Class'] == 1].values
q25, q75 = np.percentile(V10_fraud, 25), np.percentile(V10_fraud, 75)
print('Quartile 25 : {} | Quartile 75: {}'.format(q25, q75))
V10_iqr = q75 - q25
print('iqr: {}'.format(V10_iqr))

V10_cut_off = V10_iqr * 1.5
V10_lower, V10_upper  = q25 - V10_cut_off, q75 + V10_cut_off
print('Cut off: {}'.format(V10_cut_off))
print('V10 Lower: {}'.format(V10_lower))

outliers = [x for x in V10_fraud if x < V10_lower or x> V10_upper]
print('Feature V10 Outliers for Fraud Cases: {}'.format(len(outliers)))

print('V10 outliers:{}'.format(outliers))

new_df = new_df.drop(new_df[(new_df['V10'] > V10_upper) | (new_df['V10'] < V10_lower)].index)
print('Number of Instances after outliers removal: {}'.format(len(new_df)))
print('----'*44)

Quartile 25 : -7.466658535821848 | Quartile 75: -2.5118611381562523
iqr: 4.954797397665596
Cut off: 7.4321960964983935
V10 Lower: -14.89885463232024
Feature V10 Outliers for Fraud Cases: 27
V10 outliers:[-14.9246547735487, -16.7460441053944, -24.403184969972802, -15.1237521803455, -15.346098846877501, -22.1870885620007, -15.2399619587112, -16.6496281595399, -15.563791338730098, -14.9246547735487, -24.5882624372475, -16.3035376590131, -19.836148851696, -22.1870885620007, -22.1870885620007, -16.6011969664137, -17.141513641289198, -15.2399619587112, -18.9132433348732, -18.2711681738888, -15.124162814494698, -23.2282548357516, -20.949191554361104, -22.1870885620007, -16.2556117491401, -15.2318333653018, -15.563791338730098]
Number of Instances after outliers removal: 945
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Note: after outlier reduction the accuracy improved by more than 3%. Some outliers can distort model accuracy, but we must also avoid excessive information loss, otherwise the model risks underfitting.

f, (ax1, ax2, ax3) = plt.subplots(1,3,figsize=(20,6))
colors = ['#B3F9C5', '#f9c5b3']

#Boxplots with outliers removed
# V14 feature
sns.boxplot(x="Class", y="V14", data=new_df, ax=ax1, palette=colors)
ax1.set_title("V14 Feature \n Reduction of outliers", fontsize=14)
ax1.annotate('Fewer extreme \n outliers ', xy=(0.98, -17.5), xytext=(0, -12),
            arrowprops=dict(facecolor='black'), fontsize=14)

# V12 feature
sns.boxplot(x="Class", y="V12", data=new_df, ax=ax2, palette=colors)
ax2.set_title("V12 Feature \n Reduction of outliers", fontsize=14)
ax2.annotate('Fewer extreme \n outliers ', xy=(0.98, -17.3), xytext=(0, -12),
            arrowprops=dict(facecolor='black'), fontsize=14)

# V10 feature
sns.boxplot(x="Class", y="V10", data=new_df, ax=ax3, palette=colors)
ax3.set_title("V10 Feature \n Reduction of outliers", fontsize=14)
ax3.annotate('Fewer extreme \n outliers ', xy=(0.95, -16.5), xytext=(0, -12),
            arrowprops=dict(facecolor='black'), fontsize=14)

plt.show()

4.2 Dimensionality Reduction and Clustering(t-SNE)

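t-SNE first places each data point at random in two dimensions, then pulls points that are close in the original feature space closer together and pushes distant points farther apart. In other words, it tries to preserve information about neighbouring data points.
To understand the t-SNE algorithm, you need to understand the following terms:
- Euclidean distance
- Conditional probability
- Normal and t-distribution plots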
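The three ingredients listed above fit together roughly as in the sketch below. It is a simplification under stated assumptions: a single fixed `sigma` is used for every point, whereas the real algorithm tunes a bandwidth per point from the perplexity setting, symmetrises the probabilities, and then moves the 2-D points to minimise the KL divergence between the two sets of similarities. The function names are illustrative only.

```python
import numpy as np

def squared_distances(points):
    """Pairwise squared Euclidean distances between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def conditional_probabilities(X, sigma=1.0):
    """p(j|i): Gaussian similarity in the original space; each row sums to 1."""
    logits = -squared_distances(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)  # a point is not its own neighbour
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def student_t_similarities(Y):
    """q(i,j): Student-t (1 degree of freedom) similarity in the 2-D map."""
    weights = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(weights, 0.0)
    return weights / weights.sum()

# Two nearby points and one distant point in the original 3-D space.
X = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 5.0, 5.0]])
P = conditional_probabilities(X)

# A candidate 2-D layout that keeps the same neighbourhood structure.
Y = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 4.0]])
Q = student_t_similarities(Y)

print(np.round(P, 3))
print(np.round(Q, 3))
```

A layout is considered good when neighbours with a high `P` value also have a high `Q` value, which is the sense in which t-SNE preserves neighbourhood information.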

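Summary
- The t-SNE algorithm clusters the fraud and non-fraud cases in the dataset fairly accurately.
- Although the subsample is quite small, t-SNE detects the clusters fairly accurately in every scenario (the dataset is shuffled before running t-SNE).
- This suggests that further predictive models will perform reasonably well at separating fraud cases from non-fraud cases.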

# New_df is from the random undersample data (fewer instances)
X = new_df.drop('Class', axis=1)
y = new_df['Class']

t0 = time.time()
X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("T-SNE took {:.2} s".format(t1-t0))

# PCA Implementation
t0 = time.time()
X_reduced_pca = PCA(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("PCA took {:.2} s".format(t1-t0))

# Truncated SVD Implementation
t0 = time.time()
X_reduced_svd = TruncatedSVD(n_components=2, algorithm='randomized', random_state=42).fit_transform(X.values)
t1 = time.time()
print("Truncated SVD took {:.2} s".format(t1-t0))

T-SNE took 5.9 s
PCA took 0.026 s
Truncated SVD took 0.005 s

f, (ax1, ax2, ax3 ) = plt.subplots(1,3, figsize=(24,6))
f.suptitle('Clustering using Dimensionality Reduction', fontsize = 14)

blue_patch = mpatches.Patch(color='#0A0AFF', label='No Fraud')
red_patch = mpatches.Patch(color='#AF0000', label='Fraud')
color_num = 2

# t-SNE scatter plot
ax1.scatter(X_reduced_tsne[:,0], X_reduced_tsne[:,1], c=(y==0), cmap= plt.cm.coolwarm, label='No Fraud', linewidths=2)
ax1.scatter(X_reduced_tsne[:,0], X_reduced_tsne[:,1], c=(y==1), cmap= plt.cm.coolwarm, label='Fraud', linewidths=2)
ax1.set_title('t-SNE', fontsize=14)

ax1.grid(True)
ax1.legend(handles =[blue_patch, red_patch])

# PCA scatter plot
ax2.scatter(X_reduced_pca[:,0], X_reduced_pca[:,1], c=(y==0), cmap= plt.cm.coolwarm, label='No Fraud', linewidths=2)
ax2.scatter(X_reduced_pca[:,0], X_reduced_pca[:,1], c=(y==1), cmap=plt.cm.coolwarm, label='Fraud', linewidths=2)
ax2.set_title('PCA', fontsize=14)

ax2.grid(True)
ax2.legend(handles =[blue_patch, red_patch])

# TruncatedSVD scatter plot
ax3.scatter(X_reduced_svd[:,0], X_reduced_svd[:,1], c=(y==0), cmap= plt.cm.coolwarm, label='No Fraud', linewidths=2)
ax3.scatter(X_reduced_svd[:,0], X_reduced_svd[:,1], c=(y==1), cmap=plt.cm.coolwarm, label='Fraud', linewidths=2)
ax3.set_title('Truncated SVD', fontsize=14)

ax3.grid(True)
ax3.legend(handles =[blue_patch, red_patch])

plt.show()

4.3 Classifiers

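Classifiers (undersampling): in this section we train four types of classifiers and decide which one is more effective at detecting fraud transactions. Before that, we have to split the data into training and test sets and separate the features from the labels.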

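Summary
- Logistic regression: more accurate than the other three classifiers in most cases. (Logistic regression is analysed further below.)
- GridSearchCV: used to determine the parameters that give the best predictive score for each classifier.
- Logistic regression has the best ROC (receiver operating characteristic) score, meaning it separates fraud and non-fraud transactions fairly accurately.
Notes
- The wider the gap between the training score and the cross-validation score, the more likely the model is overfitting (high variance).
- If both the training score and the cross-validation score are low, the model is likely underfitting (high bias).
- The logistic regression classifier shows the best score on both the training and cross-validation sets.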
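The two rules of thumb in the notes can be written as a small check. This is only an illustrative heuristic: the function `diagnose_fit` and its thresholds `max_gap` and `min_score` are arbitrary assumptions chosen for the example, not values taken from the notebook.

```python
def diagnose_fit(train_score, cv_score, max_gap=0.05, min_score=0.80):
    """Rough reading of a training score next to a cross-validation score.

    max_gap and min_score are arbitrary illustration values.
    """
    gap = train_score - cv_score
    if gap > max_gap:
        # Training score far above the CV score: high variance.
        return "possible overfitting (high variance)"
    if train_score < min_score and cv_score < min_score:
        # Both scores low: high bias.
        return "possible underfitting (high bias)"
    return "no obvious problem"

print(diagnose_fit(0.99, 0.85))
print(diagnose_fit(0.70, 0.68))
print(diagnose_fit(0.95, 0.94))
```

In practice the comparison is made along the whole learning curve rather than at one point, so a check like this is at most a first look.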

# Undersampling before cross validating (prone to overfit)
X = new_df.drop('Class', axis = 1)
y = new_df['Class']

# Split the undersampled data into training and test sets.
from sklearn.model_selection import train_test_split

# This split is used for the undersampling case.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=42)

# Turn the values into arrays to feed the classification algorithms.
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

# Now run the classifiers with their default settings.
classifiers = {
    "LogisticRegression": LogisticRegression(),
    "KNearest": KNeighborsClassifier(),
    "Support Vector Classifier": SVC(),
    "DecisionTreeClassifier": DecisionTreeClassifier()
}

# The scores will probably stay high even with cross validation applied.
from sklearn.model_selection import cross_val_score

# Iterate over a separately named dict so the loop variable does not shadow it.
for key, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    training_score = cross_val_score(classifier, X_train, y_train, cv=5)
    print("Classifiers: ", classifier.__class__.__name__, "Has a training score of ", round(training_score.mean(), 2)*100, "% accuracy score")

Classifiers:  LogisticRegression Has a training score of  95.0 % accuracy score
Classifiers:  KNeighborsClassifier Has a training score of  93.0 % accuracy score
Classifiers:  SVC Has a training score of  93.0 % accuracy score
Classifiers:  DecisionTreeClassifier Has a training score of  90.0 % accuracy score

# Use GridSearchCV to find the best parameters.

from sklearn.model_selection import GridSearchCV

log_reg_params = {
    "penalty":['l1', 'l2'],
    "C" : [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

# liblinear supports both the 'l1' and 'l2' penalties in the search grid
grid_log_reg = GridSearchCV(LogisticRegression(solver='liblinear'), log_reg_params)
grid_log_reg.fit(X_train, y_train)

log_reg = grid_log_reg.best_estimator_

knears_params = {
    "n_neighbors" : list(range(2,5,1)),
    'algorithm' : ['auto', 'ball_tree', 'kd_tree','brute']
}

grid_knears = GridSearchCV(KNeighborsClassifier(), knears_params)
grid_knears.fit(X_train, y_train)

knears_neighbors = grid_knears.best_estimator_

svc_params = {
    'C': [0.5, 0.7, 0.9, 1],
    'kernel': ['rbf', 'poly', 'sigmoid', 'linear']
}

grid_svc= GridSearchCV(SVC(), svc_params)
grid_svc.fit(X_train, y_train)

svc = grid_svc.best_estimator_

tree_params = {
    "criterion":["gini", "entropy"],
    "max_depth":list(range(2,4,1)),
    "min_samples_leaf":list(range(5,7,1))
}

grid_tree = GridSearchCV(DecisionTreeClassifier(), tree_params)
grid_tree.fit(X_train, y_train)

tree_clf = grid_tree.best_estimator_

#Overfitting Case

log_reg_score = cross_val_score(log_reg, X_train, y_train, cv=5)
print('Logistic Regression Cross Validation Score: ', round(log_reg_score.mean() *100, 2).astype(str) + '%')

knears_score = cross_val_score(knears_neighbors, X_train, y_train, cv=5)
print('Knears Neighbors Cross Validation Score: ', round(knears_score.mean() *100, 2).astype(str) + '%')

svc_score = cross_val_score(svc, X_train, y_train, cv=5)
print('Support Vector Classifier Cross Validation Score: ', round(svc_score.mean() *100, 2).astype(str) + '%')

tree_score = cross_val_score(tree_clf, X_train, y_train, cv=5)
print('DecisionTree Classifier Cross Validation Score: ', round(tree_score.mean() *100, 2).astype(str) + '%')

Logistic Regression Cross Validation Score:  94.84%
Knears Neighbors Cross Validation Score:  93.25%
Support Vector Classifier Cross Validation Score:  94.58%
DecisionTree Classifier Cross Validation Score:  92.73%

# We will undersample during cross validation.
undersample_X = df.drop('Class', axis=1)
undersample_y = df['Class']

for train_index, test_index in sss.split (undersample_X, undersample_y):
    print("Train:", train_index, "Test: ", test_index)
    undersample_Xtrain, undersample_Xtest = undersample_X.iloc[train_index], undersample_X.iloc[test_index]
    undersample_ytrain, undersample_ytest = undersample_y.iloc[train_index], undersample_y.iloc[test_index]
    
undersample_Xtrain = undersample_Xtrain.values
undersample_Xtest = undersample_Xtest.values
undersample_ytrain = undersample_ytrain.values
undersample_ytest = undersample_ytest.values

Train: [ 56961  56962  56963 ... 284804 284805 284806] Test:  [    0     1     2 ... 56959 56960 57237]
Train: [     0      1      2 ... 284804 284805 284806] Test:  [ 56961  56962  56963 ... 114130 114304 114409]
Train: [     0      1      2 ... 284804 284805 284806] Test:  [113920 113921 113922 ... 170890 170891 170892]
Train: [     0      1      2 ... 284804 284805 284806] Test:  [168199 168523 168652 ... 227851 227852 227853]
Train: [     0      1      2 ... 227851 227852 227853] Test:  [224710 225343 225627 ... 284804 284805 284806]

undersample_accuracy = []
undersample_precision = []
undersample_recall = []
undersample_f1 = []
undersample_auc = []

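# Apply the NearMiss technique
# Label distribution after NearMiss (for inspection only)
# fit_resample replaces the older fit_sample, which newer imbalanced-learn releases no longer provide
X_nearmiss, y_nearmiss = NearMiss().fit_resample(undersample_X.values, undersample_y.values)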
print('NearMiss Label Distribution: {}'.format(Counter(y_nearmiss)))

NearMiss Label Distribution: Counter({0: 492, 1: 492})

# Cross validating the right way
for train, test in sss.split (undersample_Xtrain, undersample_ytrain):
    undersample_pipeline = imbalanced_make_pipeline(NearMiss(sampling_strategy='majority'), log_reg)
    undersample_model = undersample_pipeline.fit(undersample_Xtrain[train], undersample_ytrain[train])
    undersample_prediction = undersample_model.predict(undersample_Xtrain[test])
    
    undersample_accuracy.append(undersample_pipeline.score(original_Xtrain[test], original_ytrain[test]))
    undersample_precision.append(precision_score(original_ytrain[test], undersample_prediction))
    undersample_recall.append(recall_score(original_ytrain[test], undersample_prediction))
    undersample_f1.append(f1_score(original_ytrain[test], undersample_prediction))
    undersample_auc.append(roc_auc_score(original_ytrain[test], undersample_prediction))

# Now plot the learning curve of LogisticRegression and the other classifiers.

from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator1, estimator2, estimator3, estimator4, X, y, ylim = None, cv = None,
                       n_jobs =1, train_sizes = np.linspace(.1, 1.0, 5)):
    f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2,figsize=(20,14), sharey=True)
    if ylim is not None:
        plt.ylim(*ylim)
        
    train_sizes, train_scores, test_scores = learning_curve(
        estimator1, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    ax1.fill_between(train_sizes, train_scores_mean - train_scores_std,
                    train_scores_mean + train_scores_std, alpha = 0.1,
                    color="#ff9124")
    ax1.fill_between(train_sizes, test_scores_mean - test_scores_std,
                    test_scores_mean + test_scores_std, alpha = 0.1,
                    color="#2492ff")
    
    ax1.plot(train_sizes, train_scores_mean, 'o-', color="#ff9124", label = "Training score")
    ax1.plot(train_sizes, test_scores_mean, 'o-', color="#2492ff", label = "Cross-Validation score")
    ax1.set_title("Logistic Regression Learning Curve", fontsize = 14)
    ax1.set_xlabel('Training size (m)')
    ax1.set_ylabel('Score')
    ax1.grid(True)
    ax1.legend(loc="best")
    
    # Second Estimator
    train_sizes, train_scores, test_scores = learning_curve(
        estimator2, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    ax2.fill_between(train_sizes, train_scores_mean - train_scores_std,
                    train_scores_mean + train_scores_std, alpha = 0.1,
                    color="#ff9124")
    ax2.fill_between(train_sizes, test_scores_mean - test_scores_std,
                    test_scores_mean + test_scores_std, alpha = 0.1,
                    color="#2492ff")
    
    ax2.plot(train_sizes, train_scores_mean, 'o-', color="#ff9124", label = "Training score")
    ax2.plot(train_sizes, test_scores_mean, 'o-', color="#2492ff", label = "Cross-Validation score")
    ax2.set_title("Knears Neighbors Learning Curve", fontsize = 14)
    ax2.set_xlabel('Training size (m)')
    ax2.set_ylabel('Score')
    ax2.grid(True)
    ax2.legend(loc="best")
    
    #Third Estimator
    train_sizes, train_scores, test_scores = learning_curve(
        estimator3, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    ax3.fill_between(train_sizes, train_scores_mean - train_scores_std,
                    train_scores_mean + train_scores_std, alpha = 0.1,
                    color="#ff9124")
    ax3.fill_between(train_sizes, test_scores_mean - test_scores_std,
                    test_scores_mean + test_scores_std, alpha = 0.1,
                    color="#2492ff")
    
    ax3.plot(train_sizes, train_scores_mean, 'o-', color="#ff9124", label = "Training score")
    ax3.plot(train_sizes, test_scores_mean, 'o-', color="#2492ff", label = "Cross-Validation score")
    ax3.set_title("Support Vector Classifier \n Learning Curve", fontsize = 14)
    ax3.set_xlabel('Training size (m)')
    ax3.set_ylabel('Score')
    ax3.grid(True)
    ax3.legend(loc="best")
    
    #Fourth Estimator
    train_sizes, train_scores, test_scores = learning_curve(
        estimator4, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    ax4.fill_between(train_sizes, train_scores_mean - train_scores_std,
                    train_scores_mean + train_scores_std, alpha = 0.1,
                    color="#ff9124")
    ax4.fill_between(train_sizes, test_scores_mean - test_scores_std,
                    test_scores_mean + test_scores_std, alpha = 0.1,
                    color="#2492ff")
    
    ax4.plot(train_sizes, train_scores_mean, 'o-', color="#ff9124", label = "Training score")
    ax4.plot(train_sizes, test_scores_mean, 'o-', color="#2492ff", label = "Cross-Validation score")
    ax4.set_title("Decision Tree Classifier \n Learning Curve", fontsize = 14)
    ax4.set_xlabel('Training size (m)')
    ax4.set_ylabel('Score')
    ax4.grid(True)
    ax4.legend(loc="best")
    
    return plt

cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
plot_learning_curve(log_reg, knears_neighbors, svc, tree_clf, X_train, y_train, (0.87, 1.01), cv=cv, n_jobs=4)

As the training size grows, the training score and the cross-validation score converge.

Now let's look at the ROC curves.
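Before plotting, it may help to recall what the ROC AUC number summarises: the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case, with ties counted as half. The sketch below computes it by brute force on toy values; the function name `roc_auc` and the data are illustrative and this is not how scikit-learn implements `roc_auc_score`.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(y_true, scores) if label == 1]
    negatives = [s for label, s in zip(y_true, scores) if label == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order 0.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
continuous_scores = [0.1, 0.4, 0.35, 0.8]  # e.g. decision_function output
hard_predictions = [0, 1, 0, 1]            # e.g. predict output (0/1 only)

auc_scores = roc_auc(labels, continuous_scores)
auc_hard = roc_auc(labels, hard_predictions)
print(auc_scores, auc_hard)
```

Hard 0/1 predictions produce many ties, so the resulting curve has a single interior point. That is worth keeping in mind when comparing classifiers scored from `decision_function` with classifiers scored from plain predictions.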

from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict

log_reg_pred = cross_val_predict(log_reg, X_train, y_train, cv =5, method="decision_function")
knears_pred = cross_val_predict(knears_neighbors, X_train, y_train, cv =5)
svc_pred = cross_val_predict(svc, X_train, y_train, cv =5, method="decision_function")
tree_pred = cross_val_predict(tree_clf,  X_train, y_train, cv =5)

from sklearn.metrics import roc_auc_score

print('Logistic Regression: ', roc_auc_score(y_train, log_reg_pred))
print('KNears Neighbors: ', roc_auc_score(y_train, knears_pred))
print('Support Vector Classifier: ', roc_auc_score(y_train, svc_pred))
print('Decision Tree Classifier: ', roc_auc_score(y_train, tree_pred))

Logistic Regression:  0.9791635952626665
KNears Neighbors:  0.9291611381394663
Support Vector Classifier:  0.9770364286065303
Decision Tree Classifier:  0.9245908889871738

# Plot the ROC curves.

log_fpr, log_tpr, log_thresold = roc_curve(y_train, log_reg_pred)
knear_fpr, knear_tpr, knear_thresold = roc_curve(y_train, knears_pred)
svc_fpr, svc_tpr, svc_thresold = roc_curve(y_train, svc_pred)
tree_fpr, tree_tpr, tree_thresold = roc_curve(y_train, tree_pred)

def graph_roc_curve_multiple(log_fpr, log_tpr, knear_fpr, knear_tpr, svc_fpr, svc_tpr, tree_fpr, tree_tpr):
    plt.figure(figsize=(16,8))
    plt.title('ROC Curve \n Top 4 Classifiers', fontsize=18)
    plt.plot(log_fpr, log_tpr, label= 'Logistic Regression Classifier Score: {:.4f}'.format(roc_auc_score(y_train, log_reg_pred)))
    plt.plot(knear_fpr, knear_tpr, label= 'KNear Neighbors Classifier Score: {:.4f}'.format(roc_auc_score(y_train, knears_pred)))    
    plt.plot(svc_fpr, svc_tpr, label= 'Support Vector Classifier Score: {:.4f}'.format(roc_auc_score(y_train, svc_pred)))
    plt.plot(tree_fpr, tree_tpr, label= 'Decision Tree Classifier Score: {:.4f}'.format(roc_auc_score(y_train, tree_pred)))
    
    plt.plot([0, 1], [0,1], 'k--')
    plt.axis([-0.01, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.annotate('Minimum ROC Score of 50% \n (This is the minimum score to get)', xy=(0.5, 0.5), xytext=(0.6,0.3),
                arrowprops = dict(facecolor = '#6E726D', shrink=0.05),)
    plt.legend()

graph_roc_curve_multiple(log_fpr, log_tpr, knear_fpr, knear_tpr, svc_fpr, svc_tpr, tree_fpr, tree_tpr)
plt.show()

4.4 Logistic Regression

Terms
- True Positives: fraud transactions correctly classified as fraud
- False Positives: non-fraud transactions incorrectly classified as fraud
- True Negatives: non-fraud transactions correctly classified as non-fraud
- False Negatives: fraud transactions incorrectly classified as non-fraud
- Precision
  - True Positives / (True Positives + False Positives)
- Recall
  - True Positives / (True Positives + False Negatives)
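The definitions above can be checked by hand on a tiny example. This is a minimal sketch with made-up labels; the helper names `confusion_counts` and `precision_recall` are illustrative, and on real data `precision_score` and `recall_score` from scikit-learn do the same job.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) with `positive` as the fraud label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # fraud flagged as fraud
        elif pred == positive:
            fp += 1  # non-fraud flagged as fraud
        elif truth == positive:
            fn += 1  # fraud that was missed
        else:
            tn += 1  # non-fraud left alone
    return tp, fp, tn, fn

def precision_recall(y_true, y_pred):
    """Precision = tp / (tp + fp), recall = tp / (tp + fn)."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 3 fraud cases among 10 transactions; the model flags 4 transactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(y_true, y_pred)
print(counts, round(precision, 2), round(recall, 2))
```

Here two of the four flagged transactions are real fraud, and two of the three fraud cases are caught, so precision and recall differ even though both start from the same true positives.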

def logistic_roc_curve(log_fpr, log_tpr):
    plt.figure(figsize=(12,8))
    plt.title('Logistic Regression ROC Curve', fontsize = 16)
    plt.plot(log_fpr, log_tpr, 'b-', linewidth = 2)
    plt.plot([0,1], [0,1], 'r--')
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.axis([-0.01, 1, 0, 1])

Summary
- Precision starts to drop between 0.90 and 0.91. Even so, the precision score remains high, though the recall score is declining as well.

logistic_roc_curve(log_fpr, log_tpr)
plt.show()

from sklearn.metrics import precision_recall_curve

precision,recall, threshold = precision_recall_curve(y_train, log_reg_pred)

from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score

y_pred = log_reg.predict(X_train)

# Overfitting Case
print('---'*45)
print('Overfitting: \n')
print('Recall Score: {:.2f}'.format(recall_score(y_train, y_pred)))
print('Precision Score: {:.2f}'.format(precision_score(y_train, y_pred)))
print('F1 Score: {:.2f}'.format(f1_score(y_train, y_pred)))
print('Accuracy Score: {:.2f}'.format(accuracy_score(y_train, y_pred)))
print('---'*45)

# How it should look like
print('---'*45)
print('How it should be:\n')
print('Accuracy Score: {:.2f}'.format(np.mean(undersample_accuracy)))
print('Precision Score: {:.2f}'.format(np.mean(undersample_precision)))
print('Recall Score: {:.2f}'.format(np.mean(undersample_recall)))
print('F1 Score: {:.2f}'.format(np.mean(undersample_f1)))
print('---'*45)

---------------------------------------------------------------------------------------------------------------------------------------
Overfitting: 

Recall Score: 0.92
Precision Score: 0.82
F1 Score: 0.86
Accuracy Score: 0.86
---------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------
How it should be:

Accuracy Score: 0.76
Precision Score: 0.00
Recall Score: 0.21
F1 Score: 0.00
---------------------------------------------------------------------------------------------------------------------------------------

undersample_y_score = log_reg.decision_function(original_Xtest)

from sklearn.metrics import average_precision_score

undersample_average_precision = average_precision_score(original_ytest, undersample_y_score)

print('Average precision-recall score: {0:0.2f}'.format(undersample_average_precision))

Average precision-recall score: 0.06

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12,6))

precision, recall, _ = precision_recall_curve(original_ytest, undersample_y_score)

plt.step(recall, precision, color = '#004a93', alpha=0.2, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2, color='#48a6ff')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('UnderSampling precision-Recall curve: \n Average Precision-Recall Score = {0:0.2f}'.format(
            undersample_average_precision), fontsize=16)

Text(0.5, 1.0, 'UnderSampling precision-Recall curve: \n Average Precision-Recall Score = 0.06')

4.5 SMOTE and Oversampling
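As background for this section, SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates that idea with NumPy only; the function `smote_like_samples` and its parameters are hypothetical, and it is not the imbalanced-learn implementation used in the code that follows.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Interpolate between minority points and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise squared distances; a point is never its own neighbour.
    diff = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_minority))  # pick a random minority point
        j = neighbours[i, rng.integers(k)]  # and one of its k neighbours
        lam = rng.random()                  # position along the segment
        synthetic[row] = X_minority[i] + lam * (X_minority[j] - X_minority[i])
    return synthetic

# Four minority points on the corners of the unit square (toy data).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(minority, n_new=10, k=3)
print(new_points.round(2))
```

Because each new point lies on a segment between two existing minority points, the synthetic samples stay inside the region already occupied by the minority class. The sketch assumes `k` is smaller than the number of minority points.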

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, RandomizedSearchCV

print('Length of X (train): {} | Length of y (train): {}'.format(len(original_Xtrain), len(original_ytrain)))
print('Length of X (test): {} | Length of y (test): {}'.format(len(original_Xtest), len(original_ytest)))

Length of X (train): 227846 | Length of y (train): 227846
Length of X (test): 56961 | Length of y (test): 56961

accuracy_lst = []
precision_lst = []
recall_lst = []
f1_lst = []
auc_lst = []

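# Define the search space before building the search (note 0.1, not "0,1")
log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

log_reg_sm = LogisticRegression()
# liblinear supports both the 'l1' and 'l2' penalties
rand_log_reg = RandomizedSearchCV(LogisticRegression(solver='liblinear'), log_reg_params, n_iter=4)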

for train, test in sss.split(original_Xtrain, original_ytrain):
    pipeline = imbalanced_make_pipeline(SMOTE(sampling_strategy='minority'), rand_log_reg)
    model = pipeline.fit(original_Xtrain[train], original_ytrain[train])
    best_est = rand_log_reg.best_estimator_
    prediction = best_est.predict(original_Xtrain[test])
    
    accuracy_lst.append(pipeline.score(original_Xtrain[test], original_ytrain[test]))
    precision_lst.append(precision_score(original_ytrain[test], prediction))
    recall_lst.append(recall_score(original_ytrain[test], prediction))
    f1_lst.append(f1_score(original_ytrain[test], prediction))
    auc_lst.append(roc_auc_score(original_ytrain[test], prediction))
    
print('---'*45)    
print('')
print("accuracy: {}".format(np.mean(accuracy_lst)))
print("precision: {}".format(np.mean(precision_lst)))
print("recall: {}".format(np.mean(recall_lst)))
print("f1: {}".format(np.mean(f1_lst)))
print('---'*45)

---------------------------------------------------------------------------------------------------------------------------------------

accuracy: 0.941965851215518
precision: 0.06096599528804144
recall: 0.9137293086660175
f1: 0.11248255949118291
---------------------------------------------------------------------------------------------------------------------------------------

labels = ['No Fraud', 'Fraud']
smote_prediction = best_est.predict(original_Xtest)
print(classification_report(original_ytest, smote_prediction, target_names=labels))

              precision    recall  f1-score   support

    No Fraud       1.00      0.99      0.99     56863
       Fraud       0.11      0.86      0.20        98

    accuracy                           0.99     56961
   macro avg       0.56      0.92      0.60     56961
weighted avg       1.00      0.99      0.99     56961

y_score = best_est.decision_function(original_Xtest)

average_precision = average_precision_score(original_ytest, y_score)
print('Average precision-recall score: {0:0.2f}'.format(average_precision))

Average precision-recall score: 0.75

fig = plt.figure(figsize=(12,6))
precision, recall, _ = precision_recall_curve(original_ytest, y_score)

plt.step(recall, precision, color='r', alpha=0.2, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2, color='#F59B00')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])

plt.title('OverSampling Precision-Recall curve:\n Average Precision-Recall Score={0:0.2f}'.format(average_precision), fontsize=16)

Text(0.5, 1.0, 'OverSampling Precision-Recall curve:\n Average Precision-Recall Score=0.75')

# ratio parameter changes sampling_strategy
sm = SMOTE(sampling_strategy=0.6, random_state=42)

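# fit_resample replaces the older fit_sample, which newer imbalanced-learn releases no longer provide
Xsm_train, ysm_train = sm.fit_resample(original_Xtrain, original_ytrain)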

t0= time.time()
log_reg_sm = grid_log_reg.best_estimator_
log_reg_sm.fit(Xsm_train, ysm_train)
t1 = time.time()
print("Fitting oversampling data took: {} sec".format(t1-t0))

Fitting oversampling data took: 5.224553108215332 sec


춤추는 개발자

[kaggle][필사] Credit Card Fraud Detection (2)
