
[kaggle][transcription] Credit Card Fraud Detection (3)

bisi 2021. 2. 3. 17:10

 

This post covers identifying whether a credit card transaction is fraudulent or legitimate.

 

The goal is for the credit card company to recognize fraudulent transactions so that customers are not charged for items they did not purchase.

 

The dataset contains transactions that occurred over two days, with 492 frauds out of 284,807 transactions.

The dataset is highly imbalanced: the positive class (Fraud) accounts for only 0.172% of all transactions.

 

Due to confidentiality issues, the original feature values and additional background information are not provided.

The features are named V1 through V28 and are principal components produced by a PCA transformation.

The only features that have not been transformed are 'Time' and 'Amount'.

The target class is the response variable: 1 means fraud and 0 means a legitimate transaction.
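
For reference, here is a minimal sketch of loading the dataset and checking the class imbalance, assuming the standard creditcard.csv file from the Kaggle competition (the file path is an assumption, adjust it to your environment):

import pandas as pd

# Load the dataset; adjust the path to where creditcard.csv is stored
df = pd.read_csv('creditcard.csv')

# Class distribution: 0 = No Fraud (~99.83%), 1 = Fraud (~0.17%)
print(df['Class'].value_counts())
print(df['Class'].value_counts(normalize=True) * 100)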

 

The transcribed code is based on Janio Martinez Bachmann's notebook Credit Fraud || Dealing with Imbalanced Datasets.

 

The work is divided into five parts, in the following order:

 

1) Understanding the data

2) Data preprocessing

3) Random UnderSampling and OverSampling

4) Correlation matrices

5) Testing

 


 

 

 

 

5. Testing

5.1 Logistic Regression Test

 
  • Random undersampling: we will run the final performance evaluation of the classification models on the random undersampling subset. Keep in mind, however, that this is not data from the original dataframe.
  • Classification models: the best-performing models were logistic regression and the SVC (Support Vector Classifier). (A sketch of the assumed setup for the estimators used below follows.)
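
The fitted estimators and data splits referenced in the next cells (log_reg_sm, knears_neighbors, svc, tree_clf, X_test, y_test, original_Xtest) were created in the earlier parts of this series. The sketch below shows roughly what that setup might look like; the hyperparameters are left at their defaults as an assumption, not the notebook's actual settings, and original_Xtrain/original_ytrain are assumed names for the original (imbalanced) training split.

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

# X_train / y_train: the balanced (random-undersampled) training split from part 3
log_reg = LogisticRegression().fit(X_train, y_train)
knears_neighbors = KNeighborsClassifier().fit(X_train, y_train)
svc = SVC().fit(X_train, y_train)
tree_clf = DecisionTreeClassifier().fit(X_train, y_train)

# log_reg_sm: logistic regression with SMOTE applied inside an imblearn pipeline,
# fit on the original (imbalanced) training split
log_reg_sm = make_pipeline(SMOTE(sampling_strategy='minority'),
                           LogisticRegression()).fit(original_Xtrain, original_ytrain)
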
In [62]:
from sklearn.metrics import confusion_matrix

# Predict with the logistic regression that was fit using the SMOTE technique
y_pred_log_reg = log_reg_sm.predict(X_test)
In [63]:
y_pred_knear = knears_neighbors.predict(X_test)
y_pred_svc = svc.predict(X_test)
y_pred_tree = tree_clf.predict(X_test)
In [65]:
log_reg_cf = confusion_matrix(y_test, y_pred_log_reg)
kneighbors_cf = confusion_matrix(y_test, y_pred_knear)
svc_cf = confusion_matrix(y_test, y_pred_svc)
tree_cf = confusion_matrix(y_test, y_pred_tree)

fig, ax = plt.subplots(2,2,figsize=(22,12))

sns.heatmap(log_reg_cf, ax=ax[0][0], annot=True, cmap=plt.cm.copper)
ax[0][0].set_title("Logistic Regression \n Confusion Matrix", fontsize=14)
ax[0][0].set_xticklabels(['',''], fontsize=14, rotation=90)
ax[0][0].set_yticklabels(['',''], fontsize=14, rotation=360)

sns.heatmap(kneighbors_cf, ax=ax[0][1], annot=True, cmap=plt.cm.copper)
ax[0][1].set_title("KNearsNeighbors \n Confusion Matrix", fontsize=14)
ax[0][1].set_xticklabels(['',''], fontsize=14, rotation=90)
ax[0][1].set_yticklabels(['',''], fontsize=14, rotation=360)

sns.heatmap(svc_cf, ax=ax[1][0], annot=True, cmap=plt.cm.copper)
ax[1][0].set_title("Support Vector Classifier \n Confusion Matrix", fontsize=14)
ax[1][0].set_xticklabels(['',''], fontsize=14, rotation=90)
ax[1][0].set_yticklabels(['',''], fontsize=14, rotation=360)

sns.heatmap(tree_cf, ax=ax[1][1], annot=True, cmap=plt.cm.copper)
ax[1][1].set_title("DecisionTree Classifier \n Confusion Matrix", fontsize=14)
ax[1][1].set_xticklabels(['',''], fontsize=14, rotation=90)
ax[1][1].set_yticklabels(['',''], fontsize=14, rotation=360)
Out[65]:
[Text(0, 0.5, ''), Text(0, 1.5, '')]
 
In [66]:
from sklearn.metrics import classification_report

print('Logistic Regression:')
print(classification_report(y_test, y_pred_log_reg))

print('KNears Neighbors:')
print(classification_report(y_test, y_pred_knear))

print('Support Vector Classifier:')
print(classification_report(y_test, y_pred_svc))

print('Tree Classifier:')
print(classification_report(y_test, y_pred_tree))
 
Logistic Regression:
              precision    recall  f1-score   support

           0       0.89      0.98      0.93        89
           1       0.98      0.89      0.93       100

    accuracy                           0.93       189
   macro avg       0.93      0.93      0.93       189
weighted avg       0.94      0.93      0.93       189

KNears Neighbors:
              precision    recall  f1-score   support

           0       0.87      0.99      0.93        89
           1       0.99      0.87      0.93       100

    accuracy                           0.93       189
   macro avg       0.93      0.93      0.93       189
weighted avg       0.93      0.93      0.93       189

Support Vector Classifier:
              precision    recall  f1-score   support

           0       0.89      0.94      0.92        89
           1       0.95      0.90      0.92       100

    accuracy                           0.92       189
   macro avg       0.92      0.92      0.92       189
weighted avg       0.92      0.92      0.92       189

Tree Classifier:
              precision    recall  f1-score   support

           0       0.85      0.97      0.91        89
           1       0.97      0.85      0.90       100

    accuracy                           0.90       189
   macro avg       0.91      0.91      0.90       189
weighted avg       0.91      0.90      0.90       189

In [67]:
# Final score of logistic regression on the (undersampled) test set
from sklearn.metrics import accuracy_score

y_pred = log_reg.predict(X_test)
undersample_score = accuracy_score(y_test, y_pred)
In [71]:
# Logistic Regression with the SMOTE technique (evaluated on the original test set)

y_pred_sm = best_est.predict(original_Xtest)
oversample_score = accuracy_score(original_ytest, y_pred_sm)

d = {'Technique': ['Random UnderSampling', 'Oversampling (SMOTE)'], 'Score':[undersample_score, oversample_score]}
final_df = pd.DataFrame(data=d)


# Move Column
score = final_df['Score']
final_df.drop('Score', axis = 1, inplace=True)
final_df.insert(1, 'Score', score)

final_df
Out[71]:
              Technique     Score
0  Random UnderSampling  0.931217
1  Oversampling (SMOTE)  0.988273
 

Comparing the scores, oversampling with SMOTE produced the higher score. (Note that the undersampling score is measured on the small balanced test split, while the SMOTE score is measured on the original test set, so the two figures come from different test data.)

 

5.2 Neural Network Test

  • In this section we build a simple neural network to see which of the two Logistic Regression approaches, UnderSampling or OverSampling (SMOTE), is better at detecting Fraud and Non-Fraud transactions.
  • We will not focus only on the Fraud cases but also pay attention to Non-Fraud transactions, because a cardholder who has just bought something should not be told that their card was blocked by the bank's algorithm.
  • Neural network structure
    • one hidden layer with 32 nodes
    • 2 output nodes (0 or 1)
    • learning rate: 0.001
    • optimizer: AdamOptimizer
    • activation function: ReLU
    • the final output uses sparse categorical cross entropy
      • this yields the probability of Fraud vs. Non-Fraud.
 
  • Keras, RandomUnderSampling
In [72]:
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Activation
from keras.layers.core import Dense
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy

n_inputs = X_train.shape[1]

undersample_model = Sequential([
    Dense(n_inputs, input_shape=(n_inputs, ), activation='relu'),
    Dense(32, activation='relu'),
    Dense(2, activation='softmax')
])
In [73]:
undersample_model.summary()
 
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 30)                930       
_________________________________________________________________
dense_1 (Dense)              (None, 32)                992       
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 66        
=================================================================
Total params: 1,988
Trainable params: 1,988
Non-trainable params: 0
_________________________________________________________________
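
As a sanity check, the parameter counts in the summary above can be reproduced by hand: each Dense layer has inputs × units weights plus units biases (here n_inputs = 30).

# dense:   30 inputs * 30 units + 30 biases = 930
# dense_1: 30 * 32 + 32                     = 992
# dense_2: 32 * 2  + 2                      =  66
print(30*30 + 30, 30*32 + 32, 32*2 + 2)   # 930 992 66 -> 1,988 total
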
In [76]:
undersample_model.compile(Adam(lr=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
In [77]:
undersample_model.fit(X_train, y_train, validation_split=0.2, batch_size=25, epochs=20, shuffle=True, verbose=2)
 
Epoch 1/20
25/25 - 0s - loss: 1.0006 - accuracy: 0.6523 - val_loss: 0.3953 - val_accuracy: 0.8026
Epoch 2/20
25/25 - 0s - loss: 0.3127 - accuracy: 0.8874 - val_loss: 0.3401 - val_accuracy: 0.8684
Epoch 3/20
25/25 - 0s - loss: 0.2481 - accuracy: 0.9189 - val_loss: 0.3205 - val_accuracy: 0.8816
Epoch 4/20
25/25 - 0s - loss: 0.2121 - accuracy: 0.9238 - val_loss: 0.3121 - val_accuracy: 0.8816
Epoch 5/20
25/25 - 0s - loss: 0.1821 - accuracy: 0.9354 - val_loss: 0.2986 - val_accuracy: 0.8816
Epoch 6/20
25/25 - 0s - loss: 0.1615 - accuracy: 0.9421 - val_loss: 0.2941 - val_accuracy: 0.8882
Epoch 7/20
25/25 - 0s - loss: 0.1436 - accuracy: 0.9421 - val_loss: 0.2999 - val_accuracy: 0.9079
Epoch 8/20
25/25 - 0s - loss: 0.1314 - accuracy: 0.9520 - val_loss: 0.3027 - val_accuracy: 0.9013
Epoch 9/20
25/25 - 0s - loss: 0.1219 - accuracy: 0.9503 - val_loss: 0.3076 - val_accuracy: 0.9079
Epoch 10/20
25/25 - 0s - loss: 0.1138 - accuracy: 0.9536 - val_loss: 0.3133 - val_accuracy: 0.9211
Epoch 11/20
25/25 - 0s - loss: 0.1062 - accuracy: 0.9570 - val_loss: 0.3280 - val_accuracy: 0.9276
Epoch 12/20
25/25 - 0s - loss: 0.0999 - accuracy: 0.9603 - val_loss: 0.3240 - val_accuracy: 0.9276
Epoch 13/20
25/25 - 0s - loss: 0.0951 - accuracy: 0.9603 - val_loss: 0.3347 - val_accuracy: 0.9276
Epoch 14/20
25/25 - 0s - loss: 0.0929 - accuracy: 0.9636 - val_loss: 0.3394 - val_accuracy: 0.9211
Epoch 15/20
25/25 - 0s - loss: 0.0872 - accuracy: 0.9636 - val_loss: 0.3558 - val_accuracy: 0.9211
Epoch 16/20
25/25 - 0s - loss: 0.0835 - accuracy: 0.9685 - val_loss: 0.3670 - val_accuracy: 0.9211
Epoch 17/20
25/25 - 0s - loss: 0.0798 - accuracy: 0.9702 - val_loss: 0.3746 - val_accuracy: 0.9145
Epoch 18/20
25/25 - 0s - loss: 0.0772 - accuracy: 0.9719 - val_loss: 0.3783 - val_accuracy: 0.9145
Epoch 19/20
25/25 - 0s - loss: 0.0731 - accuracy: 0.9702 - val_loss: 0.3945 - val_accuracy: 0.9145
Epoch 20/20
25/25 - 0s - loss: 0.0689 - accuracy: 0.9735 - val_loss: 0.4000 - val_accuracy: 0.9145
Out[77]:
<tensorflow.python.keras.callbacks.History at 0x1264fc7a3c8>
In [78]:
undersample_prediction = undersample_model.predict(original_Xtest, batch_size=200, verbose=0)
In [79]:
undersample_fraud_predictions = undersample_model.predict_classes(original_Xtest, batch_size=200, verbose=0)
 
WARNING:tensorflow:From <ipython-input-79-4a211ea67a85>:1: Sequential.predict_classes (from tensorflow.python.keras.engine.sequential) is deprecated and will be removed after 2021-01-01.
Instructions for updating:
Please use instead:
* `np.argmax(model.predict(x), axis=-1)`, if your model does multi-class classification (e.g. if it uses a `softmax` last-layer activation).
* `(model.predict(x) > 0.5).astype("int32")`, if your model does binary classification (e.g. if it uses a `sigmoid` last-layer activation).
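
Following the warning above, the deprecated predict_classes call can be replaced with an argmax over the softmax probabilities, roughly:

import numpy as np

# Equivalent to Sequential.predict_classes for a softmax output layer
undersample_fraud_predictions = np.argmax(
    undersample_model.predict(original_Xtest, batch_size=200, verbose=0), axis=-1)
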
In [81]:
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    # Optionally normalize each row so values become per-class proportions
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=14)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.

    # Write each cell's count at the center of the cell
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    # Labels and layout only need to be applied once, outside the loop
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
In [82]:
undersample_cm = confusion_matrix(original_ytest, undersample_fraud_predictions)
actual_cm = confusion_matrix(original_ytest, original_ytest)
labels = ['No Fraud', 'Fraud']
In [83]:
fig = plt.figure(figsize=(16,8))

fig.add_subplot(221)
plot_confusion_matrix(undersample_cm, labels, title="Random UnderSample \n Confusion Matrix",
                     cmap=plt.cm.Reds)
fig.add_subplot(222)
plot_confusion_matrix(actual_cm, labels, title="Confusion Matrix \n (with 100% accuracy)", 
                      cmap=plt.cm.Greens)
 
Confusion matrix, without normalization
[[55158  1705]
 [    9    89]]
Confusion matrix, without normalization
[[56863     0]
 [    0    98]]
 
 
  • Keras, OverSampling(SMOTE)
In [84]:
n_inputs = Xsm_train.shape[1]
oversample_model = Sequential([
    Dense(n_inputs, input_shape=(n_inputs, ), activation = 'relu'),
    Dense(32, activation='relu'),
    Dense(2, activation='softmax')
])
In [85]:
oversample_model.compile(Adam(lr=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
In [86]:
oversample_model.fit(Xsm_train, ysm_train, validation_split=0.2, batch_size = 300,
                    epochs=20, shuffle=True, verbose=2)
 
Epoch 1/20
971/971 - 1s - loss: 0.0770 - accuracy: 0.9715 - val_loss: 0.0788 - val_accuracy: 0.9685
Epoch 2/20
971/971 - 1s - loss: 0.0178 - accuracy: 0.9945 - val_loss: 0.0148 - val_accuracy: 0.9975
Epoch 3/20
971/971 - 1s - loss: 0.0099 - accuracy: 0.9977 - val_loss: 0.0076 - val_accuracy: 0.9993
Epoch 4/20
971/971 - 1s - loss: 0.0070 - accuracy: 0.9985 - val_loss: 0.0049 - val_accuracy: 0.9999
Epoch 5/20
971/971 - 1s - loss: 0.0054 - accuracy: 0.9989 - val_loss: 0.0065 - val_accuracy: 0.9997
Epoch 6/20
971/971 - 1s - loss: 0.0042 - accuracy: 0.9991 - val_loss: 0.0093 - val_accuracy: 0.9994
Epoch 7/20
971/971 - 1s - loss: 0.0037 - accuracy: 0.9993 - val_loss: 0.0062 - val_accuracy: 0.9997
Epoch 8/20
971/971 - 1s - loss: 0.0034 - accuracy: 0.9993 - val_loss: 0.0022 - val_accuracy: 1.0000
Epoch 9/20
971/971 - 1s - loss: 0.0029 - accuracy: 0.9994 - val_loss: 0.0016 - val_accuracy: 1.0000
Epoch 10/20
971/971 - 1s - loss: 0.0023 - accuracy: 0.9995 - val_loss: 0.0018 - val_accuracy: 1.0000
Epoch 11/20
971/971 - 1s - loss: 0.0025 - accuracy: 0.9995 - val_loss: 0.0014 - val_accuracy: 0.9999
Epoch 12/20
971/971 - 1s - loss: 0.0018 - accuracy: 0.9996 - val_loss: 0.0035 - val_accuracy: 0.9994
Epoch 13/20
971/971 - 1s - loss: 0.0021 - accuracy: 0.9995 - val_loss: 0.0015 - val_accuracy: 1.0000
Epoch 14/20
971/971 - 1s - loss: 0.0014 - accuracy: 0.9996 - val_loss: 0.0016 - val_accuracy: 1.0000
Epoch 15/20
971/971 - 1s - loss: 0.0017 - accuracy: 0.9996 - val_loss: 5.8743e-04 - val_accuracy: 1.0000
Epoch 16/20
971/971 - 1s - loss: 0.0016 - accuracy: 0.9996 - val_loss: 0.0022 - val_accuracy: 0.9996
Epoch 17/20
971/971 - 1s - loss: 0.0015 - accuracy: 0.9997 - val_loss: 5.7087e-04 - val_accuracy: 1.0000
Epoch 18/20
971/971 - 1s - loss: 0.0013 - accuracy: 0.9997 - val_loss: 0.0086 - val_accuracy: 0.9975
Epoch 19/20
971/971 - 1s - loss: 0.0010 - accuracy: 0.9997 - val_loss: 5.7633e-04 - val_accuracy: 1.0000
Epoch 20/20
971/971 - 1s - loss: 0.0017 - accuracy: 0.9996 - val_loss: 0.0011 - val_accuracy: 1.0000
Out[86]:
<tensorflow.python.keras.callbacks.History at 0x1265221cec8>
In [90]:
oversample_fraud_predictions = oversample_model.predict_classes(original_Xtest, batch_size=200, verbose=0)
In [92]:
oversample_smote = confusion_matrix(original_ytest, oversample_fraud_predictions)
actual_cm = confusion_matrix(original_ytest, original_ytest)
labels = ['No Fraud', 'Fraud']

fig = plt.figure(figsize=(16,8))

fig.add_subplot(221)
plot_confusion_matrix(oversample_smote, labels, title="OverSample (SMOTE) \n Confusion Matrix", cmap=plt.cm.Oranges)

fig.add_subplot(222)
plot_confusion_matrix(actual_cm, labels, title="Confusion Matrix \n (with 100% accuracy)", cmap=plt.cm.Greens)
 
Confusion matrix, without normalization
[[56842    21]
 [   29    69]]
Confusion matrix, without normalization
[[56863     0]
 [    0    98]]
 
 
  • Conclusion
    • Implementing SMOTE on the imbalanced dataset helped resolve the label imbalance.
    • In some cases, the neural network trained on the OverSampled dataset predicts Fraud more accurately than the model trained on the UnderSampled dataset.
    • However, outlier removal should only be performed on the Random UnderSampled data, not on the OverSampled dataset.
    • Also, the model trained on the UnderSampled data fails to correctly detect a large number of Non-Fraud transactions: Non-Fraud transactions can be misclassified as Fraud (see the quick check below).
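
Using the confusion matrices printed above, this can be quantified with a quick check (the numbers are taken directly from the outputs):

# UnderSampling NN: [[55158, 1705], [9, 89]]  -> 1705 Non-Fraud transactions flagged as Fraud
# SMOTE NN:         [[56842,   21], [29, 69]] ->   21 Non-Fraud transactions flagged as Fraud
print(1705 / 56863, 21 / 56863)   # false-positive rate: ~3.0% vs ~0.04%
print(89 / 98, 69 / 98)           # fraud recall: ~0.91 (UnderSampling) vs ~0.70 (SMOTE)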

Go to the source code on GitHub