
[kaggle][kernel transcription] Statoil/C-CORE Iceberg Classifier Challenge

bisi 2020. 9. 25. 12:22

This transcription covers the Statoil/C-CORE Iceberg Classifier Challenge.

 

The task is binary classification on image data: is each object an iceberg or a ship?

 

According to the data description, the given data are backscatter coefficients: the satellite transmits a signal pulse at a particular incidence angle and then records the reflected signal, which is called backscatter.

 

The data description may sound difficult, but don't be intimidated; let's work through it step by step.

In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from os.path import join as opj
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pylab
plt.rcParams['figure.figsize'] = 10,10
%matplotlib inline

import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)

# import keras 
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dense, Dropout, Input, Flatten, Activation
from keras.layers import GlobalAveragePooling2D
from keras.layers import BatchNormalization
from keras.layers import Concatenate
from keras import initializers
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping
 
 

Loading the data

In [3]:
train = pd.read_json('../data/statoil-iceberg-classifier-challenge/train.json')
test = pd.read_json('../data/statoil-iceberg-classifier-challenge/test.json')
 
  • The Sentinel-1 satellite orbits about 680 km above the Earth. It transmits a signal pulse at a particular incidence angle and then records the reflected signal. That reflected signal is called backscatter, and the data we are given are backscatter coefficients.
  • ip: incidence angle of a particular pixel
  • ic: incidence angle at the image center
  • K: a constant
In [4]:
# To get 3 channels, as in an RGB image, use the two bands plus their average as the third channel.
X_band_1 = np.array([np.array(band).astype(np.float32).reshape(75,75) for band in train["band_1"]])
X_band_2 = np.array([np.array(band).astype(np.float32).reshape(75,75) for band in train["band_2"]])
X_train = np.concatenate([X_band_1[:, :, :, np.newaxis], X_band_2[:,:,:,np.newaxis], ((X_band_1+X_band_2)/2)[:,:,:,np.newaxis]], axis=-1)
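The reshape-and-stack above can be checked without the Kaggle files. A minimal sketch using synthetic stand-in bands (random values instead of the real backscatter lists):

```python
import numpy as np

# Synthetic stand-in: 4 "images", each band a flat list of 75*75 values,
# mimicking the band_1 / band_2 columns of train.json.
rng = np.random.default_rng(0)
band_1 = [rng.normal(size=75 * 75).tolist() for _ in range(4)]
band_2 = [rng.normal(size=75 * 75).tolist() for _ in range(4)]

# Same transform as the notebook cell: reshape each band to 75x75,
# then stack band 1, band 2, and their mean as three channels.
b1 = np.array([np.array(b, dtype=np.float32).reshape(75, 75) for b in band_1])
b2 = np.array([np.array(b, dtype=np.float32).reshape(75, 75) for b in band_2])
X = np.concatenate([b1[..., np.newaxis],
                    b2[..., np.newaxis],
                    ((b1 + b2) / 2)[..., np.newaxis]], axis=-1)
print(X.shape)  # (4, 75, 75, 3)
```

The resulting tensor has the `(N, 75, 75, 3)` shape that the CNN's `input_shape=(75,75,3)` expects.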
 

Inspecting the charts

In [5]:
# Plot an iceberg example.
def plotmy3d(c, name):
    data=[
        go.Surface(
            z=c
        )
    ]
    layout = go.Layout(
        title = name,
        autosize= False,
        width=700,
        height=700,
        margin=dict(
            l=65,
            r=50,
            b=65,
            t=90            
        )        
    )
    fig= go.Figure(data=data, layout=layout)
    py.iplot(fig)
plotmy3d(X_band_1[12,:,:], 'iceberg')
    
 
 
  • In radar data, an iceberg looks like a mountain, as shown above.
  • This is not an optical image; because it comes from radar scattering, the shape appears as peaks and distortions. A ship may appear as a point, or as an elongated point.
  • This is where the structural difference arises, and we can exploit that difference with a CNN.
In [6]:
plotmy3d(X_band_1[14, :, :], 'Ship')
 
 

The ship appears as an elongated point. We don't have enough image resolution to visualize the ship's actual shape, but a CNN can help with that. So let's build a CNN with Keras.

Building a CNN with Keras

In [11]:
def getModel():
    # Build the model
    gmodel = Sequential()
    
    # Conv Layer 1
    gmodel.add(Conv2D(64, kernel_size=(3,3), activation='relu', input_shape=(75,75,3)))
    gmodel.add(MaxPooling2D(pool_size=(3,3), strides=(2,2)))
    gmodel.add(Dropout(0.2))
               
    # Conv Layer 2
    gmodel.add(Conv2D(128, kernel_size=(3,3), activation='relu'))
    gmodel.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
    gmodel.add(Dropout(0.2))
    
    # Conv Layer 3
    gmodel.add(Conv2D(128, kernel_size=(3,3), activation='relu'))
    gmodel.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
    gmodel.add(Dropout(0.2))    
    
    # Conv Layer 4
    gmodel.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
    gmodel.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
    gmodel.add(Dropout(0.2))
    
    # Flatten the data for upcoming dense layers
    gmodel.add(Flatten())
    
    # Dense Layer
    gmodel.add(Dense(256))
    gmodel.add(Activation('relu'))
    gmodel.add(Dropout(0.2))
    
    #Sigmoid Layer
    gmodel.add(Dense(1))
    gmodel.add(Activation('sigmoid'))
    
    mypotim=Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
    gmodel.compile(
                loss = 'binary_crossentropy',
                optimizer=mypotim,
                metrics=['accuracy'])
    gmodel.summary()
    return gmodel

def get_callbacks(filepath, patience=2):
    es = EarlyStopping('val_loss', patience=patience, mode='min')
    msave = ModelCheckpoint(filepath, save_best_only=True)
    return [es, msave]
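`get_callbacks` stops training once `val_loss` has not improved for `patience` consecutive epochs, while `ModelCheckpoint(save_best_only=True)` keeps only the best weights seen so far. The patience logic can be illustrated without Keras; this is a plain-Python sketch of the rule, not Keras's actual implementation:

```python
def early_stop_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None.

    Mimics EarlyStopping(monitor='val_loss', mode='min'): reset the wait
    counter on every new best loss, stop after `patience` epochs without one.
    """
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# val_loss improves twice, then stalls for two epochs -> stop at epoch 3
print(early_stop_epoch([0.5, 0.4, 0.45, 0.43], patience=2))  # 3
```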
In [14]:
file_path = ".model_weights.hdf5"
callbacks = get_callbacks(filepath=file_path, patience=5)
In [16]:
target_train=train['is_iceberg']
X_train_cv, X_valid, y_train_cv, y_valid = train_test_split(X_train, target_train, random_state=1, train_size=0.75)
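Note that this split is not stratified; with an imbalanced target, passing `stratify=` to `train_test_split` keeps the iceberg/ship ratio identical in both splits. A small self-contained sketch with toy labels (not the competition data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 negatives, 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the 80/20 class ratio in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, random_state=1, train_size=0.75, stratify=y)
print(y_tr.sum(), y_va.sum())  # 15 5
```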
In [17]:
gmodel = getModel()
gmodel.fit(X_train_cv, y_train_cv, 
          batch_size=24,
          epochs=50,
          verbose=1,
          validation_data=(X_valid, y_valid),
          callbacks=callbacks)
 
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 73, 73, 64)        1792      
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 36, 36, 64)        0         
_________________________________________________________________
dropout (Dropout)            (None, 36, 36, 64)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 34, 34, 128)       73856     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 17, 17, 128)       0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 17, 17, 128)       0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 15, 15, 128)       147584    
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 7, 7, 128)         0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 7, 7, 128)         0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 5, 5, 64)          73792     
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 2, 2, 64)          0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 2, 2, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 256)               0         
_________________________________________________________________
dense (Dense)                (None, 256)               65792     
_________________________________________________________________
activation (Activation)      (None, 256)               0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 257       
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0         
=================================================================
Total params: 363,073
Trainable params: 363,073
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
51/51 [==============================] - 14s 282ms/step - loss: 0.8685 - accuracy: 0.5927 - val_loss: 0.5552 - val_accuracy: 0.6933
Epoch 2/50
51/51 [==============================] - 14s 284ms/step - loss: 0.5413 - accuracy: 0.7041 - val_loss: 0.4952 - val_accuracy: 0.7731
Epoch 3/50
51/51 [==============================] - 17s 324ms/step - loss: 0.4788 - accuracy: 0.7548 - val_loss: 0.4735 - val_accuracy: 0.7830
Epoch 4/50
51/51 [==============================] - 20s 399ms/step - loss: 0.4861 - accuracy: 0.7664 - val_loss: 0.5829 - val_accuracy: 0.7007
Epoch 5/50
51/51 [==============================] - 26s 500ms/step - loss: 0.4412 - accuracy: 0.8005 - val_loss: 0.4846 - val_accuracy: 0.7756
Epoch 6/50
51/51 [==============================] - 19s 371ms/step - loss: 0.5362 - accuracy: 0.7249 - val_loss: 0.4224 - val_accuracy: 0.8080
Epoch 7/50
51/51 [==============================] - 20s 393ms/step - loss: 0.4257 - accuracy: 0.7955 - val_loss: 0.4257 - val_accuracy: 0.7955
Epoch 8/50
51/51 [==============================] - 19s 379ms/step - loss: 0.4090 - accuracy: 0.8071 - val_loss: 0.3881 - val_accuracy: 0.8479
Epoch 9/50
51/51 [==============================] - 21s 402ms/step - loss: 0.4275 - accuracy: 0.7847 - val_loss: 0.3983 - val_accuracy: 0.8329
Epoch 10/50
51/51 [==============================] - 26s 501ms/step - loss: 0.3843 - accuracy: 0.8030 - val_loss: 0.3591 - val_accuracy: 0.8504
Epoch 11/50
51/51 [==============================] - 23s 457ms/step - loss: 0.4630 - accuracy: 0.7672 - val_loss: 0.4006 - val_accuracy: 0.8429
Epoch 12/50
51/51 [==============================] - 25s 487ms/step - loss: 0.3804 - accuracy: 0.8113 - val_loss: 0.3799 - val_accuracy: 0.8105
Epoch 13/50
51/51 [==============================] - 22s 426ms/step - loss: 0.3651 - accuracy: 0.8196 - val_loss: 0.3511 - val_accuracy: 0.8454
Epoch 14/50
51/51 [==============================] - 19s 381ms/step - loss: 0.3341 - accuracy: 0.8404 - val_loss: 0.3608 - val_accuracy: 0.8379
Epoch 15/50
51/51 [==============================] - 20s 395ms/step - loss: 0.3463 - accuracy: 0.8470 - val_loss: 0.3423 - val_accuracy: 0.8579
Epoch 16/50
51/51 [==============================] - 22s 427ms/step - loss: 0.3421 - accuracy: 0.8313 - val_loss: 0.3521 - val_accuracy: 0.8404
Epoch 17/50
51/51 [==============================] - 20s 387ms/step - loss: 0.3518 - accuracy: 0.8246 - val_loss: 0.3869 - val_accuracy: 0.8429
Epoch 18/50
51/51 [==============================] - 20s 393ms/step - loss: 0.3184 - accuracy: 0.8562 - val_loss: 0.3752 - val_accuracy: 0.8504
Epoch 19/50
51/51 [==============================] - 21s 418ms/step - loss: 0.3497 - accuracy: 0.8238 - val_loss: 0.3589 - val_accuracy: 0.8454
Epoch 20/50
51/51 [==============================] - 19s 372ms/step - loss: 0.3201 - accuracy: 0.8537 - val_loss: 0.3386 - val_accuracy: 0.8678
Epoch 21/50
51/51 [==============================] - 21s 408ms/step - loss: 0.3041 - accuracy: 0.8529 - val_loss: 0.3037 - val_accuracy: 0.8853
Epoch 22/50
51/51 [==============================] - 20s 400ms/step - loss: 0.2647 - accuracy: 0.8820 - val_loss: 0.3254 - val_accuracy: 0.8603
Epoch 23/50
51/51 [==============================] - 21s 412ms/step - loss: 0.2958 - accuracy: 0.8520 - val_loss: 0.2701 - val_accuracy: 0.8953
Epoch 24/50
51/51 [==============================] - 21s 415ms/step - loss: 0.2634 - accuracy: 0.8803 - val_loss: 0.2922 - val_accuracy: 0.8703
Epoch 25/50
51/51 [==============================] - 20s 395ms/step - loss: 0.2664 - accuracy: 0.8703 - val_loss: 0.2763 - val_accuracy: 0.8903
Epoch 26/50
51/51 [==============================] - 20s 391ms/step - loss: 0.2705 - accuracy: 0.8795 - val_loss: 0.3163 - val_accuracy: 0.8454
Epoch 27/50
51/51 [==============================] - 20s 401ms/step - loss: 0.2492 - accuracy: 0.8828 - val_loss: 0.4021 - val_accuracy: 0.8379
Epoch 28/50
51/51 [==============================] - 19s 378ms/step - loss: 0.2497 - accuracy: 0.8886 - val_loss: 0.2948 - val_accuracy: 0.8728
Out[17]:
<tensorflow.python.keras.callbacks.History at 0x1836173fc48>
In [18]:
gmodel.load_weights(filepath=file_path)
score = gmodel.evaluate(X_valid, y_valid, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
 
13/13 [==============================] - 1s 68ms/step - loss: 0.2701 - accuracy: 0.8953
Test loss: 0.2701273560523987
Test accuracy: 0.895261824131012
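The single accuracy number hides how the errors split between the two classes. One way to see this, sketched here with a hand-rolled confusion-matrix helper on toy values (`y_valid` and the model's predicted probabilities would be the real inputs):

```python
import numpy as np

def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP/FP/TN/FN after thresholding predicted probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    y_true = np.asarray(y_true).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return tp, fp, tn, fn

# Toy check: one of each outcome.
print(confusion_counts([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # (1, 1, 1, 1)
```

On the real validation split this would be `confusion_counts(y_valid, gmodel.predict(X_valid))`.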
 

Generating the submission file

In [20]:
X_band_test_1 = np.array([np.array(band).astype(np.float32).reshape(75,75) for band in test["band_1"]])
X_band_test_2 = np.array([np.array(band).astype(np.float32).reshape(75,75) for band in test["band_2"]])

X_test = np.concatenate([X_band_test_1[:,:,:, np.newaxis], X_band_test_2[:,:,:, np.newaxis],((X_band_test_1+X_band_test_2)/2)[:,:,:,np.newaxis]], axis=-1)
predicted_test = gmodel.predict_proba(X_test)
 
WARNING:tensorflow:From <ipython-input-20-454e97c95156>:5: Sequential.predict_proba (from tensorflow.python.keras.engine.sequential) is deprecated and will be removed after 2021-01-01.
Instructions for updating:
Please use `model.predict()` instead.
In [26]:
submission = pd.DataFrame()
submission['id'] = test['id']
submission['is_iceberg'] = predicted_test.reshape((predicted_test.shape[0]))
submission.to_csv('sub.csv', index=False)
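Before uploading, a quick format sanity check can save a failed submission. A minimal hypothetical helper (`check_submission` is not part of the kernel; the column requirements come from the competition's sample submission):

```python
import pandas as pd

def check_submission(df):
    """Basic format checks for the Statoil submission file."""
    assert list(df.columns) == ['id', 'is_iceberg'], "wrong columns"
    assert df['id'].is_unique, "duplicate ids"
    assert df['is_iceberg'].between(0.0, 1.0).all(), "probabilities out of [0, 1]"
    return True

# Toy frame standing in for the real submission DataFrame.
toy = pd.DataFrame({'id': ['a1', 'b2'], 'is_iceberg': [0.73, 0.02]})
print(check_submission(toy))  # True
```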

 

View the source code on GitHub