Competition/Kaggle

[kaggle][Transcription] Spooky Author Identification

bisi 2021. 1. 22. 12:10

This post's topic is the Spooky Author Identification competition.

 

The goal is to build a model that analyzes the words in sentences taken from horror stories and predicts which author wrote each one.

 

For the submission, we predict a probability for each of the three authors for every id (a unique id per sentence).

 

id, EAP, HPL, MWS
id07943,0.33,0.33,0.33
...
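As a hedged sketch of how a submission in this format could be put together at the end (the classifier clf, the vectorizer tfv, the label encoder lbl_enc, and the test dataframe are all only built later in this post; the column order relies on lbl_enc.classes_ being the alphabetical ['EAP', 'HPL', 'MWS']):

# Hypothetical submission step: clf, tfv, lbl_enc and test come from later cells in this post.
probs = clf.predict_proba(tfv.transform(test.text.values))   # one probability per author per sentence
submission = pd.DataFrame(probs, columns=lbl_enc.classes_)   # columns: EAP, HPL, MWS
submission.insert(0, 'id', test.id.values)
submission.to_csv('submission.csv', index=False)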

 

 

I followed Abhishek Thakur's Approaching (Almost) Any NLP Problem on Kaggle notebook for this NLP analysis.

 

 

 

Now let's follow along step by step.

 

 

 


 

 

 

 

1. Data Preparation

In [61]:
import pandas as pd
import numpy as np
import xgboost as xgb
from tqdm import tqdm
from sklearn.svm import SVC
from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from keras.layers import GlobalAveragePooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import  sequence, text
from keras.callbacks import EarlyStopping
from nltk import word_tokenize
In [40]:
import nltk
from nltk.corpus import stopwords  # needed for stopwords.words below

nltk.download('stopwords')
stop_words = stopwords.words('english')
 
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HANBIT\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
In [3]:
from nltk.corpus import stopwords

train = pd.read_csv('../data/spooky-author-identification/train/train.csv')
test = pd.read_csv('../data/spooky-author-identification/test/test.csv')
sample = pd.read_csv('../data/spooky-author-identification/sample_submission/sample_submission.csv')
In [4]:
train.head()
Out[4]:
  id text author
0 id26305 This process, however, afforded me no means of... EAP
1 id17569 It never once occurred to me that the fumbling... HPL
2 id11008 In his left hand was a gold snuff box, from wh... EAP
3 id27763 How lovely is spring As we looked from Windsor... MWS
4 id12958 Finding nothing else, not even gold, the Super... HPL
In [5]:
test.head()
Out[5]:
  id text
0 id02310 Still, as I urged our leaving Ireland with suc...
1 id24541 If a fire wanted fanning, it could readily be ...
2 id00134 And when they had broken down the frail door t...
3 id27757 While I was thinking how I should possibly man...
4 id04081 I am not sure to what limit his knowledge may ...
In [6]:
sample.head()
Out[6]:
  id EAP HPL MWS
0 id02310 0.403494 0.287808 0.308698
1 id24541 0.403494 0.287808 0.308698
2 id00134 0.403494 0.287808 0.308698
3 id27757 0.403494 0.287808 0.308698
4 id04081 0.403494 0.287808 0.308698
 

Our goal is to analyze the text and predict the author: EAP, HPL, or MWS. Put simply, it is a text classification problem with three classes. For evaluation, Kaggle specifies multi-class log-loss as the metric; see the competition page for details.
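For reference, with $N$ sentences, $M = 3$ authors, $y_{ij} = 1$ if sentence $i$ was written by author $j$ (and 0 otherwise), and $p_{ij}$ the predicted probability, the metric is

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log p_{ij}$$

which is exactly what the helper function below computes, with the predicted probabilities clipped to $[\epsilon, 1-\epsilon]$ to avoid $\log 0$.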

In [7]:
def multiclass_logloss(actual, predicted, eps=1e-15):
    """
    Multi class version of Logarithmic Loss metric.
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    """
    if len(actual.shape) == 1:  # if the actual values are a 1-d label array, convert them to a one-hot matrix
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vstoa = np.sum(actual * np.log(clip))
    return -1.0 / rows * vstoa
In [8]:
# Let's use scikit-learn's LabelEncoder to convert the author labels to 0, 1, 2.

lbl_enc = preprocessing.LabelEncoder()
y = lbl_enc.fit_transform(train.author.values)
y
Out[8]:
array([0, 1, 0, ..., 0, 0, 1])
In [9]:
# Next, split the data into training and validation sets using scikit-learn's model_selection.

xtrain, xvalid, ytrain, yvalid = train_test_split(train.text.values, y, stratify=y,
                                                  random_state=42, test_size=0.1, shuffle=True)
In [10]:
print( xtrain.shape)
print( xvalid.shape)

print( ytrain.shape)
print( yvalid.shape)
 
(17621,)
(1958,)
(17621,)
(1958,)
 

2. Building NLP Models

2.1 TF-IDF

The first model uses TF-IDF (Term Frequency - Inverse Document Frequency) features, followed by a simple Logistic Regression.

In [11]:
# Most kernels start with features like these.
# TfidfVectorizer: weights word counts by TF-IDF.
# min_df: minimum document frequency (DF); DF counts the number of documents a term appears in, not the number of occurrences. With min_df=3, terms appearing in only 1 or 2 documents are dropped.
# analyzer: choose between 'word' and 'char'.
# sublinear_tf: whether to smooth the TF (term frequency) values; dampens very high TF values, useful when outliers are extreme.
# ngram_range: word n-grams, e.g. "very good" only carries its full meaning when the two words are taken together.
# max_features: caps the number of features in the tf-idf vector; by default there is one feature per vocabulary index.
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}', ngram_range=(1,3), use_idf=1, smooth_idf=1, sublinear_tf=1,
                      stop_words='english')
# Fit TF-IDF on both the training and validation sets so the vectorizer learns the full vocabulary.
tfv.fit(list(xtrain) + list(xvalid))
tfv.vocabulary_  # show the vocabulary the vectorizer has learned
Out[11]:
{'hair': 5920,
 'brightest': 1489,
 'living': 7734,
 'gold': 5641,
 'despite': 3312,
 'poverty': 10050,
 'clothing': 2161,
 'set': 11813,
 'crown': 2813,
 'distinction': 3647,
 'head': 6078,
 'said': 11446,
 'oh': 9156,
 'member': 8268,
 'family': 4746,
 'niece': 8899,
 'accomplished': 72,
 'woman': 14862,
 'magistrate': 7995,
 'appeared': 537,
 'perfectly': 9656,
 'continued': 2559,
 'attentive': 814,
 'interested': 6950,
 'saw': 11546,
 'shudder': 12014,
 'horror': 6340,
 'lively': 7729,
 'surprise': 13042,
 'unmingled': 14071,
 'disbelief': 3534,
 'painted': 9422,
 'countenance': 2683,
 'horrible': 6335,
 'eyes': 4657,
 'blacker': 1239,
 'seared': 11684,
 'face': 4688,
 'opened': 9245,
 'wide': 14739,
 'expression': 4594,
 'unable': 13912,
 'interpret': 6965,
 'longer': 7805,
 'bent': 1159,
 'ground': 5837,
 'like': 7638,
 'nursed': 9023,
 'flower': 5088,
 'spring': 12530,
 'shooting': 11954,
 'strength': 12769,
 'weighed': 14644,
 'blossoms': 1292,
 'does': 3699,
 'lord': 7836,
 'say': 11563,
 'falling': 4730,
 'discord': 3555,
 'concord': 2394,
 'great': 5741,
 'sweetness': 13101,
 'music': 8710,
 'hath': 6042,
 'agreement': 281,
 'affections': 234,
 'better': 1187,
 'brown': 1536,
 'jenkin': 7092,
 'rubbing': 11281,
 'kind': 7230,
 'affectionate': 233,
 'ankles': 462,
 'black': 1230,
 'man': 8051,
 'deep': 3123,
 'mud': 8676,
 'largely': 7390,
 'concealed': 2368,
 'brown jenkin': 1537,
 'black man': 1234,
 'hours': 6369,
 'meditate': 8242,
 'till': 13535,
 'hunger': 6449,
 'fatigue': 4831,
 'brought': 1530,
 'passing': 9542,
 'hour': 6362,
 'marked': 8136,
 'long': 7782,
 'shadows': 11853,
 'cast': 1773,
 'descending': 3272,
 'sun': 12949,
 'descending sun': 3273,
 'time': 13548,
 'pierre': 9798,
 'imposed': 6622,
 'noble': 8937,
 'birth': 1215,
 'placed': 9850,
 'association': 744,
 'company': 2317,
 'weeping': 14642,
 'clung': 2176,
 'cried': 2783,
 'unkind': 14056,
 'lionel': 7683,
 'pretty': 10166,
 'miss': 8435,
 'received': 10686,
 'visits': 14433,
 'approaching': 572,
 'marriage': 8141,
 'young': 15058,
 'englishman': 4253,
 'john': 7110,
 'esq': 4376,
 'struck': 12809,
 'space': 12418,
 'elapsed': 4072,
 'madness': 7984,
 'reasonable': 10669,
 'impulse': 6650,
 'regulated': 10789,
 'actions': 123,
 'space time': 12419,
 'time elapsed': 13552,
 'did': 3400,
 'proceed': 10231,
 'stairs': 12576,
 'relation': 10807,
 'french': 5270,
 'ravings': 10617,
 'rhoby': 11101,
 'harris': 6022,
 'inhabitants': 6809,
 'shunned': 12020,
 'house': 6374,
 'imagination': 6544,
 'future': 5365,
 'discovery': 3565,
 'determine': 3346,
 'rhoby harris': 11102,
 'shunned house': 12021,
 'sitting': 12138,
 'chest': 1967,
 'spices': 12486,
 'sat': 11521,
 'realise': 10649,
 'passed': 9529,
 'time passed': 13563,
 'faint': 4715,
 'fourth': 5231,
 'handle': 5967,
 'fumes': 5334,
 'begun': 1097,
 'penetrate': 9625,
 'mask': 8162,
 'recovered': 10720,
 'hole': 6289,
 'emitting': 4160,
 'fresh': 5278,
 'vapours': 14264,
 'calmly': 1665,
 'left': 7499,
 'little': 7705,
 'previously': 10179,
 'called': 1653,
 'shaky': 11862,
 'whisper': 14716,
 'portentous': 10001,
 'loudest': 7865,
 'shriek': 11992,
 'god': 5623,
 'seeing': 11723,
 'answer': 479,
 'occupied': 9115,
 'arranging': 645,
 'cottage': 2668,
 'old': 9168,
 'walked': 14520,
 'minutes': 8405,
 'leaning': 7466,
 'arm': 628,
 'youth': 15081,
 'young woman': 15077,
 'occupied arranging': 9116,
 'old man': 9187,
 'leaning arm': 7467,
 'flies': 5063,
 'caught': 1799,
 'high': 6231,
 'born': 1375,
 'powerful': 10054,
 'bow': 1404,
 'necks': 8831,
 'flimsy': 5068,
 'unmeaning': 14069,
 'pretensions': 10164,
 'high born': 6232,
 'ape': 517,
 'approached': 570,
 'casement': 1767,
 'mutilated': 8728,
 'burden': 1578,
 'sailor': 11485,
 'shrank': 11990,
 'aghast': 266,
 'rod': 11210,
 'gliding': 5586,
 'hurried': 6457,
 'home': 6298,
 'dreading': 3824,
 'consequences': 2480,
 'butchery': 1618,
 'gladly': 5560,
 'abandoning': 2,
 'terror': 13327,
 'solicitude': 12323,
 'fate': 4817,
 'ourang': 9335,
 'outang': 9337,
 'ourang outang': 9336,
 'reward': 11097,
 'promised': 10298,
 'detested': 3351,
 'toils': 13608,
 'consolation': 2499,
 'unparalleled': 14090,
 'sufferings': 12911,
 'prospect': 10342,
 'day': 2993,
 'miserable': 8423,
 'slavery': 12180,
 'claim': 2066,
 'elizabeth': 4112,
 'forget': 5178,
 'past': 9551,
 'union': 14042,
 'think': 13396,
 'groans': 5825,
 'clerval': 2109,
 'ears': 3980,
 'beloved': 1135,
 'father': 4820,
 'words': 14896,
 'truly': 13822,
 'forgive': 5181,
 'entirely': 4313,
 'possess': 10018,
 'heart': 6119,
 'endeavoured': 4219,
 'rainbow': 10552,
 'gleams': 5578,
 'cataract': 1791,
 'd': 2899,
 'soften': 12300,
 'thy': 13514,
 'tremendous': 13776,
 'sorrows': 12377,
 'oh beloved': 9157,
 'beloved father': 1136,
 'interior': 6955,
 'apparently': 530,
 'details': 3334,
 'constantly': 2506,
 'changing': 1900,
 'certain': 1861,
 'faces': 4692,
 'furniture': 5353,
 'room': 11231,
 'doors': 3751,
 'windows': 14797,
 'just': 7160,
 'state': 12618,
 'presumably': 10157,
 'mobile': 8465,
 'objects': 9070,
 'house old': 6384,
 'old house': 9182,
 'close': 2135,
 'spot': 12517,
 'stood': 12704,
 'solitary': 12326,
 'rock': 11205,
 'conical': 2450,
 'divided': 3678,
 'mountain': 8623,
 'nature': 8797,
 'hewn': 6216,
 'pyramid': 10458,
 'labour': 7319,
 'block': 1280,
 'reduced': 10735,
 'perfect': 9650,
 'shape': 11886,
 'narrow': 8772,
 'cell': 1833,
 'beneath': 1144,
 'raymond': 10621,
 'short': 11964,
 'inscription': 6863,
 'carved': 1762,
 'stone': 12694,
 'recorded': 10716,
 'tenant': 13284,
 'cause': 1801,
 'death': 3052,
 'rock high': 11206,
 's': 11341,
 'march': 8122,
 'mountains': 8627,
 'snowy': 12279,
 'come': 2251,
 'health': 6095,
 'committing': 2292,
 'loved': 7877,
 'ones': 9221,
 'charge': 1916,
 'tree': 13770,
 'humanity': 6432,
 'send': 11764,
 'late': 7400,
 'posterity': 10031,
 'tale': 13189,
 'ante': 485,
 'pestilential': 9736,
 'race': 10525,
 'heroes': 6206,
 'sages': 11444,
 'lost': 7853,
 'things': 13380,
 'close day': 2136,
 'day s': 3008,
 'state things': 12622,
 'rim': 11151,
 'screwed': 11643,
 'large': 7384,
 'tube': 13838,
 'condenser': 2402,
 'body': 1328,
 'machine': 7956,
 'course': 2704,
 'chamber': 1883,
 'gum': 5883,
 'elastic': 4074,
 'gum elastic': 5885,
 'feel': 4876,
 'inconvenience': 6703,
 'weather': 14624,
 'busy': 1613,
 'scenes': 11601,
 'evil': 4444,
 'despair': 3302,
 'did feel': 3416,
 'soon': 12350,
 'conquered': 2466,
 'latent': 7407,
 'distaste': 3640,
 'watch': 14575,
 'perdita': 9644,
 'mind': 8380,
 'thing': 13370,
 'heard': 6102,
 'roughly': 11254,
 'study': 12827,
 'desire': 3293,
 'wisest': 14824,
 'men': 8275,
 'creation': 2761,
 'world': 14915,
 'grasp': 5717,
 'wisest men': 14825,
 'unintelligible': 14040,
 'matters': 8206,
 'justly': 7176,
 'instructed': 6912,
 'regard': 10771,
 'supreme': 13015,
 'end': 4207,
 'month': 8537,
 'suddenly': 12899,
 'quitted': 10508,
 'sic': 12033,
 'servant': 11805,
 'departed': 3230,
 'country': 2694,
 'word': 14893,
 'writing': 14977,
 'informing': 6799,
 'intentions': 6945,
 'quitted house': 10510,
 'cohort': 2202,
 'remained': 10836,
 'torches': 13651,
 'faded': 4705,
 'watched': 14576,
 'thought': 13434,
 'fantastic': 4767,
 'sky': 12160,
 'spectral': 12462,
 'luminosity': 7907,
 'flowed': 5087,
 'determined': 3347,
 'depart': 3229,
 'live': 7725,
 'leave': 7480,
 'continue': 2558,
 'exist': 4536,
 'drop': 3866,
 'enigmas': 4260,
 'resolved': 10981,
 'let': 7549,
 'ensue': 4282,
 'force': 5153,
 'passage': 9526,
 'moon': 8548,
 'resolved let': 10982,
 'force passage': 5154,
 'fact': 4697,
 'examine': 4466,
 'causes': 1803,
 'life': 7588,
 'recourse': 10718,
 'bitter': 1224,
 'swiftly': 13108,
 'shadowed': 11851,
 'joy': 7130,
 'endeavouring': 4220,
 'master': 8179,
 'passion': 9544,
 'tears': 13245,
 'threatened': 13471,
 'eyes ground': 4670,
 'automaton': 863,
 'piece': 9792,
 'distinct': 3644,
 'motion': 8605,
 'observable': 9085,
 'shoulder': 11977,
 'slight': 12202,
 'degree': 3167,
 'drapery': 3807,
 'covering': 2720,
 'just beneath': 7161,
 'left shoulder': 7510,
 'slight degree': 12203,
 'grew': 5803,
 'sterner': 12670,
 'elderly': 4081,
 'eager': 3954,
 'questions': 10490,
 'fell': 4896,
 'involuntarily': 7021,
 'convulsed': 2611,
 'lips': 7686,
 'plague': 9859,
 'insistent': 6882,
 'unendurable': 13997,
 'cacophony': 1637,
 'constant': 2504,
 'terrifying': 13325,
 'impression': 6634,
 'sounds': 12399,
 'regions': 10779,
 'trembling': 13775,
 'brink': 1506,
 'different': 3462,
 'interests': 6952,
 'absorbed': 38,
 'single': 12100,
 'contemplation': 2540,
 'horrors': 6344,
 'reached': 10630,
 'relate': 10803,
 'tedious': 13251,
 'educated': 4037,
 'unusual': 14145,
 'powers': 10056,
 'infected': 6773,
 'misanthropy': 8420,
 'subject': 12855,
 'perverse': 9730,
 'moods': 8546,
 'alternate': 373,
 'enthusiasm': 4307,
 'melancholy': 8261,
 'powers mind': 10058,
 'chase': 1932,
 'nameless': 8757,
 'entity': 4316,
 'quite': 10502,
 'trifling': 13796,
 'difficulty': 3467,
 'bag': 945,
 'attorney': 822,
 'adrian': 189,
 'introduced': 6994,
 'systematic': 13148,
 'modes': 8479,
 'proceeding': 10234,
 'metropolis': 8344,
 'stop': 12714,
 'progress': 10280,
 'prevented': 10174,
 'evils': 4448,
 'vice': 14359,
 'folly': 5130,
 'rendering': 10875,
 'awful': 906,
 'ordinary': 9292,
 'parisian': 9490,
 'gateway': 5433,
 'glazed': 5574,
 'box': 1411,
 'sliding': 12201,
 'window': 14793,
 'indicating': 6733,
 'forgot': 5183,
 'distance': 3635,
 'thee': 13350,
 'eye': 4646,
 'removed': 10869,
 'glass': 5570,
 'scarce': 11585,
 'discern': 3537,
 'forms': 5200,
 'crowd': 2810,
 'mile': 8367,
 'surrounded': 13048,
 'gate': 5430,
 'form': 5186,
 'learned': 7473,
 'knowing': 7293,
 'advanced': 195,
 'age': 258,
 'journey': 7125,
 'wretched': 14960,
 'sickness': 12039,
 'make': 8027,
 'spared': 12429,
 'grief': 5814,
 'concealing': 2369,
 'extent': 4611,
 'disorder': 3599,
 'father s': 4823,
 'long journey': 7793,
 'shadowy': 11854,
 'solitude': 12327,
 'longing': 7810,
 'light': 7614,
 'frantic': 5253,
 'rest': 11006,
 'lifted': 7610,
 'entreating': 4323,
 'hands': 5970,
 'ruined': 11304,
 'tower': 13683,
 'forest': 5173,
 'unknown': 14058,
 'outer': 9339,
 'second': 11698,
 'brilliant': 1493,
 'air': 297,
 'art': 660,
 'wonders': 14880,
 'said second': 11472,
 'prime': 10190,
 'festivals': 4938,
 'held': 6172,
 'weary': 14623,
 'talking': 13197,
 'dreaming': 3832,
 'perdita s': 9645,
 's cottage': 11356,
 'perdita s cottage': 9646,
 'months': 8539,
 'eve': 4418,
 'talk': 13194,
 'queer': 10479,
 'earth': 3981,
 'noises': 8949,
 'clear': 2100,
 'arkham': 627,
 'night': 8902,
 'lately': 7405,
 'community': 2309,
 'dead': 3022,
 'burnt': 1594,
 'order': 9287,
 'prevent': 10172,
 'injurious': 6827,
 'public': 10388,
 'peace': 9592,
 'imagine': 6548,
 'point': 9936,
 'view': 14371,
 'section': 11715,
 'point view': 9939,
 'wilbur': 14755,
 'growing': 5846,
 'uncannily': 13935,
 'looked': 7815,
 'boy': 1415,
 'entered': 4290,
 'year': 15010,
 'looked like': 7817,
 'checked': 1946,
 'anxiety': 505,
 'doomed': 3733,
 'toil': 13606,
 'mines': 8390,
 'unwholesome': 14158,
 'trade': 13703,
 'artist': 679,
 'favourite': 4840,
 'employment': 4176,
 'kinds': 7237,
 'o': 9032,
 'cities': 2043,
 'sea': 11664,
 'island': 7050,
 'heaved': 6139,
 'thar': 13347,
 'irony': 7038,
 'ryland': 11339,
 'roused': 11262,
 'resistance': 10974,
 'asserted': 730,
 'permitted': 9683,
 'encrease': 4201,
 'party': 9520,
 'indulgence': 6750,
 'sweep': 13091,
 'away': 900,
 'cobwebs': 2193,
 'blinded': 1274,
 'countrymen': 2697,
 'came': 1667,
 'gently': 5496,
 'stealthily': 12645,
 'attained': 797,
 'appreciation': 563,
 'spirit': 12494,
 'length': 7531,
 'properly': 10318,
 'entertain': 4302,
 'figures': 4970,
 'judges': 7142,
 'vanished': 14255,
 'tall': 13199,
 'candles': 1692,
 'sank': 11510,
 'nothingness': 8995,
 'flames': 5036,
 'went': 14655,
 'utterly': 14213,
 'blackness': 1243,
 'darkness': 2968,
 'supervened': 12992,
 'sensations': 11767,
 'swallowed': 13079,
 'mad': 7961,
 'rushing': 11330,
 'descent': 3275,
 'soul': 12386,
 'carriage': 1750,
 'measured': 8227,
 'express': 4590,
 'rectangular': 10723,
 'precision': 10101,
 'attending': 809,
 'movement': 8639,
 'observed': 9089,
 'diminutive': 3491,
 'figure': 4967,
 'affectation': 228,
 'constraint': 2512,
 'noticed': 8998,
 'gentleman': 5489,
 'dimensions': 3486,
 'readily': 10641,
 'account': 83,
 'reserve': 10963,
 'hauteur': 6054,
 'sense': 11768,
 'dignity': 3474,
 'colossal': 2237,
 'proportion': 10325,
 'wholly': 14733,
 'sorry': 12378,
 'loss': 7851,
 'abysses': 53,
 'closely': 2144,
 'written': 14979,
 'sheets': 11908,
 'explained': 4573,
 'erich': 4358,
 'zann': 15092,
 'erich zann': 4359,
 'pulled': 10400,
 'steel': 12649,
 'claws': 2093,
 'neck': 8830,
 'dragged': 3797,
 'beldame': 1113,
 'edge': 4024,
 'gulf': 5880,
 'access': 61,
 'closed': 2140,
 'lovely': 7881,
 'child': 1975,
 'nearly': 8820,
 'years': 15015,
 'nearly years': 8823,
 'years age': 15016,
 'hoary': 6276,
 'forth': 5207,
 'hand': 5955,
 'helped': 6180,
 'olney': 9212,
 'host': 6351,
 'vast': 14283,
 'shell': 11910,
 'conches': 2386,
 'wild': 14761,
 'awesome': 905,
 'clamour': 2070,
 'uncle': 13943,
 'lay': 7435,
 'carelessly': 1738,
 'dug': 3908,
 'open': 9232,
 'pit': 9833,
 'angry': 451,
 'framed': 5245,
 'straggling': 12731,
 'locks': 7761,
 'cornered': 2639,
 'hats': 6046,
 'cornered hats': 2640,
 'aware': 899,
 'apparent': 528,
 'paradox': 9471,
 'occasioned': 9110,
 'visual': 14436,
 'area': 610,
 'susceptible': 13062,
 'feeble': 4871,
 'impressions': 6637,
 'exterior': 4612,
 'portions': 10007,
 'retina': 11038,
 'mercy': 8310,
 'known': 7295,
 'hill': 6248,
 'remote': 10862,
 'bit': 1220,
 'backwoods': 936,
 'seat': 11688,
 'uncomfortable': 13948,
 'superstitions': 12990,
 'known better': 7296,
 'balloon': 955,
 'ascensions': 693,
 'considerable': 2485,
 'height': 6165,
 'pain': 9413,
 'respiration': 11000,
 'uneasiness': 13994,
 'experienced': 4563,
 'accompanied': 67,
 'bleeding': 1261,
 'nose': 8984,
 'symptoms': 13145,
 'alarming': 307,
 'inconvenient': 6704,
 'altitude': 376,
 'great uneasiness': 5775,
 'circulation': 2033,
 'property': 10320,
 'supported': 13002,
 'wants': 14547,
 'society': 12293,
 'sudden': 12898,
 'hideous': 6224,
 'boundaries': 1395,
 'private': 10213,
 'possession': 10022,
 'thrown': 13495,
 'products': 10255,
 'human': 6412,
 'present': 10130,
 'existing': 4540,
 'far': 4770,
 'thinned': 13405,
 'generation': 5475,
 'possibly': 10029,
 'consume': 2523,
 'tried': 13791,
 'organ': 9297,
 'considered': 2490,
 'cavities': 1816,
 'result': 11022,
 'action': 122,
 'water': 14584,
 'believed': 1123,
 'america': 387,
 'taint': 13176,
 'yellow': 15031,
 'fever': 4942,
 'epidemic': 4335,
 'gifted': 5526,
 'virulence': 14414,
 'unexplored': 14006,
 'unfrequently': 14020,
 'visited': 14428,
 'recess': 10693,
 'amid': 393,
 'woods': 14889,
 'groves': 5844,
 'moment': 8495,
 'imagined': 6549,
 'evening': 4421,
 'mother': 8597,
 'sister': 12132,
 'arrived': 653,
 'mother sister': 8603,
 'consented': 2478,
 'request': 10943,
 'taking': 13184,
 'followed': 5123,
 'wonderful': 14874,
 'ingenious': 6802,
 'add': 144,
 'mr': 8643,
 'believe': 1120,
 'useful': 14194,
 'mechanical': 8231,
 'contrivances': 2587,
 'daily': 2916,
 'springing': 12532,
 'ah': 284,
 'sure': 13018,
 'needless': 8838,
 'general': 5465,
 'smith': 12259,
 'heightened': 6168,
 'exalted': 4462,
 'opinion': 9260,
 'valuable': 14247,
 'enjoy': 4262,
 'invention': 7009,
 'needless say': 8839,
 'legs': 7527,
 'portion': 10002,
 'story': 12727,
 'ethelred': 4402,
 'hero': 6205,
 'having': 6056,
 'sought': 12383,
 'vain': 14229,
 'admission': 176,
 'dwelling': 3942,
 'hermit': 6204,
 'proceeds': 10236,
 'good': 5653,
 'entrance': 4318,
 'obviously': 9105,
 'prominent': 10295,
 'incongruities': 6696,
 'deductions': 3117,
 'shall': 11863,
 'lead': 7454,
 'truth': 13831,
 'receive': 10685,
 'hermann': 6203,
 'letter': 7562,
 'matter': 8202,
 'conversation': 2596,
 'inner': 6839,
 'everlasting': 4434,
 'treatise': 13767,
 'duelli': 3904,
 'lex': 7571,
 'scripta': 11645,
 'et': 4393,
 'non': 8953,
 'aliterque': 333,
 'matter course': 8203,
 'duelli lex': 3905,
 'lex scripta': 7572,
 'scripta et': 11646,
 'et non': 4394,
 'non aliterque': 8954,
 'duelli lex scripta': 3906,
 'lex scripta et': 7573,
 'scripta et non': 11647,
 'et non aliterque': 4395,
 'wyatt': 14988,
 'rooms': 11238,
 'cabin': 1634,
 'separated': 11791,
 'main': 8008,
 'door': 3734,
 'locked': 7755,
 'wyatt s': 14989,
 'door locked': 3746,
 'pursued': 10445,
 'enemy': 4236,
 'wings': 14806,
 'feet': 4884,
 'flew': 5058,
 'apartment': 514,
 'dismissed': 3597,
 'attendants': 807,
 'threw': 13475,
 'wildly': 14769,
 'floor': 5077,
 'blood': 1284,
 'suppress': 13012,
 'shrieks': 11996,
 'prey': 10180,
 'vulture': 14487,
 'striving': 12796,
 'multitudinous': 8685,
 'ideas': 6498,
 'horrid': 6337,
 'furies': 5346,
 'cruel': 2818,
 'poured': 10047,
 'swift': 13107,
 'succession': 12892,
 'wound': 14941,
 'worked': 14908,
 'devil': 3359,
 'vow': 14480,
 'vengeance': 14318,
 'devote': 3370,
 'fiend': 4956,
 'torture': 13664,
 'involuntary': 7022,
 'got': 5684,
 'oven': 9356,
 'alive': 334,
 'certainly': 1867,
 'turn': 13862,
 'scrutinizing': 11654,
 'machinery': 7959,
 'moving': 8642,
 'mechanism': 8234,
 'changed': 1895,
 'position': 10014,
 'accounted': 85,
 'simple': 12087,
 'laws': 7434,
 'perspective': 9711,
 'subsequent': 12869,
 'examinations': 4465,
 'convinced': 2609,
 'undue': 13988,
 'alterations': 370,
 'mirrors': 8416,
 'trunk': 13826,
 'organic': 9298,
 'tended': 13288,
 'awake': 893,
 'vague': 14227,
 'memories': 8272,
 'conscious': 2474,
 'idea': 6495,
 'mockingly': 8470,
 'resembled': 10957,
 'suggested': 12922,
 'pity': 9846,
 'charmion': 1927,
 'majesty': 8024,
 'speculative': 12469,
 'merged': 8317,
 'august': 847,
 'oh god': 9159,
 'told': 13612,
 'eighty': 4066,
 'tragedies': 13709,
 'ninety': 8927,
 'speeches': 12472,
 'treatises': 13768,
 'eighth': 4065,
 'book': 1360,
 'hymns': 6479,
 'homer': 6300,
 'junior': 7155,
 'refused': 10762,
 'dropped': 3867,
 'later': 7408,
 'moment later': 8498,
 'reply': 10914,
 'know': 7281,
 ...}
In [12]:
xtrain_tfv = tfv.transform(xtrain)
xvalid_tfv = tfv.transform(xvalid)
 
 

2.2 Logistic Regression

In [13]:
# Fit a simple Logistic Regression on the TF-IDF features.
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print("logloss: %0.3f " %multiclass_logloss(yvalid, predictions))
 
logloss: 0.572 
 
c:\users\hanbit\appdata\local\programs\python\python37\lib\site-packages\sklearn\linear_model\_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
 

We get a multiclass logloss of 0.572.
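Incidentally, the ConvergenceWarning above means the lbfgs solver hit its iteration limit. A hedged tweak, not part of the original run, is simply to raise max_iter:

# Same model, larger iteration budget so the lbfgs solver can converge.
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)
print("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))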

2.3 Using Count Features

  • To get a better score, let's apply the same model to different features.
  • Instead of TF-IDF, we can use word counts as the features.
  • This is easy to do with scikit-learn's CountVectorizer.
In [14]:
# CountVectorizer: tokenizes the documents into words and counts each word, producing a bag-of-words (BOW) encoding.
ctv = CountVectorizer(analyzer='word', token_pattern= r'\w{1,}', ngram_range=(1,3), stop_words='english')

# Fit the count vectorizer on both the training and validation sets.

ctv.fit(list(xtrain) + list(xvalid))
xtrain_ctv = ctv.transform(xtrain)
xvalid_ctv = ctv.transform(xvalid)
In [15]:
clf  = LogisticRegression(C=1.0)
clf.fit(xtrain_ctv,ytrain)
predictions = clf.predict_proba(xvalid_ctv)

print("logloss: %0.3f " %multiclass_logloss(yvalid, predictions))
 
logloss: 0.527 
 
c:\users\hanbit\appdata\local\programs\python\python37\lib\site-packages\sklearn\linear_model\_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
 

With a logloss of 0.527, the count features actually improve on the TF-IDF result above.

2.4 Naive-Bayes

Next, let's try Naive Bayes, a simple model that was hugely popular in its day. Let's first see what happens when we apply naive bayes to the two feature sets.

In [16]:
clf = MultinomialNB()
clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print('logloss: %0.3f' % multiclass_logloss(yvalid, predictions))
 
logloss: 0.578
 

This is roughly on par with the TF-IDF logistic regression above, but the count-based logistic regression still does better. What happens if we use this model on the count data instead?

In [17]:
clf = MultinomialNB()
clf.fit(xtrain_ctv, ytrain)
predictions = clf.predict_proba(xvalid_ctv)

print("logloss: %0.3f" % multiclass_logloss(yvalid, predictions))
 
logloss: 0.485
 

At 0.485 this is the best score so far, so the old-school model actually holds up quite well.

2.5 SVM

This time let's apply an SVM. SVMs take a long time to train, so before applying the SVM we reduce the number of features from the TF-IDF matrix using Singular Value Decomposition (SVD).

Also, note that the data needs to be standardized before applying the SVM.

In [18]:
svd = decomposition.TruncatedSVD(n_components=120)
svd.fit(xtrain_tfv)
xtrain_svd = svd.transform(xtrain_tfv)
xvalid_svd = svd.transform(xvalid_tfv)

# Scale the data obtained from the SVD. The variable names are reused.
scl = preprocessing.StandardScaler()
scl.fit(xtrain_svd)
xtrain_svd_scl = scl.transform(xtrain_svd)
xvalid_svd_scl = scl.transform(xvalid_svd)
In [19]:
# Now let's fit the SVM.
clf = SVC(C=1.0, probability=True)
clf.fit(xtrain_svd_scl, ytrain)
predictions = clf.predict_proba(xvalid_svd_scl)

print("logloss: %0.3f" % multiclass_logloss(yvalid, predictions))
 
logloss: 0.732
 

2.6 xgboost

Next, let's apply xgboost, which is popular on Kaggle.

In [20]:
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, subsample=0.8,
                        nthread=10, learning_rate=0.1)
clf.fit(xtrain_tfv.tocsc(), ytrain)
predictions = clf.predict_proba(xvalid_tfv.tocsc())

print("logloss: %0.3f "% multiclass_logloss(yvalid, predictions))
 
logloss: 0.781 
In [21]:
xtrain_ctv.tocsc()
Out[21]:
<17621x400266 sparse matrix of type '<class 'numpy.int64'>'
	with 556265 stored elements in Compressed Sparse Column format>
In [22]:
# Let's run the same algorithm a few more times, this time on the count features.
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8,
                       subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_ctv.tocsc(), ytrain)
predictions = clf.predict_proba(xvalid_ctv.tocsc())

print("logloss: %0.3f" % multiclass_logloss(yvalid, predictions))
 
logloss: 0.772
In [23]:
# Once more with the same settings.
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8,
                       subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_ctv.tocsc(), ytrain)
predictions = clf.predict_proba(xvalid_ctv.tocsc())

print("logloss: %0.3f" % multiclass_logloss(yvalid, predictions))
 
logloss: 0.772
In [24]:
# This time, set only the nthread option (default hyperparameters), on the SVD features.

clf = xgb.XGBClassifier(nthread=10)
clf.fit(xtrain_svd, ytrain)
predictions=clf.predict_proba(xvalid_svd)

print("logloss : %0.3f" % multiclass_logloss(yvalid, predictions))
 
logloss : 0.796
 

Looking at the results, xgboost seems to be out of luck here. But this isn't a definitive result, because we haven't done any hyperparameter optimization. Let's tackle that now.

 

Grid search is a hyperparameter optimization technique. It is not very efficient, but if you know which grid you want to search, it can give good results. Following this post, we specify the parameters that are generally worth searching.

It's worth memorizing the most commonly used parameters; many hyperparameters can be tuned, and the tuning may or may not pay off. In this section we walk through grid search using logistic regression. Before starting the grid search, we need a scoring function, which we build with scikit-learn's make_scorer.

In [25]:
mll_scorer = metrics.make_scorer(multiclass_logloss, greater_is_better=False, needs_proba=True)
 

Next we need a pipeline. In this example the pipeline consists of SVD, scaling, and logistic regression. Using several modules in the pipeline is better than using just one.

In [45]:
# Initialize SVD
svd = TruncatedSVD()
# Initialize the standard scaler
scl = preprocessing.StandardScaler()
# We will use logistic regression here.
lr_model = LogisticRegression()
# Create the pipeline
clf = pipeline.Pipeline([('svd', svd),
                        ('scl',scl),
                        ('lr', lr_model)])
 

Now we need the parameter grid.

In [48]:
param_grid = {'svd__n_components' : [120,180],
             'lr__C' : [0.1, 1.0, 10],
             'lr__penalty': ['l1', 'l2']}
 

For the SVD we evaluate 120 and 180 components, and for the logistic regression we evaluate three values of C with L1 and L2 penalties. Now we can start the grid search over these parameters.

In [57]:
# Initialize Grid Search Model
model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
                                 verbose=10, n_jobs=-1, iid=True, refit=True, cv=2)

# Fit Grid Search Model
model.fit(xtrain_tfv, ytrain)  # we can use the full data here but im only using xtrain
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
    
 
Fitting 2 folds for each of 6 candidates, totalling 12 fits
 
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done   3 out of  12 | elapsed:    2.3s remaining:    7.2s
[Parallel(n_jobs=-1)]: Done   5 out of  12 | elapsed:    2.4s remaining:    3.4s
[Parallel(n_jobs=-1)]: Done   7 out of  12 | elapsed:    2.4s remaining:    1.7s
[Parallel(n_jobs=-1)]: Done   9 out of  12 | elapsed:    2.4s remaining:    0.7s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    2.4s finished
c:\users\hanbit\appdata\local\programs\python\python37\lib\site-packages\sklearn\model_selection\_search.py:849: FutureWarning: The parameter 'iid' is deprecated in 0.22 and will be removed in 0.24.
  "removed in 0.24.", FutureWarning
 
Best score: -0.492
Best parameters set:
	nb__alpha: 0.1
 

This gives a better score than the plain SVM. The same technique can also be used to fine-tune xgboost (a hedged sketch follows) as well as multinomial naive bayes; here we apply it to the tf-idf features.
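As a hedged sketch (not run in the original post, and the grid values are only illustrative), the analogous xgboost search could look like this, reusing mll_scorer and the tf-idf features:

# Hypothetical xgboost grid search, reusing mll_scorer and the tf-idf features.
xgb_model = xgb.XGBClassifier(nthread=10)
clf = pipeline.Pipeline([('xgb', xgb_model)])

param_grid = {'xgb__max_depth': [3, 7],
              'xgb__n_estimators': [100, 200],
              'xgb__learning_rate': [0.05, 0.1]}

model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
                     verbose=10, n_jobs=-1, refit=True, cv=2)
model.fit(xtrain_tfv.tocsc(), ytrain)   # xgboost prefers CSC for sparse input
print("Best score: %0.3f" % model.best_score_)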

In [54]:
nb_model = MultinomialNB()

# Create the pipeline 
clf = pipeline.Pipeline([('nb', nb_model)])

# parameter grid
param_grid = {'nb__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# Initialize Grid Search Model
model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
                                 verbose=10, n_jobs=-1, iid=True, refit=True, cv=2)

# Fit Grid Search Model
model.fit(xtrain_tfv, ytrain)  # we can use the full data here but im only using xtrain. 
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
 
Fitting 2 folds for each of 6 candidates, totalling 12 fits
 
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    2.8s
[Parallel(n_jobs=-1)]: Done   3 out of  12 | elapsed:    2.8s remaining:    8.6s
[Parallel(n_jobs=-1)]: Done   5 out of  12 | elapsed:    2.8s remaining:    4.0s
[Parallel(n_jobs=-1)]: Done   7 out of  12 | elapsed:    2.8s remaining:    2.0s
[Parallel(n_jobs=-1)]: Done   9 out of  12 | elapsed:    2.9s remaining:    0.9s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    2.9s finished
c:\users\hanbit\appdata\local\programs\python\python37\lib\site-packages\sklearn\model_selection\_search.py:849: FutureWarning: The parameter 'iid' is deprecated in 0.22 and will be removed in 0.24.
  "removed in 0.24.", FutureWarning
 
Best score: -0.492
Best parameters set:
	nb__alpha: 0.1
 

The result is probably around an 8% improvement over the plain naive bayes score. In NLP problems it is common to look at word vectors, which give a lot of insight into the data. If you want to work through that part in detail, refer to the kernel on Kaggle; here we just load the GloVe vectors (a quick similarity check is sketched right after) and then go straight into deep learning.

 

2.7 WordVector

In [69]:
# load the GloVe vectors in a dictionary:

embeddings_index = {}
f = open('../data/glove.840B.300d.txt', encoding='utf8')
for line in tqdm(f):
    values = line.split()
    word = ''.join(values[:-300])
    coefs = np.asarray(values[-300:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))
 
2196017it [04:42, 7783.76it/s] 
 
Found 2195892 word vectors.
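As a quick, hedged sanity check of the remark above that word vectors carry useful structure, we can compare a few of the loaded GloVe vectors with cosine similarity (the example words are just assumptions about what is in the vocabulary):

def cosine_sim(u, v):
    # cosine similarity between two GloVe vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# related pair vs. unrelated pair; skip words missing from the embedding index
for a, b in [('ghost', 'spectre'), ('ghost', 'table')]:
    if a in embeddings_index and b in embeddings_index:
        print(a, b, round(cosine_sim(embeddings_index[a], embeddings_index[b]), 3))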
 
 

2.8 Deep Learning

This is the era of deep learning, and we can't get by without training a neural network. Here we train on the GloVe features with an LSTM and a simple dense network. Let's start with the dense network.

In [41]:
import nltk
nltk.download('punkt')
# this function creates a normalized vector for the whole sentence
def sent2vec(s):
#     words = str(s).lower().decode('utf-8')
    words = str(s).lower()
    words = word_tokenize(words)
    words = [w for w in words if not w in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(embeddings_index[w])
        except:
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    if type(v) != np.ndarray:
        return np.zeros(300)
    return v / np.sqrt((v ** 2).sum())
 
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HANBIT\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
In [42]:
xtrain_glove = [sent2vec(x) for x in tqdm(xtrain)]
xvalid_glove = [sent2vec(x) for x in tqdm(xvalid)]
 
100%|██████████████████████████████████████████████████████████████████████████| 17621/17621 [00:04<00:00, 3725.01it/s]
100%|████████████████████████████████████████████████████████████████████████████| 1958/1958 [00:00<00:00, 3372.05it/s]
In [58]:
scl = preprocessing.StandardScaler()
xtrain_glove_scl = scl.fit_transform(xtrain_glove)
xvalid_glove_scl = scl.transform(xvalid_glove)
In [59]:
ytrain_enc = np_utils.to_categorical(ytrain)
yvalid_enc = np_utils.to_categorical(yvalid)
In [62]:
model = Sequential()
model.add(Dense(300, input_dim=300, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(Dense(300, activation='relu'))
model.add(Dropout(0.3))
model.add(BatchNormalization())

model.add(Dense(3))
model.add(Activation('softmax'))

# compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam')
In [63]:
model.fit(xtrain_glove_scl, y= ytrain_enc, batch_size=64, epochs=5, verbose=1,
         validation_data=(xvalid_glove_scl, yvalid_enc))
 
Epoch 1/5
276/276 [==============================] - 1s 4ms/step - loss: 1.0885 - val_loss: 1.0876
Epoch 2/5
276/276 [==============================] - 1s 3ms/step - loss: 1.0880 - val_loss: 1.0877
Epoch 3/5
276/276 [==============================] - 1s 4ms/step - loss: 1.0881 - val_loss: 1.0878
Epoch 4/5
276/276 [==============================] - 1s 3ms/step - loss: 1.0881 - val_loss: 1.0878
Epoch 5/5
276/276 [==============================] - 1s 4ms/step - loss: 1.0881 - val_loss: 1.0884
Out[63]:
<tensorflow.python.keras.callbacks.History at 0x1f28e2b19c8>
 

To get better results we would keep tuning the network's parameters, add more layers, and increase dropout. The point is that a neural network is quick to build and run even without much optimization; in this particular run, though, the validation loss stayed around 1.09, so it still needs more work before it can compete with xgboost.

To go further with an LSTM, we first need to tokenize the text data.

In [65]:
# Tokenize using Keras
token = text.Tokenizer(num_words=None)
max_len = 70

token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

# zero pad the sequences

xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index
In [73]:
# create an embedding matrix for the words we have in the dataset

embedding_matrix = np.zeros((len(word_index) + 1, 300))

for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
 
100%|████████████████████████████████████████████████████████████████████████| 25943/25943 [00:00<00:00, 306031.39it/s]
In [76]:
# A simple LSTM with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
In [77]:
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100,
         verbose=1, validation_data=(xvalid_pad, yvalid_enc))
 
Epoch 1/100
35/35 [==============================] - 61s 2s/step - loss: 1.0650 - val_loss: 0.9455
Epoch 2/100
35/35 [==============================] - 65s 2s/step - loss: 0.9160 - val_loss: 0.7598
Epoch 3/100
35/35 [==============================] - 68s 2s/step - loss: 0.8155 - val_loss: 0.7086
Epoch 4/100
35/35 [==============================] - 71s 2s/step - loss: 0.7866 - val_loss: 0.6971
Epoch 5/100
35/35 [==============================] - 73s 2s/step - loss: 0.7670 - val_loss: 0.6734
Epoch 6/100
35/35 [==============================] - 71s 2s/step - loss: 0.7370 - val_loss: 0.6497
Epoch 7/100
35/35 [==============================] - 71s 2s/step - loss: 0.7283 - val_loss: 0.6542
Epoch 8/100
35/35 [==============================] - 71s 2s/step - loss: 0.7057 - val_loss: 0.6409
Epoch 9/100
35/35 [==============================] - 71s 2s/step - loss: 0.6858 - val_loss: 0.6169
Epoch 10/100
35/35 [==============================] - 71s 2s/step - loss: 0.6722 - val_loss: 0.5906
Epoch 11/100
35/35 [==============================] - 72s 2s/step - loss: 0.6454 - val_loss: 0.6014
Epoch 12/100
35/35 [==============================] - 71s 2s/step - loss: 0.6291 - val_loss: 0.5930
Epoch 13/100
35/35 [==============================] - 72s 2s/step - loss: 0.6134 - val_loss: 0.5683
Epoch 14/100
35/35 [==============================] - 73s 2s/step - loss: 0.5853 - val_loss: 0.5478
Epoch 15/100
35/35 [==============================] - 81s 2s/step - loss: 0.5830 - val_loss: 0.5467
Epoch 16/100
35/35 [==============================] - 81s 2s/step - loss: 0.5651 - val_loss: 0.5351
Epoch 17/100
35/35 [==============================] - 81s 2s/step - loss: 0.5450 - val_loss: 0.5247
Epoch 18/100
35/35 [==============================] - 81s 2s/step - loss: 0.5430 - val_loss: 0.5179
Epoch 19/100
35/35 [==============================] - 81s 2s/step - loss: 0.5278 - val_loss: 0.5287
Epoch 20/100
35/35 [==============================] - 80s 2s/step - loss: 0.5114 - val_loss: 0.5227
Epoch 21/100
35/35 [==============================] - 81s 2s/step - loss: 0.5010 - val_loss: 0.5438
Epoch 22/100
35/35 [==============================] - 80s 2s/step - loss: 0.4936 - val_loss: 0.5302
Epoch 23/100
35/35 [==============================] - 80s 2s/step - loss: 0.4827 - val_loss: 0.5128
Epoch 24/100
35/35 [==============================] - 71s 2s/step - loss: 0.4700 - val_loss: 0.5106
Epoch 25/100
35/35 [==============================] - 80s 2s/step - loss: 0.4727 - val_loss: 0.5080
Epoch 26/100
35/35 [==============================] - 81s 2s/step - loss: 0.4523 - val_loss: 0.5093
Epoch 27/100
35/35 [==============================] - 81s 2s/step - loss: 0.4435 - val_loss: 0.4928
Epoch 28/100
35/35 [==============================] - 81s 2s/step - loss: 0.4363 - val_loss: 0.5070
Epoch 29/100
35/35 [==============================] - 75s 2s/step - loss: 0.4356 - val_loss: 0.4948
Epoch 30/100
35/35 [==============================] - 81s 2s/step - loss: 0.4299 - val_loss: 0.5202
Epoch 31/100
35/35 [==============================] - 81s 2s/step - loss: 0.4138 - val_loss: 0.5015
Epoch 32/100
35/35 [==============================] - 81s 2s/step - loss: 0.4090 - val_loss: 0.4953
Epoch 33/100
35/35 [==============================] - 81s 2s/step - loss: 0.4004 - val_loss: 0.5007
Epoch 34/100
35/35 [==============================] - 81s 2s/step - loss: 0.3959 - val_loss: 0.5078
Epoch 35/100
35/35 [==============================] - 81s 2s/step - loss: 0.3970 - val_loss: 0.5134
Epoch 36/100
35/35 [==============================] - 81s 2s/step - loss: 0.3858 - val_loss: 0.5027
Epoch 37/100
35/35 [==============================] - 88s 3s/step - loss: 0.3715 - val_loss: 0.5112
Epoch 38/100
35/35 [==============================] - 88s 3s/step - loss: 0.3702 - val_loss: 0.5005
Epoch 39/100
35/35 [==============================] - 82s 2s/step - loss: 0.3692 - val_loss: 0.5119
Epoch 40/100
35/35 [==============================] - 81s 2s/step - loss: 0.3644 - val_loss: 0.4920
Epoch 41/100
35/35 [==============================] - 81s 2s/step - loss: 0.3574 - val_loss: 0.5118
Epoch 42/100
35/35 [==============================] - 81s 2s/step - loss: 0.3527 - val_loss: 0.5238
Epoch 43/100
35/35 [==============================] - 81s 2s/step - loss: 0.3464 - val_loss: 0.5599
Epoch 44/100
35/35 [==============================] - 81s 2s/step - loss: 0.3429 - val_loss: 0.4988
Epoch 45/100
35/35 [==============================] - 87s 2s/step - loss: 0.3304 - val_loss: 0.5184
Epoch 46/100
35/35 [==============================] - 87s 2s/step - loss: 0.3303 - val_loss: 0.5414
Epoch 47/100
35/35 [==============================] - 86s 2s/step - loss: 0.3256 - val_loss: 0.5400
Epoch 48/100
35/35 [==============================] - 89s 3s/step - loss: 0.3325 - val_loss: 0.5493
Epoch 49/100
35/35 [==============================] - 86s 2s/step - loss: 0.3227 - val_loss: 0.5107
Epoch 50/100
35/35 [==============================] - 89s 3s/step - loss: 0.3122 - val_loss: 0.5299
Epoch 51/100
35/35 [==============================] - 76s 2s/step - loss: 0.3088 - val_loss: 0.5238
Epoch 52/100
35/35 [==============================] - 83s 2s/step - loss: 0.3141 - val_loss: 0.5299
Epoch 53/100
35/35 [==============================] - 90s 3s/step - loss: 0.3148 - val_loss: 0.5105
Epoch 54/100
35/35 [==============================] - 79s 2s/step - loss: 0.3032 - val_loss: 0.5315
Epoch 55/100
35/35 [==============================] - 80s 2s/step - loss: 0.2961 - val_loss: 0.5659
Epoch 56/100
35/35 [==============================] - 80s 2s/step - loss: 0.3027 - val_loss: 0.5749
Epoch 57/100
35/35 [==============================] - 81s 2s/step - loss: 0.2878 - val_loss: 0.5299
Epoch 58/100
35/35 [==============================] - 81s 2s/step - loss: 0.2955 - val_loss: 0.5514
Epoch 59/100
35/35 [==============================] - 80s 2s/step - loss: 0.2784 - val_loss: 0.5671
Epoch 60/100
35/35 [==============================] - 79s 2s/step - loss: 0.2771 - val_loss: 0.5685
Epoch 61/100
35/35 [==============================] - 79s 2s/step - loss: 0.2769 - val_loss: 0.5415
Epoch 62/100
35/35 [==============================] - 79s 2s/step - loss: 0.2736 - val_loss: 0.5809
Epoch 63/100
35/35 [==============================] - 80s 2s/step - loss: 0.2710 - val_loss: 0.5657
Epoch 64/100
35/35 [==============================] - 80s 2s/step - loss: 0.2658 - val_loss: 0.5644
Epoch 65/100
35/35 [==============================] - 79s 2s/step - loss: 0.2601 - val_loss: 0.5429
Epoch 66/100
35/35 [==============================] - 79s 2s/step - loss: 0.2642 - val_loss: 0.5673
Epoch 67/100
35/35 [==============================] - 80s 2s/step - loss: 0.2594 - val_loss: 0.5776
Epoch 68/100
35/35 [==============================] - 80s 2s/step - loss: 0.2684 - val_loss: 0.5378
Epoch 69/100
35/35 [==============================] - 79s 2s/step - loss: 0.2513 - val_loss: 0.5800
Epoch 70/100
35/35 [==============================] - 79s 2s/step - loss: 0.2608 - val_loss: 0.5649
Epoch 71/100
35/35 [==============================] - 79s 2s/step - loss: 0.2575 - val_loss: 0.5743
Epoch 72/100
35/35 [==============================] - 79s 2s/step - loss: 0.2472 - val_loss: 0.5785
Epoch 73/100
35/35 [==============================] - 80s 2s/step - loss: 0.2467 - val_loss: 0.5946
Epoch 74/100
35/35 [==============================] - 80s 2s/step - loss: 0.2508 - val_loss: 0.5819
Epoch 75/100
35/35 [==============================] - 79s 2s/step - loss: 0.2391 - val_loss: 0.5895
Epoch 76/100
35/35 [==============================] - 79s 2s/step - loss: 0.2457 - val_loss: 0.6126
Epoch 77/100
35/35 [==============================] - 79s 2s/step - loss: 0.2524 - val_loss: 0.5570
Epoch 78/100
35/35 [==============================] - 79s 2s/step - loss: 0.2491 - val_loss: 0.5488
Epoch 79/100
35/35 [==============================] - 80s 2s/step - loss: 0.2363 - val_loss: 0.5979
Epoch 80/100
35/35 [==============================] - 74s 2s/step - loss: 0.2369 - val_loss: 0.5644
Epoch 81/100
35/35 [==============================] - 72s 2s/step - loss: 0.2302 - val_loss: 0.5835
Epoch 82/100
35/35 [==============================] - 80s 2s/step - loss: 0.2288 - val_loss: 0.5836
Epoch 83/100
35/35 [==============================] - 79s 2s/step - loss: 0.2361 - val_loss: 0.5878
Epoch 84/100
35/35 [==============================] - 79s 2s/step - loss: 0.2250 - val_loss: 0.5866
Epoch 85/100
35/35 [==============================] - 79s 2s/step - loss: 0.2225 - val_loss: 0.5966
Epoch 86/100
35/35 [==============================] - 79s 2s/step - loss: 0.2188 - val_loss: 0.5691
Epoch 87/100
35/35 [==============================] - 79s 2s/step - loss: 0.2305 - val_loss: 0.5506
Epoch 88/100
35/35 [==============================] - 79s 2s/step - loss: 0.2175 - val_loss: 0.5933
Epoch 89/100
35/35 [==============================] - 79s 2s/step - loss: 0.2142 - val_loss: 0.6137
Epoch 90/100
35/35 [==============================] - 81s 2s/step - loss: 0.2165 - val_loss: 0.6676
Epoch 91/100
35/35 [==============================] - 79s 2s/step - loss: 0.2192 - val_loss: 0.5979
Epoch 92/100
35/35 [==============================] - 79s 2s/step - loss: 0.2165 - val_loss: 0.6181
Epoch 93/100
35/35 [==============================] - 79s 2s/step - loss: 0.2194 - val_loss: 0.6285
Epoch 94/100
35/35 [==============================] - 71s 2s/step - loss: 0.2126 - val_loss: 0.6091
Epoch 95/100
35/35 [==============================] - 70s 2s/step - loss: 0.1987 - val_loss: 0.6469
Epoch 96/100
35/35 [==============================] - 71s 2s/step - loss: 0.2122 - val_loss: 0.6011
Epoch 97/100
35/35 [==============================] - 71s 2s/step - loss: 0.2060 - val_loss: 0.6016
Epoch 98/100
35/35 [==============================] - 71s 2s/step - loss: 0.2118 - val_loss: 0.6209
Epoch 99/100
35/35 [==============================] - 79s 2s/step - loss: 0.2053 - val_loss: 0.6138
Epoch 100/100
35/35 [==============================] - 79s 2s/step - loss: 0.2063 - val_loss: 0.6295
Out[77]:
<tensorflow.python.keras.callbacks.History at 0x1f39510ddc8>
In [ ]:
# A simple LSTM with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(300, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the model with early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, 
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])
 

We're getting very close now. Let's try stacking two GRU layers.

In [ ]:
# GRU with glove embeddings and two dense layers

model = Sequential()
model.add(Embedding(len(word_index) + 1, 300,
                   weights=[embedding_matrix],
                   input_length=max_len,
                   trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(GRU(300, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))
model.add(GRU(300, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0,
                         mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100,
         verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])
 
 
I didn't run this one myself, but it should do noticeably better than what we had before, and continued tuning would keep improving the score. Stemming and lemmatization are also well worth trying (I won't do that here). To get a top score on Kaggle you generally need an ensemble of models, so it's worth looking into ensembling (a minimal blending sketch follows below); this [site](https://github.com/abhishekkrthakur/pysembler) is a good reference.
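As a minimal illustration of blending, and only a hedged sketch rather than a proper ensemble, we could average the predicted probabilities of two of the earlier models (logistic regression on tf-idf and the grid-searched multinomial naive bayes on counts) and score the blend with the same logloss helper:

# Refit two of the earlier models and average their class probabilities.
clf_lr = LogisticRegression(C=1.0, max_iter=1000)
clf_lr.fit(xtrain_tfv, ytrain)

clf_nb = MultinomialNB(alpha=0.1)   # alpha taken from the grid search result above
clf_nb.fit(xtrain_ctv, ytrain)

blend = 0.5 * clf_lr.predict_proba(xvalid_tfv) + 0.5 * clf_nb.predict_proba(xvalid_ctv)
print("blended logloss: %0.3f" % multiclass_logloss(yvalid, blend))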

 


Go to the source code on GitHub