The topic of this transcription is New York City Taxi Duration.
The goal of this competition is to build a model that predicts the total ride duration of taxi trips in New York City.
I transcribed the notebook by following AiswaryaRamachandran's kernel.
Contents
New York City Taxi Duration (1)
1. Preparing the data analysis
1) Data description
2. Exploring the data
1) Finding missing data
2) Creating new columns for analysis
New York City Taxi Duration (2)
3. Exploratory Data Analysis
1) HeatMap
2) Hour and day of week
3) Distance, neighbourhood, and speed
New York City Taxi Duration (3)
4. Feature Engineering
5. Applying the Models
1) Setting up the models
2) Linear regression
3) Random forest
4. Feature Engineering
Before applying the models, the same feature engineering is applied to the test data.
In [56]:
test['pickup_datetime'] = pd.to_datetime(test['pickup_datetime'], format='%Y-%m-%d %H:%M:%S')
test['pickup_date'] = test['pickup_datetime'].dt.date
test['pickup_day'] = test['pickup_datetime'].apply(lambda x:x.day)
test['pickup_hour'] = test['pickup_datetime'].apply(lambda x:x.hour)
test['pickup_day_of_week'] = test['pickup_datetime'].apply(lambda x:calendar.day_name[x.weekday()])
test['pickup_latitude_round3']=test['pickup_latitude'].apply(lambda x:round(x,3))
test['pickup_longitude_round3']=test['pickup_longitude'].apply(lambda x:round(x,3))
test['dropoff_latitude_round3']=test['dropoff_latitude'].apply(lambda x:round(x,3))
test['dropoff_longitude_round3']=test['dropoff_longitude'].apply(lambda x:round(x,3))
test['trip_distance'] = test.apply(lambda row : calculateDistance(row), axis=1)
test['bearing']=test.apply(lambda row:calculateBearing(row['pickup_latitude_round3'],
row['pickup_longitude_round3'], row['dropoff_latitude_round3'], row['dropoff_longitude_round3']), axis=1)
test.loc[:,'pickup_neighbourhood'] = kmeans.predict(test[['pickup_latitude', 'pickup_longitude']])
test.loc[:,'dropoff_neighbourhood'] = kmeans.predict(test[['dropoff_latitude', 'dropoff_longitude']])
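The cell above reuses calculateDistance and calculateBearing, which were defined in part (2) of this series. For reference, here is a minimal sketch of what such helpers typically look like, assuming the haversine great-circle distance and the standard initial-bearing formula (the original kernel's implementation may differ in details):

import numpy as np

def calculateDistance(row):
    # Haversine distance (km) between the rounded pickup and dropoff coordinates
    R = 6371.0  # Earth radius in km
    lat1, lon1 = np.radians(row['pickup_latitude_round3']), np.radians(row['pickup_longitude_round3'])
    lat2, lon2 = np.radians(row['dropoff_latitude_round3']), np.radians(row['dropoff_longitude_round3'])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

def calculateBearing(lat1, lon1, lat2, lon2):
    # Initial bearing (degrees) from the pickup point toward the dropoff point
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlon = lon2 - lon1
    x = np.sin(dlon) * np.cos(lat2)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon)
    return np.degrees(np.arctan2(x, y))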
In [57]:
# Drop columns that are derived from the target or unavailable at prediction time
# (speed, duration in hours, and the dropoff time features), plus the raw coordinates;
# the rounded coordinates and cluster labels are kept as features.
drop_cols = ['avg_speed_kph', 'trip_duration_in_hour', 'dropoff_date', 'dropoff_day', 'dropoff_hour',
             'dropoff_day_of_week', 'dropoff_datetime', 'pickup_latitude', 'pickup_longitude',
             'dropoff_latitude', 'dropoff_longitude']
training = train.drop(drop_cols, axis=1)
# The test set only has the raw coordinates to drop
testing = test.drop(['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'], axis=1)
In [58]:
training
Out[58]:
In [59]:
# The target is heavily right-skewed and the competition metric is RMSLE,
# so the model is trained on the log of the trip duration.
training['log_trip_duration'] = training['trip_duration'].apply(lambda x: np.log(x))
training.drop(['trip_duration'], axis=1, inplace=True)
In [60]:
print("Training Data Shape", training.shape)
print("Testing Data Shape", testing.shape)
Let's encode the day of the week as a number.
In [61]:
def encodeDay(day_of_week):
    day_dict = {'Sunday': 0, 'Monday': 1, 'Tuesday': 2, 'Wednesday': 3, 'Thursday': 4, 'Friday': 5, 'Saturday': 6}
    return day_dict[day_of_week]
In [62]:
training['pickup_day_of_week'] = training['pickup_day_of_week'].apply(lambda x:encodeDay(x))
testing['pickup_day_of_week'] = testing['pickup_day_of_week'].apply(lambda x:encodeDay(x))
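As an aside, pandas can produce the same Sunday=0 encoding directly from the datetime column without going through day names. A minimal sketch, assuming pickup_datetime is still a parsed datetime column (the pickup_day_of_week_num name is only for illustration):

# dt.dayofweek returns Monday=0 ... Sunday=6; shifting by one modulo 7
# reproduces the Sunday=0 convention used by encodeDay above.
training['pickup_day_of_week_num'] = (training['pickup_datetime'].dt.dayofweek + 1) % 7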
In [63]:
training.to_csv("input_training.csv", index=False)
testing.to_csv("input_testing.csv", index=False)
del training
del testing
del train
del test
In [65]:
def LabelEncoding(train_df, test_df, max_levels=2):
    # Label-encode columns with at most max_levels distinct values
    # (in this dataset, e.g. store_and_fwd_flag)
    for col in train_df:
        if len(list(train_df[col].unique())) <= max_levels:
            le = preprocessing.LabelEncoder()
            le.fit(train_df[col])
            train_df[col] = le.transform(train_df[col])
            test_df[col] = le.transform(test_df[col])
    return [train_df, test_df]

def readInputAndEncode(input_path, train_file, test_file, target_column):
    training = pd.read_csv(input_path + train_file)
    testing = pd.read_csv(input_path + test_file)
    training, testing = LabelEncoding(training, testing)
    print("Training Data Shape after Encoding ", training.shape)
    print("Testing Data Shape after Encoding ", testing.shape)
    # Align columns: any column present in only one of the two frames
    # (other than the target) is added to the other and filled with 0
    train_cols = training.columns.tolist()
    test_cols = testing.columns.tolist()
    col_in_train_not_test = set(train_cols) - set(test_cols)
    for col in col_in_train_not_test:
        if col != target_column:
            testing[col] = 0
    col_in_test_not_train = set(test_cols) - set(train_cols)
    for col in col_in_test_not_train:
        training[col] = 0
    print("---------------")
    print("Training Data Shape after Processing ", training.shape)
    print("Testing Data Shape after Processing ", testing.shape)
    return [training, testing]
In [66]:
train, test = readInputAndEncode("", 'input_training.csv', 'input_testing.csv', 'log_trip_duration')
train.drop(['pickup_date'], axis=1, inplace=True)
test.drop(['pickup_date'], axis=1, inplace=True)
train.drop(['pickup_datetime'], axis=1, inplace=True)
test.drop(['pickup_datetime'], axis=1, inplace=True)
test_id = test['id']
train.drop(['id'], axis=1, inplace=True)
test.drop(['id'], axis=1, inplace=True)
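As a quick sanity check (not in the original kernel), one can verify that after readInputAndEncode and the drops above, the train and test frames differ only by the target column:

# train should hold every test column plus the log target, and nothing else
assert set(train.columns) - set(test.columns) == {'log_trip_duration'}
assert set(test.columns) - set(train.columns) == set()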
5. Applying the Models
1) Setting up the models
In [72]:
def GetFeatureAndSplit(train, test, target, imputing_strategy='median', split=0.25, imputation=True):
    labels = np.array(train[target])
    training = train.drop(target, axis=1)
    training = np.array(training)
    testing = np.array(test)
    if imputation == True:
        # Fill any remaining missing values with the column median (or the chosen strategy)
        imputer = SimpleImputer(strategy=imputing_strategy, missing_values=np.nan)
        imputer.fit(training)
        training = imputer.transform(training)
        testing = imputer.transform(testing)
    train_features, validation_features, train_labels, validation_labels = train_test_split(
        training, labels, test_size=split, random_state=42)
    return [train_features, validation_features, train_labels, validation_labels, testing]
In [76]:
train_features, validation_features, train_labels, validation_labels, testing=GetFeatureAndSplit(train, test, 'log_trip_duration', imputation=True)
2) Linear regression
In [77]:
lm = linear_model.LinearRegression()
lm.fit(train_features, train_labels)
Out[77]:
In [79]:
valid_pred = lm.predict(validation_features)
In [80]:
rmse = np.sqrt(mean_squared_error(validation_labels, valid_pred))
print("Root Mean Squared Error for Linear Regression (log scale)", rmse)
In [81]:
test_pred = lm.predict(testing)
submit = pd.DataFrame()
submit['id'] = test_id
# Convert predictions back from the log scale to seconds
submit['trip_duration'] = np.exp(test_pred)
submit.to_csv("submission_linear_regression_baseline.csv", index=False)
3) Random forest
In [82]:
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(train_features, train_labels)
Out[82]:
In [83]:
valid_pred_rf = rf.predict(validation_features)
rmse = np.sqrt(mean_squared_error(validation_labels, valid_pred_rf))
print("Root Mean Squared Error for Random Forest (log scale)", rmse)
In [84]:
test_pred=rf.predict(testing)
submit = pd.DataFrame()
submit['id'] = test_id
submit['trip_duration'] = np.exp(test_pred)
submit.to_csv("submission_random_forest_baseline.csv", index=False)