
[kaggle][transcription] New York City Taxi Duration (3)

bisi 2020. 10. 5. 11:17

 

This transcription covers the New York City Taxi Duration competition.

The goal of the competition is to build a model that predicts the total ride duration of taxi trips in New York City.

I worked through the transcription following AiswaryaRamachandran's kernel.

 

Contents

 

New York City Taxi Duration (1)


1. Preparing for Data Analysis

  1) data description

 

2. Exploring the Data

  1) Finding missing data

  2) Creating new columns for analysis

 

New York City Taxi Duration (2)


3. Exploratory Data Analysis

1) HeatMap

2) Hour and day of week

3) Distance, neighbourhood, and speed

 

New York City Taxi Duration (3)


4. Feature Engineering

5. Applying Models

  1) Building the model

  2) Applying a linear model

  3) Applying random forest

 

4. Feature Engineering

Before applying the models, we run the same feature engineering on the test data.

In [56]:
test['pickup_datetime'] = pd.to_datetime(test['pickup_datetime'], format='%Y-%m-%d %H:%M:%S')
test['pickup_date'] = test['pickup_datetime'].dt.date
test['pickup_day'] = test['pickup_datetime'].apply(lambda x:x.day)
test['pickup_hour'] = test['pickup_datetime'].apply(lambda x:x.hour)
test['pickup_day_of_week'] = test['pickup_datetime'].apply(lambda x:calendar.day_name[x.weekday()])

test['pickup_latitude_round3']=test['pickup_latitude'].apply(lambda x:round(x,3))
test['pickup_longitude_round3']=test['pickup_longitude'].apply(lambda x:round(x,3))
test['dropoff_latitude_round3']=test['dropoff_latitude'].apply(lambda x:round(x,3))
test['dropoff_longitude_round3']=test['dropoff_longitude'].apply(lambda x:round(x,3))
test['trip_distance'] = test.apply(lambda row : calculateDistance(row), axis=1)

test['bearing']=test.apply(lambda row:calculateBearing(row['pickup_latitude_round3'], 
                                                       row['pickup_longitude_round3'], row['dropoff_latitude_round3'], row['dropoff_longitude_round3']), axis=1)
test.loc[:,'pickup_neighbourhood'] = kmeans.predict(test[['pickup_latitude', 'pickup_longitude']])
test.loc[:,'dropoff_neighbourhood'] = kmeans.predict(test[['dropoff_latitude', 'dropoff_longitude']])
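The `calculateDistance` and `calculateBearing` helpers were defined in part (1) of this series. As a reference, a minimal sketch of what such helpers typically compute (haversine great-circle distance in km and initial compass bearing in degrees) might look like the following; the exact implementation in the original kernel may differ slightly, e.g. in the Earth radius used:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometres.
    R = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def initial_bearing_deg(lat1, lon1, lat2, lon2):
    # Initial compass bearing from point 1 to point 2, in (-180, 180] degrees.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x))

# One degree of longitude at the equator is about 111.19 km, due east (bearing 90).
print(haversine_km(0, 0, 0, 1), initial_bearing_deg(0, 0, 0, 1))
```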
In [57]:
drop_cols=['avg_speed_kph','trip_duration_in_hour','dropoff_date','dropoff_day','dropoff_hour','dropoff_day_of_week','dropoff_datetime','pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude']
training = train.drop(drop_cols, axis=1)
testing=test.drop(['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude'], axis=1)
In [58]:
training
Out[58]:
  id vendor_id pickup_datetime passenger_count store_and_fwd_flag trip_duration pickup_date pickup_day pickup_hour pickup_day_of_week pickup_latitude_round3 pickup_longitude_round3 dropoff_latitude_round3 dropoff_longitude_round3 trip_distance bearing pickup_neighbourhood dropoff_neighbourhood
0 id2875421 2 2016-03-14 17:24:55 1 N 455 2016-03-14 14 17 Monday 40.768 -73.982 40.766 -73.965 1.499697 98.823984 0 6
1 id2377394 1 2016-06-12 00:43:35 1 N 663 2016-06-12 12 0 Sunday 40.739 -73.980 40.731 -73.999 1.806924 -119.053505 0 3
2 id3858529 2 2016-01-19 11:35:24 1 N 2124 2016-01-19 19 11 Tuesday 40.764 -73.979 40.710 -74.005 6.390110 -159.948291 0 3
3 id3504673 2 2016-04-06 19:32:31 1 N 429 2016-04-06 6 19 Wednesday 40.720 -74.010 40.707 -74.012 1.486664 -173.347990 3 3
4 id2181028 2 2016-03-26 13:30:55 1 N 435 2016-03-26 26 13 Saturday 40.793 -73.973 40.783 -73.973 1.189521 180.000000 6 6
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1458639 id2376096 2 2016-04-08 13:31:04 4 N 778 2016-04-08 8 13 Friday 40.746 -73.982 40.740 -73.995 1.226042 -121.344499 0 3
1458640 id1049543 1 2016-01-10 07:35:15 1 N 655 2016-01-10 10 7 Sunday 40.747 -74.001 40.797 -73.970 6.054584 25.141571 0 6
1458641 id2304944 2 2016-04-22 06:57:41 1 N 764 2016-04-22 22 6 Friday 40.769 -73.959 40.707 -74.004 7.830747 -151.176950 6 3
1458642 id2714485 1 2016-01-05 15:56:26 1 N 373 2016-01-05 5 15 Tuesday 40.749 -73.982 40.757 -73.975 1.093421 33.535703 0 0
1458643 id1209952 1 2016-04-05 14:44:25 1 N 198 2016-04-05 5 14 Tuesday 40.782 -73.980 40.791 -73.973 1.134932 30.491275 6 6

1458644 rows × 18 columns

In [59]:
training['log_trip_duration'] = training['trip_duration'].apply(lambda x : np.log(x))
training.drop(['trip_duration'], axis=1, inplace=True)
In [60]:
print("Training Data Shape", training.shape)
print("Testing Data Shape", testing.shape)
 
Training Data Shape (1458644, 18)
Testing Data Shape (625134, 17)
 

Let's encode the day of the week as a number.

In [61]:
def encodeDay(day_of_week):
    day_dict = {'Sunday' :0, 'Monday':1, 'Tuesday':2 ,'Wednesday':3, 'Thursday':4, 'Friday':5, 'Saturday':6}
    return day_dict[day_of_week]
In [62]:
training['pickup_day_of_week'] = training['pickup_day_of_week'].apply(lambda x:encodeDay(x))
testing['pickup_day_of_week'] = testing['pickup_day_of_week'].apply(lambda x:encodeDay(x))
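The same encoding can also be done with the vectorised `Series.map`, which is typically faster than a row-wise `apply` on 1.4M rows. A small sketch with the same dictionary (note this mapping starts the week at Sunday=0, unlike pandas' `dt.dayofweek`, which uses Monday=0):

```python
import pandas as pd

day_dict = {'Sunday': 0, 'Monday': 1, 'Tuesday': 2, 'Wednesday': 3,
            'Thursday': 4, 'Friday': 5, 'Saturday': 6}

s = pd.Series(['Monday', 'Sunday', 'Saturday'])
encoded = s.map(day_dict)  # vectorised dictionary lookup
print(encoded.tolist())  # [1, 0, 6]
```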
In [63]:
training.to_csv("input_training.csv", index=False)
testing.to_csv("input_testing.csv", index=False)

del training
del testing
del train
del test
In [65]:
def LabelEncoding(train_df, test_df, max_levels=2):
    # Label-encode only low-cardinality columns; with max_levels=2 this
    # targets binary columns such as store_and_fwd_flag ('N'/'Y').
    for col in train_df:
        if len(list(train_df[col].unique())) <= max_levels:
            le = preprocessing.LabelEncoder()
            le.fit(train_df[col])  # fit on training values only
            train_df[col] = le.transform(train_df[col])
            test_df[col] = le.transform(test_df[col])
    return [train_df, test_df]

def readInputAndEncode(input_path, train_file, test_file, target_column):
    training = pd.read_csv(input_path + train_file)
    testing = pd.read_csv(input_path + test_file)
    
    training, testing = LabelEncoding(training, testing)
    
    print("Training Data Shape after Encoding ", training.shape)
    print("Testing Data Shape after Encoding ", testing.shape)
    
    # Align columns: any column present in one frame but not the other
    # (except the target) is added with zeros so the feature sets match.
    train_cols = training.columns.tolist()
    test_cols = testing.columns.tolist()
    col_in_train_not_test = set(train_cols) - set(test_cols)
    for col in col_in_train_not_test:
        if col != target_column:
            testing[col] = 0
    col_in_test_not_train = set(test_cols) - set(train_cols)
    for col in col_in_test_not_train:
        training[col] = 0
    print("---------------")
    print("Training Data Shape after Processing ", training.shape)
    print("Testing Data Shape after Processing ", testing.shape)
    
    return [training, testing]
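To see what the binary label encoding does in isolation, here is a small standalone sketch with a toy `store_and_fwd_flag`-style column (the column values here are made up for illustration). `LabelEncoder` assigns integers in sorted order of the classes, so 'N' becomes 0 and 'Y' becomes 1:

```python
import pandas as pd
from sklearn import preprocessing

train_df = pd.DataFrame({'store_and_fwd_flag': ['N', 'Y', 'N', 'N']})
test_df = pd.DataFrame({'store_and_fwd_flag': ['Y', 'N']})

le = preprocessing.LabelEncoder()
le.fit(train_df['store_and_fwd_flag'])          # classes_ = ['N', 'Y']
train_df['store_and_fwd_flag'] = le.transform(train_df['store_and_fwd_flag'])
test_df['store_and_fwd_flag'] = le.transform(test_df['store_and_fwd_flag'])

print(train_df['store_and_fwd_flag'].tolist())  # [0, 1, 0, 0]
print(test_df['store_and_fwd_flag'].tolist())   # [1, 0]
```

Note that `transform` on the test set will raise an error if the test data contains a label never seen in training, which is why the encoder is fit on the training column.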
In [66]:
train, test = readInputAndEncode("", 'input_training.csv', 'input_testing.csv', 'log_trip_duration')
train.drop(['pickup_date'], axis=1, inplace=True)
test.drop(['pickup_date'], axis=1, inplace=True)
train.drop(['pickup_datetime'], axis=1, inplace=True)
test.drop(['pickup_datetime'], axis=1, inplace=True)
test_id = test['id']
train.drop(['id'], axis=1, inplace=True)
test.drop(['id'], axis=1, inplace=True)
 
Training Data Shape after Encoding  (1458644, 18)
Testing Data Shape after Encoding  (625134, 17)
---------------
Training Data Shape after Processing  (1458644, 18)
Testing Data Shape after Processing  (625134, 17)
 

5. Applying Models

 

1) Building the model

In [72]:
def GetFeatureAndSplit(train, test, target, imputing_strategy='median', split=0.25, imputation=True):
    # Separate the target, convert frames to arrays, optionally impute
    # missing values, and split off a validation set.
    labels = np.array(train[target])
    training = train.drop(target, axis=1)
    training = np.array(training)
    testing = np.array(test)
    
    if imputation == True:
        # Fit the imputer on the training data only, then apply to both sets.
        imputer = SimpleImputer(strategy=imputing_strategy, missing_values=np.nan)
        imputer.fit(training)
        training = imputer.transform(training)
        testing = imputer.transform(testing)
    
    train_features, validation_features, train_labels, validation_labels = train_test_split(
        training, labels, test_size=split, random_state=42)
    return [train_features, validation_features, train_labels, validation_labels, testing]
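The key detail above is that the imputer learns its fill values from the training data alone and reuses them on the test data. A tiny standalone sketch of that behaviour (toy arrays, not the taxi data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [7.0, 6.0]])
X_test = np.array([[np.nan, 8.0]])

imputer = SimpleImputer(strategy='median', missing_values=np.nan)
imputer.fit(X_train)                 # learns per-column medians from training data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)   # test NaNs filled with *training* medians

print(X_test)  # [[4. 8.]] - column 0 median of [1, 7] is 4
```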
In [76]:
train_features, validation_features, train_labels, validation_labels, testing=GetFeatureAndSplit(train, test, 'log_trip_duration', imputation=True)
 

2) Applying a linear model

In [77]:
lm = linear_model.LinearRegression()
lm.fit(train_features, train_labels)
Out[77]:
LinearRegression()
In [79]:
valid_pred = lm.predict(validation_features)
In [80]:
mse = mean_squared_error(validation_labels, valid_pred)
print("Mean Squared Error for Linear Regression (log scale)", mse)
 
Mean Squared Error for Linear Regression (log scale) 0.4003474301960613

(Note: `mean_squared_error` returns the MSE by default; take `np.sqrt` of it if you want the RMSE.)
In [81]:
test_pred = lm.predict(testing)
submit= pd.DataFrame()
submit['id'] = test_id
submit['trip_duration'] = np.exp(test_pred)
submit.to_csv("submission_linear_regression_baseline.csv", index = False)
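Since the model was trained on `log(trip_duration)`, predictions must be mapped back with `np.exp` before submission. A quick sanity check of that round trip on a few toy duration values:

```python
import numpy as np

durations = np.array([455.0, 663.0, 2124.0])  # seconds, like trip_duration
log_durations = np.log(durations)             # target used for training
recovered = np.exp(log_durations)             # inverse transform for submission

print(np.allclose(recovered, durations))  # True
```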
 

3) Applying random forest

In [82]:
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(train_features, train_labels)
Out[82]:
RandomForestRegressor(random_state=42)
In [83]:
valid_pred_rf = rf.predict(validation_features)
mse = mean_squared_error(validation_labels, valid_pred_rf)
print("Mean Squared Error for Random Forest", mse)
 
Mean Squared Error for Random Forest 0.16593325812733248
In [84]:
test_pred=rf.predict(testing)
submit = pd.DataFrame()
submit['id'] = test_id
submit['trip_duration'] = np.exp(test_pred)
submit.to_csv("submission_random_forest_baseline.csv", index=False)
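A fitted `RandomForestRegressor` also exposes `feature_importances_`, which is a natural next step for seeing which engineered features (e.g. `trip_distance`, `pickup_hour`, `bearing`) drive the predictions. A self-contained sketch on synthetic data, where the feature names are hypothetical labels for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(42)
X = rng.rand(200, 3)
# y depends almost entirely on feature 0, so it should dominate the importances
y = 10 * X[:, 0] + 0.1 * rng.rand(200)

rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X, y)

# Importances are non-negative and sum to 1
for name, imp in zip(['trip_distance', 'pickup_hour', 'bearing'], rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```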

 


Go to the source code on GitHub