
[kaggle][Transcription] New York City Taxi Duration (1)

bisi 2020. 10. 2. 11:14

 

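The topic of this transcription exercise is New York City Taxi Duration.

 

The goal of this competition is to build a model that predicts the total ride duration of taxi trips in New York City.

 

I worked through this exercise by following AiswaryaRamachandran's kernel.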

 

 

 

Contents

 

New York City Taxi Duration (1)


1. Preparing for Data Analysis

  1) data description

 

2. Exploring the Data

  1) Finding missing data

  2) Creating new columns for analysis

 

New York City Taxi Duration (2)


3. Exploratory Data Analysis

1) HeatMap

2) Time and day of week

3) Distance, region, and speed

 

New York City Taxi Duration (3)


4. Feature Engineering

5. Applying Models

  1) Building the model

  2) Applying a linear model

  3) Applying a random forest

 

 

 


 

 

 

 

 

 

 

 

 

 

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
from datetime import datetime
import calendar
from math import sin, cos, sqrt, atan2, radians
from folium import FeatureGroup, LayerControl, Map, Marker
from folium.plugins import HeatMap
import matplotlib.dates as mdates
import matplotlib as mpl
from datetime import timedelta
import datetime as dt
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0
plt.style.use('fivethirtyeight')
import folium
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
#from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import pickle
 

1. Preparing for Data Analysis

In [2]:
# Load the data
train = pd.read_csv("../data/nyc-taxi-trip-duration/train.csv")
test = pd.read_csv("../data/nyc-taxi-trip-duration/test.csv")
In [3]:
print('train shape : ', train.shape, 'test shape : ', test.shape)
 
train shape :  (1458644, 11) test shape :  (625134, 9)
In [4]:
train.head()
Out[4]:
  id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N 455
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N 663
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N 2124
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N 429
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N 435
 

data description

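  • id - a unique identifier for each trip
  • vendor_id - a code indicating the provider associated with the trip record
  • pickup_datetime - the date and time when the meter was engaged
  • dropoff_datetime - the date and time when the meter was disengaged
  • passenger_count - the number of passengers in the vehicle (a driver-entered value)
  • pickup_longitude - the longitude where the meter was engaged
  • pickup_latitude - the latitude where the meter was engaged
  • dropoff_longitude - the longitude where the meter was disengaged
  • dropoff_latitude - the latitude where the meter was disengaged
  • store_and_fwd_flag - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle had no connection to the server
    • Y = store and forward; N = not a store and forward trip
  • trip_duration - the duration of the trip in seconds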
In [5]:
train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'], format='%Y-%m-%d %H:%M:%S')
train['dropoff_datetime'] = pd.to_datetime(train['dropoff_datetime'], format='%Y-%m-%d %H:%M:%S')
train.head()
Out[5]:
  id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N 455
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N 663
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N 2124
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N 429
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N 435
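
As a quick sanity check on the parsed timestamps, the sketch below rebuilds the first two rows shown above in a small DataFrame and confirms that the dropoff-minus-pickup difference in seconds matches trip_duration. The name `sample` is made up for illustration, and the check only covers these two rows; it does not prove the columns agree for the whole dataset.

```python
import pandas as pd

# Hypothetical two-row sample copied from the first rows of train.head()
sample = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"],
    "dropoff_datetime": ["2016-03-14 17:32:30", "2016-06-12 00:54:38"],
    "trip_duration": [455, 663],
})

# Parse the strings with the same explicit format used in the post
fmt = "%Y-%m-%d %H:%M:%S"
for col in ["pickup_datetime", "dropoff_datetime"]:
    sample[col] = pd.to_datetime(sample[col], format=fmt)

# Subtracting two datetime columns gives a Timedelta column; convert it to seconds
elapsed = (sample["dropoff_datetime"] - sample["pickup_datetime"]).dt.total_seconds()

print(elapsed.tolist())
print((elapsed == sample["trip_duration"]).all())
```

Once the columns are datetime64, arithmetic such as this subtraction and the .dt accessor become available, which is what the later feature-creation cells rely on.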
 

2. Exploring the Data

 

1) Finding missing data

In [6]:
train.isnull().sum()  # number of missing values per column
### time span of the dataset
print("Min pickup time:", min(train['pickup_datetime']))
print("Max pickup time:", max(train['pickup_datetime']))
 
Min pickup time: 2016-01-01 00:00:17
Max pickup time: 2016-06-30 23:59:39
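
Missing values are usually counted per column with DataFrame.isnull().sum(). The sketch below shows the idea on a made-up frame (`toy` and its values are invented for illustration and are not taken from the competition data).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few deliberately missing entries
toy = pd.DataFrame({
    "passenger_count": [1, np.nan, 2],
    "store_and_fwd_flag": ["N", "Y", None],
    "trip_duration": [455, 663, 2124],
})

# isnull() yields a boolean frame; summing counts the True values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Total number of missing cells across the whole frame
total_missing = int(missing_per_column.sum())
print(total_missing)
```

In a notebook, the per-column counts are displayed only if the expression is the last line of the cell or is wrapped in print().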
 

2) Creating new columns for analysis

 

Create date, day, hour, and day-of-week columns from the pickup and dropoff times

In [7]:
train['pickup_date'] = train['pickup_datetime'].dt.date
train['pickup_day'] = train['pickup_datetime'].apply(lambda x:x.day)
train['pickup_hour'] = train['pickup_datetime'].apply(lambda x:x.hour)
train['pickup_day_of_week'] = train['pickup_datetime'].apply(lambda x:calendar.day_name[x.weekday()])

train['dropoff_date'] = train['dropoff_datetime'].dt.date
train['dropoff_day'] = train['dropoff_datetime'].apply(lambda x:x.day)
train['dropoff_hour'] = train['dropoff_datetime'].apply(lambda x:x.hour)
train['dropoff_day_of_week'] = train['dropoff_datetime'].apply(lambda x:calendar.day_name[x.weekday()])
In [8]:
train.head()
Out[8]:
  id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration pickup_date pickup_day pickup_hour pickup_day_of_week dropoff_date dropoff_day dropoff_hour dropoff_day_of_week
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N 455 2016-03-14 14 17 Monday 2016-03-14 14 17 Monday
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N 663 2016-06-12 12 0 Sunday 2016-06-12 12 0 Sunday
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N 2124 2016-01-19 19 11 Tuesday 2016-01-19 19 12 Tuesday
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N 429 2016-04-06 6 19 Wednesday 2016-04-06 6 19 Wednesday
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N 435 2016-03-26 26 13 Saturday 2016-03-26 26 13 Saturday
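
The same columns can also be derived with the vectorized .dt accessor instead of a row-wise apply. The following is a minimal sketch on three timestamps taken from the rows above; the names `ts`, `features`, and `via_apply` are made up for illustration, and this is an alternative formulation rather than the code used in the kernel.

```python
import calendar
import pandas as pd

# Three pickup timestamps copied from the sample rows
ts = pd.Series(pd.to_datetime([
    "2016-03-14 17:24:55",
    "2016-06-12 00:43:35",
    "2016-01-19 11:35:24",
]))

# Vectorized extraction through the .dt accessor
features = pd.DataFrame({
    "pickup_date": ts.dt.date,
    "pickup_day": ts.dt.day,
    "pickup_hour": ts.dt.hour,
    "pickup_day_of_week": ts.dt.day_name(),
})

# apply-based version, as in the cell above, for comparison
via_apply = ts.apply(lambda x: calendar.day_name[x.weekday()])

print(features)
print((features["pickup_day_of_week"] == via_apply).all())
```

On a frame with about 1.46 million rows, the accessor form typically runs much faster than apply with a Python lambda, though the resulting values are the same.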
In [9]:
## Round the latitude and longitude variables to 3 decimal places
train['pickup_latitude_round3']=train['pickup_latitude'].apply(lambda x:round(x,3))
train['pickup_longitude_round3']=train['pickup_longitude'].apply(lambda x:round(x,3))
train['dropoff_latitude_round3']=train['dropoff_latitude'].apply(lambda x:round(x,3))
train['dropoff_longitude_round3']=train['dropoff_longitude'].apply(lambda x:round(x,3))
In [10]:
train.head()
Out[10]:
  id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag ... pickup_hour pickup_day_of_week dropoff_date dropoff_day dropoff_hour dropoff_day_of_week pickup_latitude_round3 pickup_longitude_round3 dropoff_latitude_round3 dropoff_longitude_round3
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N ... 17 Monday 2016-03-14 14 17 Monday 40.768 -73.982 40.766 -73.965
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N ... 0 Sunday 2016-06-12 12 0 Sunday 40.739 -73.980 40.731 -73.999
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N ... 11 Tuesday 2016-01-19 19 12 Tuesday 40.764 -73.979 40.710 -74.005
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N ... 19 Wednesday 2016-04-06 6 19 Wednesday 40.720 -74.010 40.707 -74.012
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N ... 13 Saturday 2016-03-26 26 13 Saturday 40.793 -73.973 40.783 -73.973

5 rows × 23 columns
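
Rounding coordinates to three decimals effectively snaps each point to a coarse grid. The sketch below uses Series.round as a vectorized equivalent of the apply calls and estimates the size of one 0.001 degree step. The mean Earth radius of 6371 km and the reference latitude of 40.75 degrees are assumptions chosen for illustration.

```python
import math
import pandas as pd

# Latitudes copied from the first three sample rows
lat = pd.Series([40.767937, 40.738564, 40.763939])

# Vectorized rounding to 3 decimal places
lat_round3 = lat.round(3)
print(lat_round3.tolist())

# Approximate size of a 0.001 degree step (assumes a mean Earth radius of 6371 km)
earth_radius_km = 6371.0
step_north_south_km = earth_radius_km * math.radians(0.001)

# East-west steps shrink with cos(latitude); 40.75 degrees is an assumed NYC latitude
step_east_west_km = step_north_south_km * math.cos(math.radians(40.75))

print(round(step_north_south_km, 3), round(step_east_west_km, 3))
```

Under these assumptions each grid cell is on the order of a hundred meters across, so nearby pickups and dropoffs fall into the same rounded location.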

 

Compute the trip distance in km from latitude and longitude (haversine great-circle distance)

In [11]:
def calculateDistance(row):
    R=6376.0 # approximate Earth radius (km); the mean radius is about 6371 km
    pickup_lat = radians(row['pickup_latitude'])
    pickup_lon = radians(row['pickup_longitude'])
    dropoff_lat = radians(row['dropoff_latitude'])
    dropoff_lon = radians(row['dropoff_longitude'])
    dlon = dropoff_lon -pickup_lon
    dlat = dropoff_lat -pickup_lat
    
    a=sin(dlat /2) **2 +cos(pickup_lat)*cos(dropoff_lat)*sin(dlon/2)**2
    c=2*atan2(sqrt(a), sqrt(1-a))
    distance=R*c
    return distance
    
In [12]:
train['trip_distance'] = train.apply(lambda row:calculateDistance(row), axis=1)
train.head()
Out[12]:
  id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag ... pickup_day_of_week dropoff_date dropoff_day dropoff_hour dropoff_day_of_week pickup_latitude_round3 pickup_longitude_round3 dropoff_latitude_round3 dropoff_longitude_round3 trip_distance
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N ... Monday 2016-03-14 14 17 Monday 40.768 -73.982 40.766 -73.965 1.499697
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N ... Sunday 2016-06-12 12 0 Sunday 40.739 -73.980 40.731 -73.999 1.806924
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N ... Tuesday 2016-01-19 19 12 Tuesday 40.764 -73.979 40.710 -74.005 6.390110
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N ... Wednesday 2016-04-06 6 19 Wednesday 40.720 -74.010 40.707 -74.012 1.486664
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N ... Saturday 2016-03-26 26 13 Saturday 40.793 -73.973 40.783 -73.973 1.189521

5 rows × 24 columns
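
The row-wise apply above calls a Python function once per trip. As a hedged alternative, the haversine formula can be written with NumPy so that whole columns are processed at once. The function name `haversine_km` and its `radius_km` parameter are made up for this sketch; the default radius of 6376.0 km is chosen only to match the value used in calculateDistance above.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6376.0):
    """Great-circle distance in km; inputs are degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine term; 2*arcsin(sqrt(a)) is equivalent to 2*atan2(sqrt(a), sqrt(1-a))
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Coordinates copied from the first two sample rows
pickup_lat = np.array([40.767937, 40.738564])
pickup_lon = np.array([-73.982155, -73.980415])
dropoff_lat = np.array([40.765602, 40.731152])
dropoff_lon = np.array([-73.964630, -73.999481])

dist = haversine_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
print(dist)
```

On the full frame this could be called with the four coordinate columns converted via .to_numpy(). Passing radius_km=6371.0 would use the mean Earth radius instead, which shrinks every distance by roughly 0.08%.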

In [13]:
train['trip_duration_in_hour']=train['trip_duration'].apply(lambda x:x/3600)
train.head()
Out[13]:
  id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag ... dropoff_date dropoff_day dropoff_hour dropoff_day_of_week pickup_latitude_round3 pickup_longitude_round3 dropoff_latitude_round3 dropoff_longitude_round3 trip_distance trip_duration_in_hour
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N ... 2016-03-14 14 17 Monday 40.768 -73.982 40.766 -73.965 1.499697 0.126389
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N ... 2016-06-12 12 0 Sunday 40.739 -73.980 40.731 -73.999 1.806924 0.184167
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N ... 2016-01-19 19 12 Tuesday 40.764 -73.979 40.710 -74.005 6.390110 0.590000
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N ... 2016-04-06 6 19 Wednesday 40.720 -74.010 40.707 -74.012 1.486664 0.119167
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N ... 2016-03-26 26 13 Saturday 40.793 -73.973 40.783 -73.973 1.189521 0.120833

5 rows × 25 columns
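
The conversion from seconds to hours is a plain division by 3600, so it can also be written as a single vectorized expression. A minimal sketch using the five trip_duration values shown above (the standalone Series is built here only for illustration):

```python
import pandas as pd

# trip_duration values (seconds) from the five sample rows
trip_duration = pd.Series([455, 663, 2124, 429, 435])

# Vectorized equivalent of apply(lambda x: x / 3600)
trip_duration_in_hour = trip_duration / 3600

print(trip_duration_in_hour.round(6).tolist())
```

For example, 2124 seconds divided by 3600 gives 0.59 hours, which agrees with the third row of the output above.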

 


Go to the source code on GitHub