
Handling time values

Time values play an important role in our model because time is both a feature and the target (the value to be predicted). First, we need to convert the pickup and dropoff times into pandas datetime values so that we can calculate the target value: the natural log of the trip duration, that is, the difference between the dropoff and pickup times in seconds:

df["trip_duration"] = np.log((df.Lpep_dropoff_datetime - df.lpep_pickup_datetime).dt.seconds + 1)

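In the preceding line of code, we add 1 second to the trip duration because the log of zero is undefined (NumPy returns -inf for it). Without this offset, any trip with a duration of 0 seconds would produce an unusable target value when the log transformation is applied.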
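As a minimal sketch of the conversion and the target calculation together, assume the two datetime columns are read in as strings; the two-row DataFrame below is made up for illustration. The sketch uses `np.log1p`, which computes log(1 + x), and `dt.total_seconds()` instead of `dt.seconds`. The latter returns only the seconds component of each timedelta, so it would drop whole days for any trip recorded as longer than 24 hours.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real values come from the taxi trip data.
df = pd.DataFrame({
    "lpep_pickup_datetime": ["2016-01-01 00:10:00", "2016-01-01 08:00:00"],
    "Lpep_dropoff_datetime": ["2016-01-01 00:25:30", "2016-01-01 08:00:00"],
})

# Parse the string columns into pandas datetime values.
for col in ["lpep_pickup_datetime", "Lpep_dropoff_datetime"]:
    df[col] = pd.to_datetime(df[col])

# Trip length in seconds. total_seconds() includes whole days, whereas
# .dt.seconds returns only the seconds component of each timedelta.
seconds = (df["Lpep_dropoff_datetime"] - df["lpep_pickup_datetime"]).dt.total_seconds()

# log1p(x) computes log(1 + x), so a zero-length trip maps to 0 instead of -inf.
df["trip_duration"] = np.log1p(seconds)
```

The first row is a 930-second trip and the second is a zero-length trip, which shows why the 1-second offset matters.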

But why do we apply a natural log transformation to the trip duration? There are three reasons:

  • For the Kaggle competition on New York taxi trip duration prediction, the evaluation metric is the Root Mean Squared Logarithmic Error (RMSLE). When we log-transform the target values and then calculate the RMSE over them, we get the RMSLE. This lets us compare our results with those of the best-performing teams.
  • Errors on a log scale tell us by what factor we were wrong, for example, whether we were 10% off from the actual values or 70% off. We will discuss this in detail in the Error metric section.
  • After the log transformation, the target variable follows an approximately normal distribution. This is in line with one of the assumptions of linear regression. The plot of the trip duration values (on a log scale) looks as follows:
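The first two reasons in the list above can be checked numerically. The sketch below uses made-up durations in seconds and two hypothetical helper functions, `rmse` and `rmsle`. It assumes the common definition of RMSLE based on log(1 + x), which matches the 1-second offset used for the target.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x)."""
    log_true = np.log1p(np.asarray(y_true, dtype=float))
    log_pred = np.log1p(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((log_pred - log_true) ** 2)))

# Made-up trip durations in seconds, for illustration only.
actual = np.array([300.0, 900.0, 1800.0])
predicted = np.array([330.0, 810.0, 1800.0])

# RMSLE on the raw durations equals RMSE on the log-transformed durations.
direct = rmsle(actual, predicted)
via_log_target = rmse(np.log1p(actual), np.log1p(predicted))

# A difference on the log scale is a ratio on the original scale:
# exp(log error) is the factor by which each prediction is off.
ratio = np.exp(np.log1p(predicted) - np.log1p(actual))
```

Here `direct` and `via_log_target` agree up to floating-point rounding, and `ratio` is roughly 1.10 for the first trip (about 10% too high), roughly 0.90 for the second (about 10% too low), and 1 for the third.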