- Geospatial Data Science Quick Start Guide
- Abdishakur Hassan Jayakrishnan Vijayaraghavan
Handling missing values
Some machine learning algorithms, such as random forest, can tolerate a few missing values (depending on the implementation), and in some cases we can adopt strategies such as imputing the missing values or removing the affected rows. But if the proportion of missing values in a column is high, we might need to remove the entire column. The following lines of code compute the fraction of missing values in each column of the data, where a value of 1.0 means that 100% of the rows are missing:
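# Fraction of missing values in each column (the mean of a boolean mask)
na_counts = df.isna().mean().to_frame("null_row_pct")
# Keep only the columns that have missing values, most affected first
na_counts[na_counts.null_row_pct > 0].sort_values(by="null_row_pct", ascending=False)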
The resulting DataFrame lists each column that contains missing values along with its null_row_pct value, sorted from highest to lowest.
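The column-removal strategy mentioned earlier can be sketched as follows. This is a minimal illustration rather than the book's own code: the toy DataFrame, the helper name drop_sparse_columns, and the 0.5 threshold are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the taxi data; the values are made up.
toy = pd.DataFrame({
    "fare": [5.0, 7.5, np.nan, 9.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "complete": [1, 2, 3, 4],
})

def drop_sparse_columns(frame, max_null_fraction=0.5):
    """Return a copy of frame without columns whose missing fraction exceeds the threshold.

    Hypothetical helper; the threshold is a judgment call, not a fixed rule.
    """
    null_fraction = frame.isna().mean()
    keep = null_fraction[null_fraction <= max_null_fraction].index
    return frame[keep].copy()

trimmed = drop_sparse_columns(toy, max_null_fraction=0.5)
print(list(trimmed.columns))
```

A suitable threshold depends on how important the column is and on whether its values can be recovered from another source, as turns out to be the case for the coordinate columns here.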

At first glance, we might be inclined to remove all rows that have missing latitude or longitude values for pickup and dropoff, since we identified these as the major features our model will be built upon. But on closer inspection, we can see that the percentages of missing values for the PULocationID and DOLocationID columns and for the Pickup_longitude/Pickup_latitude and Dropoff_longitude/Dropoff_latitude columns are exact complements of each other. That is, if we take one column from each group, their missing percentages add up to exactly 100%. From this we can infer that for each missing value in the pickup or dropoff coordinates, there is a non-missing value in the corresponding row for PULocationID or DOLocationID.
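Complementary column percentages suggest this inference but do not prove it row by row, so it can be worth checking directly. The sketch below does so on a made-up three-row frame that reuses the pickup column names from the text. The variable names are illustrative, and the same check would apply to the dropoff columns.

```python
import numpy as np
import pandas as pd

# Toy rows: each one carries either coordinates or a zone ID (values are made up).
toy = pd.DataFrame({
    "Pickup_longitude": [-73.98, np.nan, -73.95],
    "Pickup_latitude": [40.75, np.nan, 40.78],
    "PULocationID": [np.nan, 161.0, np.nan],
})

coords_missing = toy["Pickup_latitude"].isna() | toy["Pickup_longitude"].isna()
zone_missing = toy["PULocationID"].isna()

# XOR is True when exactly one of the two location sources is missing in a row.
exactly_one_missing = coords_missing ^ zone_missing

# Rows with neither coordinates nor a zone ID cannot be located at all.
rows_with_neither_source = int((coords_missing & zone_missing).sum())

print(exactly_one_missing.all(), rows_with_neither_source)
```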
But what are these location IDs? They are the taxi zone IDs assigned to different areas of New York. Though these zones are areal features, we can calculate the centroid of each zone and substitute it for the missing pickup or dropoff coordinates. But when both the location ID and the coordinates are missing, we need to remove those rows. The following lines of code accomplish this for the dropoff columns:
# Drop rows where both the dropoff coordinates and the dropoff zone ID are missing
df = df[~(df.Dropoff_latitude.isna() & df.DOLocationID.isna())]
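The centroid substitution described above could look like the following sketch. It assumes the zone centroids have already been computed (for example, from the taxi zone polygons with a GIS library) and stored in a lookup table. The zone_centroids table, its lon and lat columns, and all values are hypothetical and are not taken from the book.

```python
import numpy as np
import pandas as pd

# Hypothetical centroid lookup keyed by zone ID. The keys are floats because
# an ID column that contains NaN is stored as float64 in pandas.
zone_centroids = pd.DataFrame(
    {"lon": [-73.97, -73.99], "lat": [40.76, 40.73]},
    index=pd.Index([161.0, 79.0], name="LocationID"),
)

# Toy trips: one with coordinates, one with only a zone ID, one with neither.
trips = pd.DataFrame({
    "Dropoff_longitude": [-73.95, np.nan, np.nan],
    "Dropoff_latitude": [40.78, np.nan, np.nan],
    "DOLocationID": [np.nan, 161.0, np.nan],
})

# Remove rows that have neither coordinates nor a zone ID.
trips = trips[~(trips.Dropoff_latitude.isna() & trips.DOLocationID.isna())].copy()

# Fill the remaining coordinate gaps with the centroid of the recorded zone.
trips["Dropoff_longitude"] = trips["Dropoff_longitude"].fillna(
    trips["DOLocationID"].map(zone_centroids["lon"])
)
trips["Dropoff_latitude"] = trips["Dropoff_latitude"].fillna(
    trips["DOLocationID"].map(zone_centroids["lat"])
)

print(trips)
```

The same pattern would apply to the pickup columns with PULocationID. Existing coordinates are left untouched because fillna only writes into missing cells.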