- Python Data Analysis(Second Edition)
- Armando Fandango
- 537字
- 2021-07-09 19:04:08
Handling missing values
We regularly encounter empty fields in data records. It's best that we accept this and learn how to handle this kind of issue in a robust manner. Real data can not only have gaps-it can also have wrong values, because of faulty measuring equipment, for example. In Pandas, missing numerical values will be designated as NaN
, objects as None
, and the datetime64
objects as NaT
. The outcome of arithmetic operations with NaN
values is also NaN
. Descriptive statistics methods, such as summation and average, behave differently. As we observed in an earlier example, in such a case, NaN
values are treated as zero values. However, if all the values are NaN
during, say, summation, the sum returned is still NaN
. In aggregation operations, NaN
values in the column that we group are ignored. We will again load the WHO_first9cols.csv
file into a DataFrame. Remember that this file contains empty fields. Let's only select the first three rows, including the headers of the Country
and Net primary school enrolment ratio male (%)
columns as follows:
df = df[['Country', df.columns[-2]]][:2] print("New df\n", df)
We get a DataFrame with two NaN values:
New df Country Net primary school enrolment ratio male (%) 0 Afghanistan NaN 1 Albania 94 [2 rows x 2 columns]
The Pandas isnull()
function checks for missing values as follows:
print("Null Values\n", pd.isnull(df))
The output for our DataFrame is as follows:
Null Values Country Net primary school enrolment ratio male (%) 0 False True 1 False False
To count the number of NaN
values for each column, we can sum the Boolean values returned by isnull()
. This works because, during summation, True
values are considered as ones and False
values are treated as zeros:
Total Null Values Country 0 Net primary school enrolment ratio male (%) 1 dtype: int64
Likewise, we can check with the DataFrame notnull()
method for any non-missing values that are present:
print("Not Null Values\n", df.notnull())
The result of the notnull()
method is the opposite of the isnull()
function:
Not Null Values Country Net primary school enrolment ratio male (%) 0 True False 1 True True
When we double values in a DataFrame that has NaN
values, the product will still contain NaN
values, since doubling is an arithmetic operation:
print("Last Column Doubled\n", 2 * df[df.columns[-1]])
We double the last column, which contains numerical values (doubling string values repeats the string):
Last Column Doubled 0 NaN 1 188 Name: Net primary school enrolment ratio male (%), dtype: float64
If we add a NaN
value, however, the NaN
value wins:
print("Last Column plus NaN\n", df[df.columns[-1]] + np.nan)
As you can see, the NaN
values declared total victory:
Last Column plus NaN 0 NaN1 NaN Name: Net primary school enrolment ratio male (%), dtype: float64
Replace the missing values with a scalar value. For example, replace 0
with the fillna()
method (we can't always replace missing values with zeros, but sometimes this is good enough):
print("Zero filled\n", df.fillna(0))
The effect of the preceding line is to replace the NaN value with 0:
Zero filled Country Net primary school enrolment ratio male (%) 0 Afghanistan 0 1 Albania 94
The code for this section is in the ch-03.ipynb
file of this book's code bundle:
- 國際大學生程序設計競賽中山大學內部選拔真題解(二)
- Oracle 11g從入門到精通(第2版) (軟件開發視頻大講堂)
- Web Scraping with Python
- 零基礎玩轉區塊鏈
- 深入淺出WPF
- Python Game Programming By Example
- SEO實戰密碼
- 小程序,巧應用:微信小程序開發實戰(第2版)
- 寫給程序員的Python教程
- Vue.js 3應用開發與核心源碼解析
- Python青少年趣味編程
- Learning Jakarta Struts 1.2: a concise and practical tutorial
- C Primer Plus(第6版)中文版【最新修訂版】
- 前端程序員面試算法寶典
- Learning SaltStack(Second Edition)