- Hands-On Data Science with Anaconda
- Dr. Yuxing Yan, James Yan
- 2021-06-25 21:08:49
Dealing with missing data
First, let's look at the missing codes for different languages:

Table 3.7: Missing codes for R, Python, Julia, and Octave
For R, the missing code is NA. Here are several functions we could use to remove those missing observations, shown in an example:
> head(na_example,20)
[1] 2 1 3 2 1 3 1 4 3 2 2 NA 2 2 1 4 NA 1 1 2
> length(na_example)
[1] 1000
> x<-na.exclude(na_example)
> length(x)
[1] 855
> head(x,20)
[1] 2 1 3 2 1 3 1 4 3 2 2 2 2 1 4 1 1 2 1 2
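In the previous example, we removed 145 missing values (1000 - 855) by using the R function called na.exclude(). We could also use the apropos() function to find more functions dealing with missing values in R. Because the dot in the pattern "^na." matches any character, unrelated names such as names and nargs also appear in the result, as shown here: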
> apropos("^na.")
 [1] "na.action"              "na.contiguous"
 [3] "na.exclude"             "na.fail"
 [5] "na.omit"                "na.pass"
 [7] "na_example"             "names"
 [9] "names.POSIXlt"          "names<-"
[11] "names<-.POSIXlt"        "namespaceExport"
[13] "namespaceImport"        "namespaceImportClasses"
[15] "namespaceImportFrom"    "namespaceImportMethods"
[17] "napredict"              "naprint"
[19] "naresid"                "nargs"
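For comparison, a rough pandas counterpart of na.exclude() is dropna(). The following is only a minimal sketch: the short vector s is a hypothetical stand-in for na_example, and dropna() simply drops the missing entries without recording where they were.

```python
import numpy as np
import pandas as pd

# Hypothetical short vector standing in for na_example
s = pd.Series([2, 1, 3, np.nan, 2, np.nan, 1])

n_missing = int(s.isna().sum())   # count the missing values
clean = s.dropna()                # keep only the non-missing observations

print(n_missing)
print(len(s), len(clean))
```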
For Python, we first generate a dataset called z.csv with the R code given next. The program sets up to 100 randomly chosen values to zero (the random positions may repeat), and those zeros serve as our missing values:
set.seed(123)
n=500
x<-rnorm(n)
x2<-x
m=100
y<-as.integer(runif(m)*n)
x[y]<-0
z<-matrix(x,n/5,5)
outFile<-"c:/temp/z.csv"
write.table(z,file=outFile,quote=F,row.names=F,col.names=F,sep=',')
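Readers without R at hand could build a similar file in Python. The sketch below is an assumption-laden equivalent, not the authors' code: the NumPy generator will not reproduce the numbers from R's set.seed(123), and the output location (the system temporary directory) is a hypothetical choice.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(123)    # seed is arbitrary; values differ from R's
n, m = 500, 100
x = rng.standard_normal(n)          # n standard normal draws
idx = rng.integers(0, n, size=m)    # m random positions; duplicates are possible
x[idx] = 0                          # zeros stand in for missing values

# order="F" fills column by column, like R's matrix(x, n/5, 5)
z = pd.DataFrame(x.reshape((n // 5, 5), order="F"))

out_file = os.path.join(tempfile.gettempdir(), "z.csv")  # hypothetical location
z.to_csv(out_file, header=False, index=False)

n_zeros = int((z == 0).sum().sum())
print(z.shape, n_zeros)
```

Because the positions are drawn with replacement, the number of zeros is at most 100 and usually a little lower.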
The following Python program counts the missing values in each of the five columns. Later steps replace them with NaN or with the average of each column:
import scipy as sp
import pandas as pd
path="https://canisius.edu/~yany/data/"
dataSet="z.csv"
infile=path+dataSet
#infile="c:/temp/z.csv"
x=pd.read_csv(infile,header=None)
print(x.head())
# With header=None, pandas labels the five columns 0 through 4
print((x[[0,1,2,3,4]]==0).sum())
The related output is shown here:

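To see what the zero count looks like without downloading z.csv, here is a minimal sketch on a small, hypothetical DataFrame (the values are made up for illustration). Comparing the frame with 0 gives a table of True/False values, and sum() then counts the True entries in each column.

```python
import pandas as pd

# Hypothetical 4x3 frame; zeros play the role of missing values
df = pd.DataFrame([[0.5, 0.0, 1.2],
                   [0.0, 0.0, -0.7],
                   [1.1, 0.3, 0.0],
                   [-0.4, 2.0, 0.9]])

zero_counts = (df == 0).sum()   # True counts as 1, so this sums per column
print(zero_counts)
```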
At this stage, we just know that for the five columns, zero represents a missing value. Because the file is read with header=None, pandas labels the columns 0 through 4, so print((x[[0,1,2,3,4]] == 0).sum()) shows the number of zeros in each of the five columns. For instance, there are five zeros for the first column. We could replace those zeros with NaN, as shown here:
import numpy as np
x2=x.copy()   # work on a copy so that x itself is left unchanged
x2[[0,1,2,3,4]]=x2[[0,1,2,3,4]].replace(0,np.nan)
print(x2.head())
The output, with zeros replaced by NaN, is shown here:

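One detail worth checking when replacing values: in pandas, plain assignment binds a second name to the same DataFrame, so an independent copy needs .copy(). The sketch below, on a small hypothetical frame, shows the difference and then verifies the replacement with isna().

```python
import numpy as np
import pandas as pd

# Hypothetical data; zeros play the role of missing values
df = pd.DataFrame({0: [0.5, 0.0, 1.1], 1: [0.0, 0.3, 2.0]})

alias = df              # same object as df, not a copy
cleaned = df.copy()     # independent copy
cleaned[[0, 1]] = cleaned[[0, 1]].replace(0, np.nan)

print(alias is df)                   # the alias is the very same object
print(int(df.isna().sum().sum()))    # the original frame has no NaN
print(cleaned.isna().sum())          # NaN count per column in the copy
```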
If we plan to use the mean to replace those NaNs, we have the following code:
x3=x2.copy()   # copy, so that x2 keeps its NaNs
x3.fillna(x3.mean(),inplace=True)   # fill each NaN with its column mean
print(x3.head())
The output is shown here:

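The arithmetic behind mean imputation can be checked by hand on a tiny example. In this sketch (hypothetical values), mean() skips NaN by default, so the first column's mean is (1 + 3) / 2 = 2 and the second column's is (4 + 8) / 2 = 6. Passing that Series of means to fillna() fills each NaN with the mean of its own column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({0: [1.0, np.nan, 3.0], 1: [np.nan, 4.0, 8.0]})

col_means = df.mean()           # NaN is skipped by default
filled = df.fillna(col_means)   # each NaN gets the mean of its own column
print(col_means)
print(filled)
```

Filling with the mean leaves each column's mean unchanged but reduces its variance, which is worth keeping in mind before any later analysis.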