
Dealing with missing data

First, let's look at the missing codes for different languages:

Table 3.7: Missing codes for R, Python, Julia, and Octave

For R, the missing code is NA. Several functions remove missing observations; the following example uses one of them:

> head(na_example,20) 
[1]  2  1  3  2  1  3  1  4  3  2  2 NA  2  2  1  4 NA  1  1  2 
> length(na_example) 
[1] 1000 
> x<-na.exclude(na_example) 
> length(x) 
[1] 855 
> head(x,20) 
[1] 2 1 3 2 1 3 1 4 3 2 2 2 2 1 4 1 1 2 1 2 

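In the previous example, the R function na.exclude() removed 145 missing values, reducing the vector from 1,000 to 855 observations. We could also use the apropos() function to find more R functions that deal with missing values. Note that in the pattern "^na." the dot matches any character, so the result also lists unrelated names such as names and nargs: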

 > apropos("^na.") 
 [1] "na.action"              "na.contiguous"          
 [3] "na.exclude"             "na.fail"                
 [5] "na.omit"                "na.pass"                
 [7] "na_example"             "names"                  
 [9] "names.POSIXlt"          "names<-"                
[11] "names<-.POSIXlt"        "namespaceExport"        
[13] "namespaceImport"        "namespaceImportClasses" 
[15] "namespaceImportFrom"    "namespaceImportMethods" 
[17] "napredict"              "naprint"                
[19] "naresid"                "nargs" 
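For comparison, the same drop-the-missing-values step can be sketched in plain Python. This is only a minimal illustration, not a counterpart that ships with any library: it reuses the first 20 values of na_example printed above, writes NA as float("nan"), and the helper name drop_missing is hypothetical.

```python
import math

# First 20 values of na_example as printed above; NA is written as NaN.
nan = float("nan")
values = [2, 1, 3, 2, 1, 3, 1, 4, 3, 2, 2, nan, 2, 2, 1, 4, nan, 1, 1, 2]


def drop_missing(data):
    """Return a new list without NaN entries (hypothetical helper)."""
    return [v for v in data if not (isinstance(v, float) and math.isnan(v))]


cleaned = drop_missing(values)
n_removed = len(values) - len(cleaned)
print(len(values), len(cleaned), n_removed)
```

As with na.exclude(), the original values are left untouched and a shorter copy is returned.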
 

For Python, we first need a dataset. The R code given next generates one called z.csv, in which zeros stand for missing values. The program sets up to 100 randomly chosen entries to zero; repeated indices, or an index of 0, which R ignores, can make the count smaller:

set.seed(123)
n=500
x<-rnorm(n)
x2<-x
m=100
y<-as.integer(runif(m)*n)
x[y]<-0
z<-matrix(x,n/5,5)
outFile<-"c:/temp/z.csv"
write.table(z,file=outFile,quote=F,row.names=F,col.names=F,sep=',')
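Readers without R could build a similar file in Python. The sketch below is an assumption-laden analogue rather than a port: the function name make_dataset and the output file name z_sketch.csv are hypothetical, it writes to the system temporary directory, and Python's random generator will not reproduce the numbers that R's set.seed(123) gives.

```python
import csv
import os
import random
import tempfile


def make_dataset(n=500, m=100, ncol=5, seed=123):
    """Return n/ncol rows of ncol normal draws with up to m entries zeroed."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Indices are drawn with replacement, so repeats can leave fewer than m zeros.
    for _ in range(m):
        x[rng.randrange(n)] = 0.0
    nrow = n // ncol
    # Fill column by column, which is the default layout of R's matrix().
    return [[x[j * nrow + i] for j in range(ncol)] for i in range(nrow)]


rows = make_dataset()
out_file = os.path.join(tempfile.gettempdir(), "z_sketch.csv")  # hypothetical path
with open(out_file, "w", newline="") as f:
    csv.writer(f).writerows(rows)  # no header row, comma-separated

n_zeros = sum(v == 0.0 for row in rows for v in row)
print(len(rows), len(rows[0]), n_zeros)
```

The resulting file has 100 rows and 5 columns without a header, so it can be read with pd.read_csv(out_file, header=None).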

The following Python program counts the missing values in each of the five columns. Later steps replace them with NaN and then with the mean of each column:

import scipy as sp
import pandas as pd
path="https://canisius.edu/~yany/data/"
dataSet="z.csv"
infile=path+dataSet
#infile="c:/temp/z.csv"
x=pd.read_csv(infile,header=None)
print(x.head())
# with header=None, pandas labels the five columns 0 to 4
print((x[[0,1,2,3,4]]==0).sum())

The related output is shown here:

At this stage, we know only that zero represents a missing value in each of the five columns. The print((x[[0,1,2,3,4]] == 0).sum()) call shows the number of zeros in each column; because the file is read with header=None, pandas labels the columns 0 to 4. For instance, there are five zeros in the first column. We could replace those zeros with NaN, as shown here:

import numpy as np
# copy first, so that x itself keeps its zeros
x2=x.copy()
x2[[0,1,2,3,4]] = x2[[0,1,2,3,4]].replace(0, np.nan)
print(x2.head())

The output, with the zeros replaced by NaN, is shown here:

If we plan to use the mean to replace those NaNs, we have the following code:

# fillna() returns a new DataFrame, so x2 keeps its NaNs
x3=x2.fillna(x2.mean())
print(x3.head())

The output is shown here:
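Because the listings above depend on an external file, the three steps (count the zeros, replace them with NaN, fill with column means) can also be tried on a tiny in-memory table. The sketch below assumes pandas and NumPy are installed; the data values and the names df, with_nan, and filled are made up for illustration and are not taken from z.csv.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; zeros stand in for missing values, as in z.csv.
df = pd.DataFrame({0: [1.0, 0.0, 3.0], 1: [0.0, 2.0, 4.0]})

zero_counts = (df == 0).sum()              # number of zeros per column
with_nan = df.replace(0, np.nan)           # returns a new frame; df is unchanged
filled = with_nan.fillna(with_nan.mean())  # mean() skips NaN by default

print(zero_counts.tolist())
print(filled)
```

Here the first column becomes 1, 2, 3 and the second becomes 3, 2, 4, since each NaN is replaced by the mean of the remaining values in its own column. Building each step as a new object, rather than assigning x2=x and modifying in place, keeps the earlier stages available for inspection.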
