- Hands-On Data Science with Anaconda
- Dr. Yuxing Yan, James Yan
- 2021-06-25 21:08:50
Merging different datasets
First, let's generate some hypothetical datasets. Then we will try to merge them according to certain rules. The easiest way is to use Monte Carlo simulation to generate those datasets:
> set.seed(123)
> nStocks<-4
> nPeriods<-24
> x<-runif(nStocks*nPeriods,min=-0.1,max=0.20)
> a<-matrix(x,nPeriods,nStocks)
> d1<-as.Date("2000-01-01")
> d2<-as.Date("2001-12-01")
> dd<-seq(d1,d2,"months")
> stocks<-data.frame(dd,a)
> colnames(stocks)<-c("DATE",paste('stock',1:nStocks,sep=''))
In the code, the first line sets a random seed, which guarantees that anyone who uses the same seed will get the same random numbers. The runif() function draws random numbers from a uniform distribution, here between -0.1 and 0.20. In effect, the preceding code generates two years of monthly returns (24 observations) for four stocks. The dim() and head() functions show the dimensions of the dataset and its first few rows, as shown here:
> dim(stocks)
[1] 24 5
> head(stocks)
        DATE      stock1      stock2      stock3      stock4
1 2000-01-01 -0.01372674  0.09671174 -0.02020821  0.11305472
2 2000-02-01  0.13649154  0.11255914  0.15734831 -0.09981257
3 2000-03-01  0.02269308  0.06321981 -0.08625065  0.04259497
4 2000-04-01  0.16490522  0.07824261  0.03266002 -0.03396433
5 2000-05-01  0.18214019 -0.01325208  0.13967745  0.01394496
6 2000-06-01 -0.08633305 -0.05586591 -0.06343022  0.08383130
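For readers working in Python, the same kind of dataset can be sketched with NumPy and pandas. This is a minimal sketch rather than a translation of the R session: the variable names (`n_stocks`, `n_periods`, `rng`, `returns`) are illustrative assumptions, and NumPy's generator will not reproduce the numbers that R's runif() prints above, even with the same seed value.

```python
import numpy as np
import pandas as pd

n_stocks = 4
n_periods = 24

# A seeded generator makes the draws reproducible across runs in NumPy.
# Assumption: the values differ from R's runif() output for seed 123.
rng = np.random.default_rng(123)

# Uniform returns between -0.10 and 0.20, one column per stock.
returns = rng.uniform(-0.10, 0.20, size=(n_periods, n_stocks))

# Month-start dates from 2000-01-01 through 2001-12-01.
dates = pd.date_range("2000-01-01", periods=n_periods, freq="MS")

stocks = pd.DataFrame(
    returns,
    columns=[f"stock{i}" for i in range(1, n_stocks + 1)],
)
stocks.insert(0, "DATE", dates)

print(stocks.shape)   # dimensions, like dim() in R
print(stocks.head())  # first rows, like head() in R
```

As in the R version, the frame has 24 rows and 5 columns: the DATE column plus one column for each of the four stocks.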
Similarly, we could get the market returns, shown in the code here:
> d3<-as.Date("1999-01-01")
> d4<-as.Date("2010-12-01")
> dd2<-seq(d3,d4,"months")
> y<-runif(length(dd2),min=-0.05,max=0.1)
> market<-data.frame(dd2,y)
> colnames(market)<-c("DATE","MKT")
To make the merge more interesting, the market series is deliberately longer than the stock data: it covers 144 months, from January 1999 to December 2010. Its dimensions and first two rows are shown here:
> dim(market)
[1] 144 2
> head(market,2)
        DATE          MKT
1 1999-01-01  0.047184022
2 1999-02-01 -0.002026907
To merge them, we have the following code:
> final<-merge(stocks,market)
> dim(final)
[1] 24 6
> head(final,2)
        DATE      stock1     stock2      stock3      stock4        MKT
1 2000-01-01 -0.01372674 0.09671174 -0.02020821  0.11305472 0.05094986
2 2000-02-01  0.13649154 0.11255914  0.15734831 -0.09981257 0.06056166
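The same merge can be sketched in pandas. The example below is an illustration under stated assumptions, not the book's code: it rebuilds both frames with a seeded NumPy generator (so the values differ from the R output) and joins them on the shared DATE column with `pd.merge`, whose default is an inner join.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)  # assumption: values will not match the R session

# 24 monthly observations for four hypothetical stocks.
stock_dates = pd.date_range("2000-01-01", "2001-12-01", freq="MS")
stocks = pd.DataFrame(
    rng.uniform(-0.10, 0.20, size=(len(stock_dates), 4)),
    columns=["stock1", "stock2", "stock3", "stock4"],
)
stocks.insert(0, "DATE", stock_dates)

# 144 monthly observations for the market, a longer span than the stocks.
market_dates = pd.date_range("1999-01-01", "2010-12-01", freq="MS")
market = pd.DataFrame(
    {"DATE": market_dates,
     "MKT": rng.uniform(-0.05, 0.10, size=len(market_dates))}
)

# Inner join on DATE: only months present in both frames are kept.
final = pd.merge(stocks, market, on="DATE")

print(stocks.shape, market.shape, final.shape)
```

Although the market frame has 144 rows, the merged frame has 24 rows and 6 columns, because only the 24 months that appear in both frames are retained.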
To find out more about the R merge() function, type help(merge). Through its all, all.x, and all.y arguments, we can specify an inner, left, right, or outer merge. The default setting, used in the previous case, is an inner merge: it keeps only the observations whose keys exist in both datasets, which is why the result has 24 rows rather than 144.
The following Python program shows this concept clearly:
import pandas as pd

# Two small frames that share only the years 2011 and 2013
x = pd.DataFrame({'YEAR': [2010, 2011, 2012, 2013],
                  'FirmA': [0.2, -0.3, 0.13, -0.2],
                  'FirmB': [0.1, 0, 0.05, 0.23]})
y = pd.DataFrame({'YEAR': [2011, 2013, 2014, 2015],
                  'FirmC': [0.12, 0.23, 0.11, -0.1],
                  'SP500': [0.1, 0.17, -0.05, 0.13]})

print("\n inner merge ")   # rows whose YEAR is in both x and y
print(pd.merge(x, y, on='YEAR'))
print("\n outer merge ")   # rows whose YEAR is in either x or y
print(pd.merge(x, y, on='YEAR', how='outer'))
print("\n left merge ")    # every row of x, matched with y where possible
print(pd.merge(x, y, on='YEAR', how='left'))
print("\n right merge ")   # every row of y, matched with x where possible
print(pd.merge(x, y, on='YEAR', how='right'))
The related output is shown here:
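Which rows each merge type keeps can also be checked programmatically. The following sketch reuses the x and y frames from the program above; the helper `merged_years` is a hypothetical name introduced here for illustration, and it simply returns the sorted YEAR keys that survive a given merge.

```python
import pandas as pd

x = pd.DataFrame({'YEAR': [2010, 2011, 2012, 2013],
                  'FirmA': [0.2, -0.3, 0.13, -0.2],
                  'FirmB': [0.1, 0, 0.05, 0.23]})
y = pd.DataFrame({'YEAR': [2011, 2013, 2014, 2015],
                  'FirmC': [0.12, 0.23, 0.11, -0.1],
                  'SP500': [0.1, 0.17, -0.05, 0.13]})


def merged_years(how):
    """Return the sorted YEAR keys kept by pd.merge for the given join type."""
    merged = pd.merge(x, y, on='YEAR', how=how)
    return sorted(merged['YEAR'].tolist())


for how in ['inner', 'outer', 'left', 'right']:
    print(how, merged_years(how))
```

The inner merge keeps only 2011 and 2013, the two years common to both frames. The outer merge keeps all six years from 2010 to 2015, the left merge keeps the four years of x, and the right merge keeps the four years of y. In the outer, left, and right results, columns from the frame that lacks a given year are filled with NaN.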
