
Merging different datasets

First, let's generate some hypothetical datasets. Then we will try to merge them according to certain rules. The easiest way is to use Monte Carlo simulation to generate those datasets:

> set.seed(123) 
> nStocks<-4 
> nPeriods<-24 
> x<-runif(nStocks*nPeriods,min=-0.1,max=0.20) 
> a<-matrix(x,nPeriods,nStocks) 
> d1<-as.Date("2000-01-01") 
> d2<-as.Date("2001-12-01") 
> dd<-seq(d1,d2,"months") 
> stocks<-data.frame(dd,a) 
> colnames(stocks)<-c("DATE",paste('stock',1:nStocks,sep=''))  

In the code, the first line sets a random seed, which guarantees that anyone who uses the same seed gets the same random numbers. The runif() function draws random numbers from a uniform distribution. In effect, the preceding code generates two years of monthly returns for four stocks; the fifth column of the data frame is the date. The dim() and head() functions can be used to see the dimensions of the dataset and its first few lines, as shown here:

> dim(stocks) 
[1] 24  5 
> head(stocks) 
        DATE      stock1      stock2      stock3      stock4 
1 2000-01-01 -0.01372674  0.09671174 -0.02020821  0.11305472 
2 2000-02-01  0.13649154  0.11255914  0.15734831 -0.09981257 
3 2000-03-01  0.02269308  0.06321981 -0.08625065  0.04259497 
4 2000-04-01  0.16490522  0.07824261  0.03266002 -0.03396433 
5 2000-05-01  0.18214019 -0.01325208  0.13967745  0.01394496 
6 2000-06-01 -0.08633305 -0.05586591 -0.06343022  0.08383130  
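For readers who work in Python, the same kind of dataset can be simulated with NumPy and pandas. This is only a rough sketch, not a port of the R code: the seed and the `rng`, `stocks`, and `dates` names are illustrative assumptions, and because NumPy's generator differs from R's, the numbers will not match the R output above.

```python
import numpy as np
import pandas as pd

n_stocks, n_periods = 4, 24
rng = np.random.default_rng(123)  # assumed seed; values will NOT match R's runif()

# Draw uniform returns in [-0.1, 0.2) and fill column by column, as R's matrix() does
x = rng.uniform(-0.1, 0.2, size=n_stocks * n_periods)
a = x.reshape((n_periods, n_stocks), order="F")

# Month-start dates from 2000-01-01 through 2001-12-01 (24 months)
dates = pd.date_range("2000-01-01", "2001-12-01", freq="MS")

stocks = pd.DataFrame(a, columns=[f"stock{i}" for i in range(1, n_stocks + 1)])
stocks.insert(0, "DATE", dates)

print(stocks.shape)   # 24 rows, 5 columns (DATE plus four stocks)
print(stocks.head())
```

The `order="F"` argument is there only to mirror R's column-major fill; for independent random draws it makes no statistical difference.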

Similarly, we could get the market returns, shown in the code here:

> d3<-as.Date("1999-01-01") 
> d4<-as.Date("2010-12-01") 
> dd2<-seq(d3,d4,"months") 
> y<-runif(length(dd2),min=-0.05,max=0.1) 
> market<-data.frame(dd2,y) 
> colnames(market)<-c("DATE","MKT") 

To make the merge more interesting, we deliberately let the market returns cover a longer period (January 1999 to December 2010) than the stock returns. The dimensions of the market dataset and its first two lines are shown here:

> dim(market) 
[1] 144   2 
> head(market,2) 
        DATE          MKT 
1 1999-01-01  0.047184022 
2 1999-02-01 -0.002026907 

To merge them, we have the following code:

> final<-merge(stocks,market) 
> dim(final) 
[1] 24  6 
> head(final,2) 
        DATE      stock1     stock2      stock3      stock4        MKT 
1 2000-01-01 -0.01372674 0.09671174 -0.02020821  0.11305472 0.05094986 
2 2000-02-01  0.13649154 0.11255914  0.15734831 -0.09981257 0.06056166 
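The merged dataset has 24 rows because only the 24 months present in both inputs survive. A pandas sketch of the same situation is given below, under the assumption of a single stock column and an arbitrary seed (both simplifications for illustration, with values that differ from the R output). As with R's merge(), calling pd.merge() without naming a key joins on the column the two frames share.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)  # assumed seed; values differ from the R output

stock_dates = pd.date_range("2000-01-01", "2001-12-01", freq="MS")   # 24 months
market_dates = pd.date_range("1999-01-01", "2010-12-01", freq="MS")  # 144 months

stocks = pd.DataFrame({"DATE": stock_dates,
                       "stock1": rng.uniform(-0.1, 0.2, len(stock_dates))})
market = pd.DataFrame({"DATE": market_dates,
                       "MKT": rng.uniform(-0.05, 0.1, len(market_dates))})

# With no `on` argument, pandas joins on the shared column (DATE here)
# and defaults to an inner join.
final = pd.merge(stocks, market)

print(final.shape)                                  # one row per overlapping month
print(final["DATE"].min(), final["DATE"].max())     # overlap starts and ends with the stock data
```

The 120 market months outside 2000 to 2001 are dropped because they have no matching row in the stock data.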

To find out more about the R merge() function, type help(merge). Its all, all.x, and all.y arguments select an inner, left, right, or outer merge. The default used in the previous case is an inner merge, which keeps only the observations whose key values appear in both datasets.

The following Python program shows this concept clearly:

import pandas as pd

# Two small datasets that share only the years 2011 and 2013
x = pd.DataFrame({'YEAR': [2010, 2011, 2012, 2013],
                  'FirmA': [0.2, -0.3, 0.13, -0.2],
                  'FirmB': [0.1, 0, 0.05, 0.23]})
y = pd.DataFrame({'YEAR': [2011, 2013, 2014, 2015],
                  'FirmC': [0.12, 0.23, 0.11, -0.1],
                  'SP500': [0.1, 0.17, -0.05, 0.13]})

# The default is an inner merge; `how` selects the other types
print("\n inner merge")
print(pd.merge(x, y, on='YEAR'))
print("\n outer merge")
print(pd.merge(x, y, on='YEAR', how='outer'))
print("\n left merge")
print(pd.merge(x, y, on='YEAR', how='left'))
print("\n right merge")
print(pd.merge(x, y, on='YEAR', how='right'))

The related output is shown here:
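A compact way to read such output is to list which YEAR values each merge type keeps. The sketch below is an illustrative check, not part of the original program: it trims each frame to one value column for brevity, and the `kept` dictionary is a hypothetical helper. It relies on the `indicator=True` option of pd.merge(), which adds a `_merge` column recording whether a row came from the left frame only, the right frame only, or both.

```python
import pandas as pd

# Trimmed versions of the two frames (one value column each, for brevity)
x = pd.DataFrame({"YEAR": [2010, 2011, 2012, 2013],
                  "FirmA": [0.2, -0.3, 0.13, -0.2]})
y = pd.DataFrame({"YEAR": [2011, 2013, 2014, 2015],
                  "SP500": [0.1, 0.17, -0.05, 0.13]})

# Collect the years retained by each merge type
kept = {}
for how in ["inner", "outer", "left", "right"]:
    merged = pd.merge(x, y, on="YEAR", how=how)
    kept[how] = sorted(merged["YEAR"].tolist())
    print(how, kept[how])

# indicator=True labels each row as left_only, right_only, or both
outer = pd.merge(x, y, on="YEAR", how="outer", indicator=True)
print(outer[["YEAR", "_merge"]])
```

The inner merge keeps only 2011 and 2013, the two years present in both frames. The left merge keeps the four years of x, the right merge keeps the four years of y, and the outer merge keeps all six years, with missing values where one side has no observation.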
