- Hands-On Ensemble Learning with R
- Prabhanjan Narayanachar Tattar
- 2021-07-23 19:10:51
Overseas visitors
The New Zealand overseas dataset is dealt with in detail in Chapter 10 of Tattar, et al. (2017). Here, the number of overseas visitors is recorded on a monthly basis from January 1977 to December 1995, which gives us visitor data for 228 months. The osvisit.dat file is available at multiple web links, including https://www.stat.auckland.ac.nz/~ihaka/courses/726-/osvisit.dat and https://github.com/AtefOuni/ts/blob/master/Data/osvisit.dat. It is also available in the book's code bundle. We will import the data into R, convert it into a time series object, and visualize it:
> osvisit <- read.csv("../Data/osvisit.dat", header = FALSE)
> osv <- ts(osvisit$V1, start = 1977, frequency = 12)
> class(osv)
[1] "ts"
> plot.ts(osv)

Figure 1: New Zealand overseas visitors
Here, the dataset is not partitioned! Time series data can't be arbitrarily partitioned into training and testing parts. The reason is quite simple: if we have five observations in time-sequential order, y1, y2, y3, y4, y5, and we believe that the order of impact is y1→y2→y3→y4→y5, an arbitrary subset such as y1, y2, y5 will behave differently. It won't carry the same information as three consecutive observations. Consequently, the partitioning of a time series has to preserve the dependency structure; we keep the most recent part of the series as the test data. For the five-observation example, we would take y1, y2, y3 as the training data and y4, y5 as the test data. The partitioning is simple, and we will cover this in Chapter 11, Ensembling Time Series Models.
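The order-preserving split described above can be sketched with base R's window() function. This is only an illustration under stated assumptions: osv_sim is a simulated stand-in with the same start, frequency, and length as the visitors series (it is not the osvisit.dat data), and the 12-month holdout is an arbitrary choice, not necessarily the split used in Chapter 11.

```r
# Simulated stand-in for the monthly series (NOT the real osvisit.dat data):
# 228 monthly values starting in January 1977, with a trend and a seasonal term.
set.seed(1)
n <- 228
osv_sim <- ts(100 + 0.5 * seq_len(n) +
                10 * sin(2 * pi * seq_len(n) / 12) +
                rnorm(n, sd = 5),
              start = c(1977, 1), frequency = 12)

# Keep the most recent 12 months as test data (the horizon is an assumption);
# everything before that is training data, so the time order is preserved.
osv_train <- window(osv_sim, end = c(1994, 12))
osv_test  <- window(osv_sim, start = c(1995, 1))

length(osv_train)  # number of training months
length(osv_test)   # number of test months
end(osv_train)     # last training period
start(osv_test)    # first test period
```

Because window() cuts on time rather than on randomly sampled row indices, the training part ends exactly where the test part begins, and no future observation leaks into the training data.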
Life testing experiments rarely yield complete observations. In reliability analysis, as well as in survival analysis and clinical trials, the units or patients are observed up to a predefined time, and a note is made of whether a specific event, usually failure or death, has occurred. A considerable fraction of the units will not have failed by the predefined time, and the analysis cannot wait for all of them to fail. One reason to curtail the study is that the time by which all units would have failed may be very large, and it would be expensive to continue the study until then. We are therefore left with incomplete observations: for the surviving units, we only know that the lifetime is at least the predefined time at which the study was called off, and the event of interest may occur at some point in the future. Such observations are said to be censored, and the data is referred to as censored data. Special statistical methods are required for the analysis of such datasets. We will give an example of these types of datasets next, and analyze them later, in Chapter 10, Ensembling Survival Models.
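Before turning to that example, the censoring mechanism itself can be sketched in a few lines of base R. Everything here is hypothetical and chosen only for illustration: the exponential lifetimes, the rate of 0.5, the study end time tau, and all variable names.

```r
# Hypothetical life test: 200 units with exponentially distributed lifetimes.
set.seed(42)
n_units  <- 200
tau      <- 2                           # predefined study end time (assumed)
lifetime <- rexp(n_units, rate = 0.5)   # true lifetimes, never fully observed

# What the study actually records: the lifetime if the unit failed before tau,
# otherwise only the fact that the unit survived up to tau.
observed_time <- pmin(lifetime, tau)
event <- as.integer(lifetime <= tau)    # 1 = failure observed, 0 = censored

censored_fraction <- mean(event == 0)   # share of units still alive at tau
censored_fraction

# Treating censored times as if they were failure times understates lifetimes.
mean(observed_time)   # naive average of the recorded times
mean(lifetime)        # average of the true (unobservable) lifetimes

head(data.frame(observed_time, event))
```

The pair (observed_time, event) is the usual way censored data is stored. The gap between the two averages is a small illustration of why censored observations need special statistical methods rather than being analyzed as complete lifetimes.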