官术网_书友最值得收藏!

Correlation and linearity

For this task, we return to our old friend the caret package. We'll start by creating a correlation matrix, using the Spearman Rank method, then apply the findCorrelation() function for all correlations above 0.9:

df_corr <- cor(gettysburg_treated, method = "spearman")

high_corr <- caret::findCorrelation(df_corr, cutoff = 0.9)
Why Spearman versus Pearson correlation? Spearman is free from any distribution assumptions and is robust enough for any task at hand:  http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/ .

The high_corr object is a list of integers that correspond to feature column numbers. Let's dig deeper into this:

high_corr

The output of the preceding code is as follows:

[1]   9   4    22    43   3   5

The column indices refer to the following feature names:

colnames(gettysburg_treated)[c(9, 4, 22, 43, 3, 5)]

The output of the preceding code is as follows:

[1]                       "total_casualties"       "wounded" "type_lev_x_Artillery"
[4] "army_lev_x_Confederate" "killed_isNA" "wounded_isNA"

We saw the features that're highly correlated to some other feature. For instance, army_lev_x_Confederate is perfectly and negatively correlation with army_lev_x_Union. After all, you can only two armies here, and Colonel Fremantle of the British Coldstream Guards was merely an observer. To delete these features, just filter your dataframe by the list we created:

gettysburg_noHighCorr <- gettysburg_treated[, -high_corr]

There you go, they're now gone. But wait! That seems a little too clinical, and maybe we should apply our judgment or the judgment of an SME to the problem? As before, let's create a tibble for further exploration:

df_corr <- data.frame(df_corr)

df_corr$feature1 <- row.names(df_corr)

gettysburg_corr <-
tidyr::gather(data = df_corr,
key = "feature2",
value = "correlation",
-feature1)

gettysburg_corr <-
gettysburg_corr %>%
dplyr::filter(feature1 != feature2)

What just happened? First of all, the correlation matrix was turned into a dataframe. Then, the row names became the values for the first feature. Using tidyr, the code created the second feature and placed the appropriate value with an observation, and we cleaned it up to get unique pairs. This screenshot shows the results. You can see that the Confederate and Union armies have a perfect negative correlation:

You can see that it would be safe to dedupe on correlation as we did earlier. I like to save this to a spreadsheet and work with SMEs to understand what features we can drop or combine and so on. 

After handling the correlations, I recommend exploring and removing as needed linear combinations. Dealing with these combinations is a similar methodology to high correlations:

linear_combos <- caret::findLinearCombos(gettysburg_noHighCorr)

linear_combos

The output of the preceding code is as follows:

$`linearCombos`
$`linearCombos`[[1]]
[1] 16 7 8 9 10 11 12 13 14 15

$remove
[1] 16

The output tells us that feature column 16 is linearly related to those others, and we can solve the problem by removing it. What are these feature names? Let's have a look:

colnames(gettysburg_noHighCorr)[c(16, 7, 8, 9, 10, 11, 12, 13, 14, 15)]

The output of the preceding code is as follows:

[1]            "total_guns"         "X3inch_rifles" "X10lb_parrots"    "X12lb_howitzers" "X12lb_napoleons"
[6] "X6lb_howitzers" "X24lb_howitzers" "X20lb_parrots" "X12lb_whitworths" "X14lb_rifles"

Removing the feature on the number of "total_guns" will solve the problem. This makes total sense since it's the number of guns in an artillery battery. Most batteries, especially in the Union, had only one type of gun. Even with multiple linear combinations, it's an easy task with this bit of code to get rid of the necessary features:

linear_remove <- colnames(gettysburg_noHighCorr[16])

df <- gettysburg_noHighCorr[, !(colnames(gettysburg_noHighCorr) %in% linear_remove)]

dim(df)

The output of the preceding code is as follows:

[1] 587   39

There you have it, a nice clean dataframe of 587 observations and 39 features. Now depending on the modeling, you may have to scale this data or perform other transformations, but this data, in this format, makes all of that easier. Regardless of your prior knowledge or interest of one of the most important battles in history, and the bloodiest on American soil, you've developed a workable understanding of the Order of Battle, and the casualties at the regimental or battery level. Start treating your data, not next week or next month, but right now!

If you desire, you can learn more about the battle here:  https://www.battlefields.org/learn/civil-war/battles/gettysburg.

主站蜘蛛池模板: 庆城县| 云霄县| 河间市| 玉环县| 保靖县| 犍为县| 剑河县| 宁陵县| 景德镇市| 兴城市| 定兴县| 富平县| 增城市| 阳泉市| 得荣县| 金溪县| 五大连池市| 瓦房店市| 绥棱县| 吴忠市| 东阳市| 西充县| 疏附县| 浑源县| 绩溪县| 巴楚县| 临漳县| 成武县| 留坝县| 巴彦县| 汝城县| 星座| 长兴县| 隆化县| 宝丰县| 莱西市| 商城县| 南靖县| 衡水市| 吉林省| 彩票|