- Advanced Machine Learning with R
- Cory Lesmeister Dr. Sunil Kumar Chinnamgari
- 2021-06-24 14:24:32
Reading the data
This first task will install the necessary packages, load the data, and show how to get a high-level understanding of its structure and dimensions.
You have two ways to access the data, which resides on GitHub. You can download gettysburg.csv directly from the site at this link: https://github.com/PacktPublishing/Advanced-Machine-Learning-with-R/blob/master/Data/gettysburg.csv, or you can use the RCurl package. An example of how to use the package is available here: https://github.com/opetchey/RREEBES/wiki/Reading-data-and-code-from-an-online-github-repository.
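If you would rather read the file straight from GitHub, one option is to convert the repository page link into its raw-content equivalent and pass that to a CSV reader. The following is only a sketch: it assumes GitHub's usual raw.githubusercontent.com URL layout, the helper name github_blob_to_raw() is made up for illustration, and the download line is left commented out because it needs a network connection.

```r
# Hypothetical helper: turn a GitHub "blob" page URL into a raw-file URL.
# Assumes the layout https://github.com/<user>/<repo>/blob/<branch>/<path>.
github_blob_to_raw <- function(url) {
  url <- sub("^https://github\\.com/", "https://raw.githubusercontent.com/", url)
  sub("/blob/", "/", url, fixed = TRUE)
}

blob_url <- "https://github.com/PacktPublishing/Advanced-Machine-Learning-with-R/blob/master/Data/gettysburg.csv"
raw_url <- github_blob_to_raw(blob_url)
print(raw_url)

# With a network connection, the raw URL can be read directly (not run here):
# gettysburg <- readr::read_csv(raw_url)
```

Saving the file locally, as the rest of this section assumes, avoids depending on the network every time the script runs.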
Assuming you have the file in your working directory, let's begin by installing the necessary packages:
install.packages("caret")
install.packages("janitor")
install.packages("readr")
install.packages("sjmisc")
install.packages("skimr")
install.packages("tidyverse")
install.packages("vtreat")
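If some of these packages are already on your machine, you may prefer to install only the ones that are missing. Here's a minimal sketch of that idea; the helper missing_packages() is a hypothetical name, not something from the packages themselves, and the install call is commented out so the snippet has no side effects.

```r
# Packages used in this section (the same seven named in the calls above).
required <- c("caret", "janitor", "readr", "sjmisc", "skimr", "tidyverse", "vtreat")

# Hypothetical helper: which of `required` are absent from `installed`?
missing_packages <- function(required,
                             installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

to_install <- missing_packages(required)
print(to_install)

# Install only what is missing (uncomment to actually install):
# if (length(to_install) > 0) install.packages(to_install)
```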
A quick note on a coding habit I learned the hard way. With the packages installed, we could now load each library into the R environment with library(). However, it's a best practice, and a necessity when putting code into production, to prefix any function that isn't in base R with its package name, as in package::function(). First, this helps you and the unfortunate others who read your code understand which library each function comes from. It also eliminates potential errors, because different packages sometimes give different functions the same name. The example that comes to my mind is the tsoutliers() function. The function is available in the forecast package and was also in the tsoutliers package in earlier versions. I know this extra typing might seem unwieldy and unnecessary, but once you discipline yourself to do it, you'll find that it's well worth the effort.
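To see why the explicit package prefix matters, here's a small sketch that simulates a name clash using only what ships with R. It assumes nothing beyond the stats package; the masking sd() function is made up purely to stand in for a second package that reuses the name.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Simulate a name clash: a user-defined sd() now masks stats::sd().
sd <- function(x) "not a standard deviation"

masked_result   <- sd(x)         # resolves to the masking definition
explicit_result <- stats::sd(x)  # package::function() always reaches stats

print(masked_result)
print(explicit_result)

rm(sd)  # remove the masking definition again
```

The unqualified call silently returns whatever definition R finds first, while the qualified call is unambiguous no matter what else has been loaded or defined.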
There's one library we will load, and that's magrittr, which provides the pipe operator, %>%, for chaining code together:
library(magrittr)
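If the pipe is new to you, this short sketch compares a nested call with its piped equivalent. It assumes magrittr is installed; the values vector is made up for illustration.

```r
library(magrittr)

values <- c(1, 4, 9)

# Nested call: read from the inside out.
nested <- sum(sqrt(values))

# Piped call: read from left to right.
piped <- values %>% sqrt() %>% sum()

print(identical(nested, piped))
```

Both forms do the same work; the pipe simply passes the left-hand result in as the first argument of the next function.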
We're now ready to load the .csv file. To do so, let's use the read_csv() function from readr, as it's faster than base R's read.csv() and creates a tibble data frame. In most cases, tibbles in a tidyverse style are easier to write and understand. If you want to learn all the benefits of tidyverse, check out its website: tidyverse.org.
The only thing we need to specify in the function is the path to our file (change ~/ to match wherever you saved it):
gettysburg <- readr::read_csv("~/gettysburg.csv")
Here's a look at the column (feature) names:
colnames(gettysburg)
[1] "type" "state" "regiment_or_battery" "brigade"
[5] "division" "corps" "army" "july1_Commander"
[9] "Cdr_casualty" "men" "killed" "wounded"
[13] "captured" "missing" "total_casualties" "3inch_rifles"
[17] "4.5inch_rifles" "10lb_parrots" "12lb_howitzers" "12lb_napoleons"
[21] "6lb_howitzers" "24lb_howitzers" "20lb_parrots" "12lb_whitworths"
[25] "14lb_rifles" "total_guns"
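Several of these names, such as 3inch_rifles and 4.5inch_rifles, start with a digit, so R doesn't treat them as ordinary syntactic names; you need backticks or quoted indexing to refer to them. The sketch below uses a tiny made-up data frame (the values are illustrative only and aren't taken from gettysburg.csv) to show both ways of reaching such a column.

```r
# Toy data frame with made-up values; check.names = FALSE keeps the
# digit-leading column names exactly as written.
toy <- data.frame(
  "3inch_rifles"   = c(0, 4, 6),
  "12lb_napoleons" = c(6, 0, 2),
  total_guns       = c(6, 4, 8),
  check.names = FALSE
)

rifles_backtick <- toy$`3inch_rifles`     # backticks with $
rifles_bracket  <- toy[["3inch_rifles"]]  # quoted name with [[ ]]

print(colnames(toy))
print(identical(rifles_backtick, rifles_bracket))
```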
We have 26 features in this data, and some of you are probably asking yourselves things like, what the heck is a 20 pound parrot? If you put it in a search engine, you'll probably end up with the bird and not the 20-pound Parrott rifled artillery gun. You can see the dimensions of the data in RStudio in your Global Environment view, or you can check for yourself with dim() to see that there are 590 observations:
dim(gettysburg)
[1] 590 26
So we have 590 observations of 26 features, but this data suffers from the issues that permeate large and complex data. Next, we'll explore whether there are any duplicate observations and how to deal with them efficiently.