
Introduction

Before data can be used to answer critical business questions, it must be prepared. Data is usually archived in files, which can be opened easily with Excel or a text editor. However, data can also reside in a range of other sources, such as databases, websites, and various file formats, so being able to import data from these sources is crucial.

There are four main types of data. Data recorded in text format is the simplest; when users need to store it in a structured form, files with a .tab or .csv extension arrange the data in a fixed number of columns. Excel has played a leading role in data processing for many years, and it uses the .xls and .xlsx formats. Knowing how to read and manipulate data from databases is another crucial skill. Finally, because most data is not stored in a database, one must also know how to use web scraping to obtain data from the Internet. As part of this chapter, we introduce how to scrape data from the Internet using the rvest package.

Many experienced developers have already created packages that make it easier for beginners to obtain data, and we focus on using these packages to perform data extraction, transformation, and loading. In this chapter, we first learn how to use R packages to read data in text format and to scan files line by line. We then move on to reading structured data from databases and Excel. Finally, we learn how to scrape data from the Internet and from social networks using R.
