Preparing and Understanding Data

"We've got to use every piece of data and piece of information, and hopefully that will help us be accurate with our player evaluation. For us, that's our lifeblood."
– Billy Beane, General Manager, Oakland Athletics, subject of the book Moneyball

Research consistently shows that machine learning and data science practitioners spend most of their time manipulating data and preparing it for analysis. Indeed, many find it the most tedious and least enjoyable part of their work. Numerous companies offer solutions to the problem but, in my opinion, results at this point are mixed. Therefore, in this first chapter, I shall endeavor to provide a way of tackling the problem that will ease the burden of getting your data ready for machine learning. The methodology introduced in this chapter will serve as the foundation for data preparation and for understanding many of the subsequent chapters. I propose that once you become comfortable with this tried-and-true process, it may very well become your favorite part of machine learning, as it is for me.

The following are the topics that we'll cover in this chapter:

  • Overview
  • Reading the data
  • Handling duplicate observations
  • Descriptive statistics
  • Exploring categorical variables
  • Handling missing values
  • Zero and near-zero variance features
  • Treating the data
  • Correlation and linearity
