Preparing and Understanding Data

"We've got to use every piece of data and piece of information, and hopefully that will help us be accurate with our player evaluation. For us, that's our lifeblood."
– Billy Beane, General Manager, Oakland Athletics, subject of the book Moneyball

Research consistently shows that machine learning and data science practitioners spend most of their time manipulating data and preparing it for analysis. Indeed, many find it the most tedious and least enjoyable part of their work. Numerous companies offer solutions to the problem but, in my opinion, the results so far are mixed. Therefore, in this first chapter, I shall endeavor to provide a way of tackling the problem that will ease the burden of getting your data ready for machine learning. The methodology introduced in this chapter will serve as the foundation for data preparation and for understanding many of the subsequent chapters. I propose that once you become comfortable with this tried-and-true process, it may very well become your favorite part of machine learning, as it is for me.

The following are the topics that we'll cover in this chapter:

  • Overview 
  • Reading the data
  • Handling duplicate observations
  • Descriptive statistics
  • Exploring categorical variables
  • Handling missing values
  • Zero and near-zero variance features
  • Treating the data
  • Correlation and linearity
