
Preparing and Understanding Data

"We've got to use every piece of data and piece of information, and hopefully that will help us be accurate with our player evaluation. For us, that's our lifeblood."
– Billy Beane, General Manager, Oakland Athletics, subject of the book Moneyball

Research consistently shows that machine learning and data science practitioners spend most of their time manipulating data and preparing it for analysis. Indeed, many find it the most tedious and least enjoyable part of their work. Numerous companies offer solutions to the problem but, in my opinion, the results so far are mixed. Therefore, in this first chapter, I shall endeavor to provide a way of tackling the problem that will ease the burden of getting your data ready for machine learning. The methodology introduced in this chapter will serve as the foundation for data preparation and for understanding many of the subsequent chapters. I propose that once you become comfortable with this tried-and-true process, it may very well become your favorite part of machine learning, as it is for me.

The following are the topics that we'll cover in this chapter:

  • Overview 
  • Reading the data
  • Handling duplicate observations
  • Descriptive statistics
  • Exploring categorical variables
  • Handling missing values
  • Zero and near-zero variance features
  • Treating the data
  • Correlation and linearity
