
Preparing and Understanding Data

"We've got to use every piece of data and piece of information, and hopefully that will help us be accurate with our player evaluation. For us, that's our lifeblood."
– Billy Beane, General Manager, Oakland Athletics, subject of the book Moneyball

Research consistently shows that machine learning and data science practitioners spend most of their time manipulating data and preparing it for analysis. Indeed, many find it the most tedious and least enjoyable part of their work. Numerous companies offer solutions to the problem but, in my opinion, the results so far are mixed. Therefore, in this first chapter, I shall endeavor to provide a way of tackling the problem that eases the burden of getting your data ready for machine learning. The methodology introduced in this chapter will serve as the foundation for data preparation and for understanding many of the subsequent chapters. I propose that once you become comfortable with this tried-and-true process, it may very well become your favorite part of machine learning, as it is for me.

The following are the topics that we'll cover in this chapter:

  • Overview 
  • Reading the data
  • Handling duplicate observations
  • Descriptive statistics
  • Exploring categorical variables
  • Handling missing values
  • Zero and near-zero variance features
  • Treating the data
  • Correlation and linearity
