
Summary

The main learning outcomes of this chapter are summarized as follows:

  • Importing a dataset using pandas: read_csv and its variations, reading a dataset with Python's open function, reading a file in chunks with open, reading directly from a URL, specifying the column names from a list, changing the delimiter of a dataset, and so on.
  • Basic exploratory analysis of data: viewing a snapshot of the data, its shape, column names, column types, and summary statistics for numerical variables.
  • Handling missing values: why missing values arise, why it is important to treat them properly, how to treat them by deletion or imputation, and the various methods of imputing data.
  • Creating dummy variables: encoding categorical variables as dummy variables so that they can be used in predictive models.
  • Basic plotting: scatter plots, histograms, and boxplots; their meaning and relevance; and how they are plotted.
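
As a quick recap, the import, exploration, missing-value, and dummy-variable steps listed above can be sketched in a few lines of pandas. This is only an illustrative sketch, not code from the chapter: the toy data, the column names (name, age, city), and the choice of mean imputation and an "Unknown" placeholder are all assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for a real CSV file (assumption for this sketch).
raw = io.StringIO(
    "name;age;city\n"
    "Ann;34;Paris\n"
    "Bob;;London\n"
    "Cara;29;Paris\n"
    "Dan;41;\n"
)

# Import: read_csv with a non-default delimiter.
df = pd.read_csv(raw, sep=";")

# Basic exploration: first rows, shape, column types, summary statistics.
print(df.head())
print(df.shape)
print(df.dtypes)
print(df.describe())

# Missing values, option 1: delete every row that has a missing value.
dropped = df.dropna()

# Missing values, option 2: impute instead of deleting.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())  # mean imputation
imputed["city"] = imputed["city"].fillna("Unknown")  # placeholder category

# Dummy variables: one indicator column per category of "city".
dummies = pd.get_dummies(imputed["city"], prefix="city")
model_ready = pd.concat([imputed.drop(columns="city"), dummies], axis=1)
print(model_ready)
```

Deletion is the simpler option but discards two of the four rows here, whereas imputation keeps every row at the cost of introducing estimated values.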

This chapter is the first step in our journey to explore our data and wrangle it into a form that is fit for modelling. The next chapter goes deeper in this pursuit: we will learn to aggregate values for categorical variables, subset the dataset, merge two datasets, generate random numbers, and sample a dataset.

Cleaning, as we saw in the last chapter, takes about 80% of the modelling time, so it is of critical importance, and the methods we are learning will come in handy for that task.
