
Summary

The main learning outcomes of this chapter are summarized as follows:

  • Importing a dataset using pandas: read_csv and its variations, reading a dataset using Python's open method, reading a file in chunks using the open method, reading directly from a URL, specifying the column names from a list, changing the delimiter of a dataset, and so on.
  • Basic exploratory analysis of data: observing a thumbnail of the data, its shape, column names, column types, and summary statistics for numerical variables.
  • Handling missing values: why missing values appear in a dataset, why it is important to treat them properly, how to treat them by deletion and imputation, and various methods of imputing data.
  • Creating dummy variables: creating dummy variables for categorical variables to be used in the predictive models.
  • Basic plotting: scatter plots, histograms, and boxplots; their meaning and relevance; and how they are plotted.
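
As a quick recap, the import, exploration, missing-value, and dummy-variable steps listed above can be sketched in a few lines of pandas. This is a minimal illustration, not the chapter's own code: the inline data, the semicolon delimiter, and the column names age, gender, and income are all hypothetical, chosen only to keep the example self-contained.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data with no header row and some gaps.
raw = "34;M;52000\n29;F;\n41;F;61000\n;M;48000\n"

# Import: change the delimiter and supply the column names from a list.
df = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    header=None,
    names=["age", "gender", "income"],
)

# Basic exploration: thumbnail, shape, column types, summary statistics.
print(df.head())
print(df.shape)
print(df.dtypes)
print(df.describe())

# Missing values, option 1: delete every row that has a missing entry.
dropped = df.dropna()

# Missing values, option 2: impute numerical columns with their mean.
imputed = df.fillna(
    {"age": df["age"].mean(), "income": df["income"].mean()}
)

# Dummy variables: one indicator column per category of gender.
dummies = pd.get_dummies(imputed["gender"], prefix="gender")
model_ready = pd.concat([imputed.drop(columns="gender"), dummies], axis=1)
print(model_ready)
```

Mean imputation is used here only because it is the simplest choice; whether deletion or some other imputation method is more appropriate depends on the dataset.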

This chapter is a first step in our journey to explore our data and wrangle it until it is fit for modelling. The next chapter goes deeper in this pursuit: we will learn to aggregate values for categorical variables, subset the dataset, merge two datasets, generate random numbers, and sample a dataset.

Cleaning, as we saw in the last chapter, takes about 80% of the modelling time, so it is of critical importance, and the methods we are learning will come in handy when doing it.
