Preprocessing Data for Machine Learning

Data preprocessing has a huge impact on machine learning. As with the saying "you are what you eat," a model's performance directly reflects the data it is trained on. Many models require the continuous feature values to be transformed so that they fall within comparable ranges. Similarly, categorical features should be encoded as numerical values. Although important, these steps are relatively simple and do not take very long.
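
As a rough illustration of these two transformations, the sketch below rescales continuous columns to a common range and one-hot encodes a categorical column using pandas. The DataFrame, its column names, and the choice of min-max scaling are assumptions made for the example; other scalings and encodings are equally valid.

```python
import pandas as pd

# Hypothetical toy data: two continuous features on very different scales
# and one categorical feature.
df = pd.DataFrame({
    "age": [22, 35, 58, 41],
    "income": [28000, 52000, 120000, 75000],
    "city": ["Lima", "Oslo", "Lima", "Pune"],
})

numeric_cols = ["age", "income"]

# Min-max scaling: map each continuous column onto the range [0, 1].
col_min = df[numeric_cols].min()
col_max = df[numeric_cols].max()
scaled_numeric = (df[numeric_cols] - col_min) / (col_max - col_min)

# One-hot encoding: one 0/1 indicator column per category.
city_dummies = pd.get_dummies(df["city"], prefix="city", dtype=int)

# Combine the transformed pieces into a single feature table.
features = pd.concat([scaled_numeric, city_dummies], axis=1)
print(features)
```

After this step, `age` and `income` both lie between 0 and 1, so neither dominates purely because of its units.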

The part of preprocessing that usually takes the longest is cleaning up messy data. Just take a look at this pie chart showing how data scientists in a particular survey spent most of their time.

Another thing to consider is the size of the datasets that many data scientists work with. As dataset size increases, the prevalence of messy data increases as well, along with the difficulty of cleaning it.

Simply dropping samples with missing data is usually not the best option, because it is hard to justify throwing away samples where most of the fields have values. Doing so could discard valuable information, which may hurt final model performance.
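
To make the trade-off concrete, here is a minimal sketch comparing the two approaches on a hypothetical table in which most rows are missing just one field. Filling gaps with the column median is only one possible strategy, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three of the four rows are missing a single value.
df = pd.DataFrame({
    "age": [22.0, np.nan, 58.0, 41.0],
    "income": [28000.0, 52000.0, np.nan, 75000.0],
    "tenure": [1.0, 4.0, 12.0, np.nan],
})

# Option 1: drop every row that has any missing value.
dropped = df.dropna()

# Option 2: keep all rows and fill each gap with that column's median.
imputed = df.fillna(df.median())

print(f"Rows after dropping: {len(dropped)} of {len(df)}")
print(f"Rows after imputing: {len(imputed)} of {len(df)}")
```

Dropping keeps only the single fully populated row, while imputing keeps all four, at the cost of introducing estimated values.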

The steps involved in data preprocessing can be grouped as follows:

  • Merging datasets on common fields to bring all data into a single table
  • Feature engineering to improve the quality of the data, for example, using dimensionality reduction techniques to build new features
  • Cleaning the data by dealing with duplicate rows, incorrect or missing values, and other issues that arise
  • Building the training datasets by standardizing or normalizing the required data and splitting it into training and testing sets
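
The steps above can be sketched end to end with pandas. Everything in this example is an assumption made for illustration: the two tables, the `customer_id` key, the median fill, and the 80/20 split. The derived ratio feature is a simple stand-in for feature engineering rather than a dimensionality reduction technique, and cleaning is done before feature engineering so that new features are built from cleaned values.

```python
import numpy as np
import pandas as pd

# Hypothetical tables sharing a "customer_id" key.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 5],          # customer 5 is duplicated
    "age": [22.0, 35.0, np.nan, 41.0, 29.0, 29.0],
})
orders = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "n_orders": [3, 10, 4, 8, 5],
    "total_spent": [120.0, 560.0, 90.0, 410.0, 230.0],
})

# 1. Merge on the common field to bring all data into a single table.
table = customers.merge(orders, on="customer_id", how="inner")

# 2. Clean: drop duplicate rows, then fill the missing age with the median.
table = table.drop_duplicates().reset_index(drop=True)
table["age"] = table["age"].fillna(table["age"].median())

# 3. Feature engineering: derive a new feature from existing columns.
table["spend_per_order"] = table["total_spent"] / table["n_orders"]

# 4. Split into training and testing sets, then standardize using
#    statistics from the training set only, so that no information
#    from the test set leaks into the transformation.
feature_cols = ["age", "n_orders", "total_spent", "spend_per_order"]
train = table.sample(frac=0.8, random_state=0)
test = table.drop(train.index)

mean = train[feature_cols].mean()
std = train[feature_cols].std(ddof=0)
train_X = (train[feature_cols] - mean) / std
test_X = (test[feature_cols] - mean) / std

print(train_X.shape, test_X.shape)
```

In practice, libraries such as scikit-learn provide ready-made helpers for the splitting and scaling steps; the manual version here just makes each operation visible.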

Let's explore some of the tools and methods for doing the preprocessing.