- Applied Deep Learning with Python
- Alex Galea, Luis Capelo
- 2021-08-13 15:53:10
Preprocessing Data for Machine Learning
Data preprocessing has a huge impact on machine learning. As with the saying "you are what you eat," a model's performance directly reflects the data it is trained on. Many models require the continuous features to be transformed so that their values fall within comparable ranges. Similarly, categorical features should be encoded as numerical values. Although important, these steps are relatively simple and do not take very long.
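As a rough illustration of these two transformations, the sketch below rescales continuous columns to the range [0, 1] and one-hot encodes a categorical column using pandas. The column names and values are made up for the example, and min-max scaling is only one of several possible choices.

```python
import pandas as pd

# Hypothetical toy data: two continuous features on very different scales
# and one categorical feature.
df = pd.DataFrame({
    "age": [25, 40, 55],
    "income": [30000, 60000, 90000],
    "city": ["Paris", "Tokyo", "Paris"],
})

# Min-max scaling maps each continuous column onto the range [0, 1].
continuous = ["age", "income"]
col_min = df[continuous].min()
col_max = df[continuous].max()
scaled = (df[continuous] - col_min) / (col_max - col_min)

# One-hot encoding turns each category into its own 0/1 indicator column.
encoded = pd.get_dummies(df["city"], prefix="city", dtype=int)

# Combine the transformed columns into a single numeric feature table.
features = pd.concat([scaled, encoded], axis=1)
print(features)
```

After this step every column is numeric and the continuous features share the same range, so no single feature dominates simply because of its units.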
Another thing to consider is the size of the datasets that many data scientists work with. As dataset size increases, so does the prevalence of messy data, along with the difficulty of cleaning it.
Simply dropping samples with missing data is usually not the best option, because it's hard to justify throwing away samples where most of the fields have values. Doing so could discard valuable information, and that loss may hurt final model performance.
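To make the trade-off concrete, here is a minimal sketch comparing row dropping with simple imputation in pandas. The data is invented for the example, and filling with the median (numeric columns) or the most frequent value (categorical column) is an assumed strategy, not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which each incomplete row still has usable fields.
df = pd.DataFrame({
    "age": [25.0, np.nan, 55.0, 40.0],
    "income": [30000.0, 60000.0, np.nan, 90000.0],
    "city": ["Paris", "Tokyo", None, "Paris"],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: keep all rows and fill the gaps instead.
imputed = df.copy()
for col in ["age", "income"]:
    # Fill numeric gaps with the column median.
    imputed[col] = imputed[col].fillna(imputed[col].median())
# Fill categorical gaps with the most frequent value.
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(f"rows after dropping: {len(dropped)}, rows after imputing: {len(imputed)}")
```

Dropping halves this small table, while imputation keeps every sample at the cost of introducing estimated values.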
The steps involved in data preprocessing can be grouped as follows:
- Merging datasets on common fields to bring all data into a single table
- Feature engineering to improve the quality of data, for example, the use of dimensionality reduction techniques to build new features
- Cleaning the data by dealing with duplicate rows, incorrect or missing values, and other issues that arise
- Building the training datasets by standardizing or normalizing the required data and splitting it into training and testing sets
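The steps above can be sketched end to end with pandas. The tables, column names, and split fraction below are assumptions made for illustration, and the feature engineering step is omitted to keep the example short.

```python
import pandas as pd

# Hypothetical source tables that share the "customer_id" field.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],   # customer 4 appears twice
    "age": [25, 40, 55, 33, 33],
})
orders = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "total_spent": [120.0, 80.0, 200.0, 150.0],
})

# 1. Merge on the common field to get a single table.
merged = customers.merge(orders, on="customer_id", how="inner")

# 2. Clean: remove the duplicate row.
cleaned = merged.drop_duplicates().reset_index(drop=True)

# 3. Split into training and testing sets (25% held out here).
test = cleaned.sample(frac=0.25, random_state=0)
train = cleaned.drop(test.index)

# 4. Standardize using statistics from the training set only, so that
#    no information from the test set leaks into the transformation.
cols = ["age", "total_spent"]
mean = train[cols].mean()
std = train[cols].std()
train_scaled = (train[cols] - mean) / std
test_scaled = (test[cols] - mean) / std

print(train_scaled)
print(test_scaled)
```

Splitting before standardizing is a deliberate choice in this sketch. In practice, scikit-learn's `train_test_split` and `StandardScaler` cover steps 3 and 4.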
Let's explore some of the tools and methods for doing the preprocessing.