
Representing Data and Engineering Features

In the last chapter, we built our very first supervised learning models and applied them to some classic datasets, such as the Iris and Boston datasets. However, in the real world, data rarely comes in a neat <n_samples x n_features> feature matrix that is part of a pre-packaged database. Instead, it is our own responsibility to find a meaningful way to represent the data. The process of finding the best way to represent our data is known as feature engineering, and it is one of the main tasks of data scientists and machine learning practitioners trying to solve real-world problems.
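
To make the idea of an <n_samples x n_features> matrix concrete, here is a minimal sketch in plain Python. The records, the field names (`weight_g`, `color`), and the helper `to_feature_matrix` are hypothetical and chosen only for illustration; the sketch assumes the categorical field is expanded into one 0/1 indicator column per category, which is only one of several possible representations.

```python
# Hypothetical raw records: not yet a neat feature matrix.
records = [
    {"weight_g": 150.0, "color": "red"},
    {"weight_g": 120.0, "color": "green"},
    {"weight_g": 170.0, "color": "red"},
]


def to_feature_matrix(rows):
    """Return (matrix, column_names) with one row per sample."""
    # Sort the categories so the column layout is reproducible.
    colors = sorted({row["color"] for row in rows})
    names = ["weight_g"] + ["color=" + c for c in colors]
    matrix = []
    for row in rows:
        # The numeric feature is copied as-is; the categorical one
        # becomes a set of 0/1 indicator columns.
        indicators = [1.0 if row["color"] == c else 0.0 for c in colors]
        matrix.append([row["weight_g"]] + indicators)
    return matrix, names


X, feature_names = to_feature_matrix(records)
n_samples, n_features = len(X), len(X[0])
print(feature_names)
print(n_samples, n_features)
```

Each row of `X` is one sample and each column is one feature, so the list of lists has the same layout as the feature matrices used in the last chapter.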

I know you would rather jump right to the end and build the deepest neural network mankind has ever seen. But, trust me, this stuff is important! Representing our data in the right way can have a much greater influence on the performance of our supervised model than the exact parameters we choose. And we get to invent our own features, too.

In this chapter, we will therefore go over some common feature engineering tasks. Specifically, we want to answer the following questions:

  • What are some common preprocessing techniques that everyone uses but nobody talks about?
  • How do we represent categorical variables, such as the names of products, colors, or fruits?
  • How would we even go about representing text?
  • What is the best way to encode images, and what do SIFT and SURF stand for?

Let's start from the top.
