
Summary

Feature selection is the first (and sometimes the most important) step in a machine learning pipeline. Not all features are useful for our purposes, and some of them are expressed using different notations, so it's often necessary to preprocess the dataset before any further operation.

We saw how to split the data into training and test sets using a random shuffle, and how to manage missing elements. Another very important section covered the techniques used to manage categorical data and labels, which are very common when a feature can assume only a discrete set of values.
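The following is a minimal sketch of these preprocessing steps using scikit-learn (the library adopted throughout the book) on a toy NumPy array; the exact class names and module paths (for example, SimpleImputer) depend on the scikit-learn version installed:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy numeric dataset with a missing value (np.nan) and categorical labels
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 7.0]])
y = np.array(['red', 'green', 'red', 'blue'])

# Random shuffle and split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Replace missing elements with the per-column mean learned on the training set
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Encode categorical labels as integers...
le = LabelEncoder()
y_train_int = le.fit_transform(y_train)

# ...or as one-hot vectors when no ordering should be implied
ohe = OneHotEncoder()
y_train_onehot = ohe.fit_transform(y_train.reshape(-1, 1)).toarray()
```

Note that the imputer and the encoders are fitted only on the training data and then applied to the test set, so no information leaks from the test set into the preprocessing step.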

Then we analyzed the problem of dimensionality. Some datasets contain many features that are correlated with each other, so they don't provide any new information but increase the computational complexity and reduce the overall performance. Principal Component Analysis is a method for projecting the data onto the subset of components that retains the largest amount of total variance. This approach, together with its variants, makes it possible to decorrelate the features and reduce the dimensionality without a drastic loss in terms of accuracy. Dictionary learning is another technique used to extract a limited number of building blocks from a dataset, together with the information needed to rebuild each sample. This approach is particularly useful when the dataset is made up of different versions of similar elements (such as images, letters, or digits).
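As a short recap, here is a sketch of both approaches using scikit-learn's PCA and DictionaryLearning classes on the bundled digits dataset; the number of atoms (16) and the 95% variance threshold are arbitrary illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, DictionaryLearning

digits = load_digits()
X = digits.data  # shape (1797, 64): flattened 8x8 grayscale digit images

# Keep only the principal components that explain ~95% of the total variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print(X_pca.shape, pca.explained_variance_ratio_.sum())

# Learn a small dictionary of atoms; each sample is rebuilt as a sparse
# combination of these atoms (the code returned by fit_transform)
dl = DictionaryLearning(n_components=16, transform_algorithm='lasso_lars',
                        random_state=1)
X_code = dl.fit_transform(X[:200])   # subset: dictionary learning is slow
print(dl.components_.shape)          # (16, 64): the learned atoms
```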

In the next chapter, we're going to discuss linear regression, which is the most widespread and simplest supervised approach for predicting continuous values. We'll also analyze how to overcome some of its limitations and how to solve non-linear problems using the same algorithms.
