
Extracting features to predict outcomes

The available data needs to be accessible and meaningful for the algorithm to extract information from it.

Let's consider a simple example. Imagine that we want to predict the market price of a house in a given city. We can think of many variables that would be predictors of the price of a house: the number of rooms or bathrooms, the neighborhood, the surface area, the heating system, and so on. These variables are called features, attributes, or predictors. The value that we want to predict is called the outcome or the target.
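
To make the vocabulary concrete, here is a minimal sketch of how such a dataset might be split into features and a target. The listings, the column names (such as surface_m2), and the prices are all invented for illustration.

```python
# Hypothetical house listings; every value here is made up for illustration.
houses = [
    {"rooms": 3, "bathrooms": 1, "neighborhood": "center",
     "surface_m2": 70, "heating": "gas", "price": 250_000},
    {"rooms": 5, "bathrooms": 2, "neighborhood": "suburb",
     "surface_m2": 120, "heating": "electric", "price": 310_000},
    {"rooms": 2, "bathrooms": 1, "neighborhood": "center",
     "surface_m2": 45, "heating": "gas", "price": 180_000},
]

TARGET = "price"  # the outcome we want to predict

# Features (attributes, predictors): every column except the target.
X = [{name: value for name, value in house.items() if name != TARGET}
     for house in houses]

# Target (outcome): one value per house.
y = [house[TARGET] for house in houses]

print(sorted(X[0]))  # feature names of the first house
print(y)             # target values
```

Each row of X describes one house, and the matching entry of y is the price a model would be trained to predict.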

If we want our predictions to be reliable, we need several features. Predicting the price of a house from its surface area alone would not be very accurate. Many other factors influence the price of a house, and our dataset should include as many of them as possible (subject to certain conditions).

It's often possible to add large numbers of attributes to a model to try to improve the predictions. For instance, in our house price prediction, we could add all the characteristics of the house (number of bathrooms, surface area, heating system, number of windows). Some of these variables would bring more information to our pricing model and increase the accuracy of our predictions, while others would just add noise and confuse the algorithm. Adding new variables to a predictive model does not always improve the predictions.
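
The difference between an informative variable and a noisy one can be sketched on synthetic data. In the example below, the price is constructed to depend on the surface area only, and the number of windows is random by design. That is an assumption of this toy dataset, not a claim about real houses. A one-feature least-squares line is fitted on each variable in turn and evaluated on held-out rows.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    return a, mean_y - a * mean_x


def mean_abs_error(xs, ys, a, b):
    """Average absolute gap between predicted and actual values."""
    return sum(abs(a * x + b - y) for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (assumption): price is driven by surface area plus noise,
# while the window count is drawn independently of the price.
n = 400
surface = [random.uniform(30, 200) for _ in range(n)]
windows = [random.randint(2, 20) for _ in range(n)]
price = [3000 * s + random.gauss(0, 20_000) for s in surface]

train, test = slice(0, 300), slice(300, n)

errors = {}
for name, feature in {"surface": surface, "windows": windows}.items():
    a, b = fit_line(feature[train], price[train])
    errors[name] = mean_abs_error(feature[test], price[test], a, b)

print(errors)  # held-out error per candidate feature
```

On this data, the line fitted on the surface area should have a much smaller held-out error than the one fitted on the window count, which carries no information about the price.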

To make reliable predictions, each new feature you bring to your model must contribute some valuable piece of information. However, even that is not always enough. As we will see in Chapter 2, Machine Learning Definitions and Concepts, correlated predictors can hurt the performance of the model.
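
One simple way to spot correlated predictors is to compute the Pearson correlation for each pair of numeric features and flag the pairs close to 1 or -1. The sketch below uses invented values, a hypothetical surface_sqft column that merely restates surface_m2 in other units, and an arbitrary threshold of 0.95.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Invented values for six houses.
surface_m2 = [45, 70, 82, 110, 150, 95]
features = {
    "surface_m2": surface_m2,
    # Same information as surface_m2, expressed in square feet.
    "surface_sqft": [s * 10.764 for s in surface_m2],
    "bathrooms": [1, 2, 1, 1, 2, 3],
}

THRESHOLD = 0.95  # arbitrary cut-off chosen for this illustration

redundant = [
    (a, b)
    for a, b in combinations(features, 2)
    if abs(pearson(features[a], features[b])) > THRESHOLD
]

print(redundant)
```

Here only the two surface columns are flagged, since one is a rescaled copy of the other and adds nothing the model does not already have.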

Predictive analytics is built on several assumptions and conditions:

  • The value you are trying to predict is predictable and not just some random noise.
  • You have access to data that has some degree of association to the target.
  • The available dataset is large enough. Reliable predictions cannot be inferred from a dataset that is too small. (For instance, you can define, and therefore predict, a line from two points, but you cannot infer that data follows a sine curve from only two points.)
  • The new data you will base future predictions on is similar to the data you parameterized and trained your model on.
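
The dataset-size condition can be illustrated with the line and sine curve mentioned above. The following sketch draws a line through two sampled points and checks it against the true function at a new input. The functions and sample points are chosen for illustration only.

```python
import math


def line_through(p, q):
    """Return the function of the straight line passing through p and q."""
    (x0, y0), (x1, y1) = p, q
    slope = (y1 - y0) / (x1 - x0)
    return lambda x: y0 + slope * (x - x0)


def linear(x):
    """A truly linear process: two samples are enough to recover it."""
    return 2 * x + 1


# Case 1: two points sampled from a line.
model = line_through((0, linear(0)), (3, linear(3)))
linear_error = abs(model(10) - linear(10))

# Case 2: two points sampled from a sine curve, at x = 0 and x = pi.
model = line_through((0.0, math.sin(0.0)), (math.pi, math.sin(math.pi)))
sine_error = abs(model(math.pi / 2) - math.sin(math.pi / 2))

print(f"error on linear data: {linear_error:.3f}")
print(f"error on sine data:   {sine_error:.3f}")
```

The two points taken from the sine curve both sit near zero, so the fitted line is almost flat and misses the peak between them entirely, whereas the two points taken from the line recover it exactly.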

You may have a great dataset, but that does not mean it will be effective for making predictions.

These conditions on the data are very general. In the case of stochastic gradient descent (SGD), the conditions are more restrictive.
