官术网_书友最值得收藏!

Extracting features to predict outcomes

That available data needs to be accessible and meaningful in order for the algorithm to extract information.

Let's consider a simple example. Imagine that we want to predict the market price of a house in a given city. We can think of many variables that would be predictors of the price of a house: the number of rooms or bathrooms, the neighborhood, the surface, the heating system, and so on. These variables are called features, attributes, or predictors. The value that we want to predict is called the outcome or the target.

If we want our predictions to be reliable, we need several features. Predicting the price of a house based on its surface alone would not be very efficient. Many other factors influence the price of a house and our dataset should include as many as possible (with conditions).

It's often possible to add large numbers of attributes to a model to try to improve the predictions. For instance, in our housing pricing prediction, we could add all the characteristics of the house (bathroom, superficies, heating system, the number of windows). Some of these variables would bring more information to our pricing model and increase the accuracy of our predictions, while others would just add noise and confuse the algorithm. Adding new variables to a predicting model does not always improve the predictions.

In order to make reliable predictions, each of the new features you bring to your model must bring some valuable piece of information. However, this is also not always the case. As we will see in Chapter 2, Machine Learning Definitions and Concepts, correlated predictors can hurt the performances of the model.

Predictive analytics is built on several assumptions and conditions:

  • The value you are trying to predict is predictable and not just some random noise.
  • You have access to data that has some degree of association to the target.
  • The available dataset is large enough. Reliable predictions cannot be inferred from a dataset that is too small. (For instance, you can define and therefore predict a line with two points but you cannot infer data that follows a sine curve from only two points.)
  • The new data you will base future predictions on is similar to the one you parameterized and trained your model on.

You may have a great dataset, but that does not mean it will be efficient for predictions.

These conditions on the data are very general. In the case of SGD, the conditions are more constrained.

主站蜘蛛池模板: 黎平县| 仪陇县| 泊头市| 翼城县| 茶陵县| 奎屯市| 湖南省| 石泉县| 乌审旗| 屏东市| 古田县| 吉水县| 屯昌县| 大理市| 晋城| 八宿县| 河东区| 望江县| 浏阳市| 牡丹江市| 香格里拉县| 涟水县| 正镶白旗| 铜鼓县| 武穴市| 新沂市| 吴旗县| 凯里市| 南昌市| 浪卡子县| 临清市| 偏关县| 定襄县| 丹棱县| 天峻县| 临朐县| 崇文区| 思南县| 河间市| 旬阳县| 根河市|