
Building more complex classifiers

In the previous section, we used a very simple model: a threshold on one of the dimensions. Throughout this book, you will see many other types of models, and we're not even going to cover everything that is out there.

What makes up a classification model? We can break it up into three parts:

  • The structure of the model: Here, it is a threshold on a single feature.
  • The search procedure: Here, we try every possible combination of feature and threshold.
  • The loss function: With the loss function, we decide which of the possibilities is least bad (we can rarely talk about the perfect solution). We can use the training error, or define the same goal the other way around and ask for the best accuracy. By convention, the loss function is something we want to minimize. The sketch after this list shows where each of the three parts appears in our simple model.
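
One way to see the three parts together is the following minimal sketch. The names features (a NumPy feature matrix) and labels (a boolean label vector) are hypothetical placeholders for whatever training data you have loaded; this is an illustration, not the only way to write it.

import numpy as np

def train_threshold_model(features, labels):
    # features: array of shape (n_samples, n_features); labels: boolean array of shape (n_samples,)
    best_accuracy = -1.0
    best_feature, best_threshold = None, None
    for fi in range(features.shape[1]):
        # Search procedure: try every observed value of every feature as a threshold
        for threshold in features[:, fi]:
            # Structure of the model: a threshold on a single feature
            predictions = features[:, fi] > threshold
            # Loss function: here expressed as training accuracy, which we want to maximize
            accuracy = np.mean(predictions == labels)
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_feature, best_threshold = fi, threshold
    return best_feature, best_threshold, best_accuracy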

We can play around with these parts to get different results. For example, we can attempt to build a threshold that achieves minimal training error, but we will only test three values for each feature: the mean value of the features, the mean plus one standard deviation, and the mean minus one standard deviation. This could make sense if testing each value was very costly in terms of computer time (or if we had millions and millions of datapoints). Then the exhaustive search we used would be infeasible, and we would have to perform an approximation like this.
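
As a sketch of this approximation (again using the hypothetical features and labels arrays from above), the search loop shrinks from every observed value to just three candidates per feature:

import numpy as np

def train_threshold_model_approx(features, labels):
    # Same structure and loss as before, but a much cheaper search procedure:
    # only three candidate thresholds per feature instead of every observed value.
    best_accuracy = -1.0
    best_feature, best_threshold = None, None
    for fi in range(features.shape[1]):
        mu = features[:, fi].mean()
        sigma = features[:, fi].std()
        for threshold in (mu - sigma, mu, mu + sigma):
            predictions = features[:, fi] > threshold
            accuracy = np.mean(predictions == labels)
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_feature, best_threshold = fi, threshold
    return best_feature, best_threshold, best_accuracy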

Alternatively, we might use a different loss function. It might be that one type of error is much more costly than another. In a medical setting, false negatives and false positives are not equivalent. A false negative (the test comes back negative even though the patient actually has the disease) might lead to the patient not receiving treatment for a serious condition. A false positive (the test comes back positive even though the patient does not have the disease) might lead to additional confirmatory tests or unnecessary treatment (which can still have costs, including side effects). Therefore, depending on the exact setting, different trade-offs make sense. At one extreme, if the disease is fatal and treatment is cheap with very few negative side effects, you want to minimize false negatives as much as you can. Spam filtering presents a similar asymmetry: incorrectly deleting a non-spam e-mail can be very harmful to the user, while letting a spam e-mail through is just a minor annoyance.

Tip

The right cost function always depends on the exact problem you are working on. When we present a general-purpose algorithm, we often focus on minimizing the number of mistakes (achieving the highest accuracy). However, if some mistakes are more costly than others, it might be better to accept a lower overall accuracy to minimize overall costs.
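
A loss function with unequal costs can be as simple as the sketch below. The cost values are made-up illustrative weights for a setting where a false negative is ten times worse than a false positive; predictions and labels are hypothetical boolean arrays as before.

import numpy as np

def weighted_loss(predictions, labels, fn_cost=10.0, fp_cost=1.0):
    # False negatives: predicted negative when the true label is positive
    false_negatives = np.sum(labels & ~predictions)
    # False positives: predicted positive when the true label is negative
    false_positives = np.sum(~labels & predictions)
    return fn_cost * false_negatives + fp_cost * false_positives

The same search procedure as before could then pick the feature and threshold that minimize this quantity instead of maximizing accuracy.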

Finally, we can also have other classification structures. A simple threshold rule is very limiting and will only work in the very simplest cases, such as with the Iris dataset.
