Learning and classification

When we want to automatically identify which category (a categorical value) a specific value belongs to, we need an algorithm that predicts the most likely category for that value based on previously seen data. This is called classification. In the words of Tom Mitchell:

"How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?"

The keyword here is learning (supervised learning in this case): how to train an algorithm to identify categorical elements. Common examples are spam classification, speech recognition, search engines, computer vision, and language detection, but a classifier has many other applications.

There are two kinds of classification problems. In binary classification there are only two categories (spam or not spam). In multiclass classification there are more than two (for example, opinions can be positive, neutral, negative, and so on).

Several algorithms are available for classification. The most frequently used are support vector machines, neural networks, decision trees, Naïve Bayes, and hidden Markov models. In this chapter, we will implement a probabilistic classifier using the Naïve Bayes algorithm, and in the following chapters we will implement several other classification algorithms for a variety of problems.

The general steps involved in supervised classification are shown in the following figure. First, we collect training data (previously classified), then we perform feature extraction (selecting the features relevant to the categorization). Next, we train the algorithm with the feature vectors. Once we have a trained classifier, we can pass in new strings, extract their features, and send them to the classifier. Finally, the classifier gives us the most likely class (category) for each new string.
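The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation developed later in the chapter: the training sentences, the `spam`/`ham` labels, and the function names `extract_features`, `train`, and `classify` are all hypothetical, and the classifier is a simple word-count Naïve Bayes with add-one smoothing.

```python
import math
from collections import Counter, defaultdict

# Step 1: training data, already classified by hand (hypothetical examples).
TRAINING_DATA = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

def extract_features(text):
    """Step 2: feature extraction. Here the features are lowercase words."""
    return text.lower().split()

def train(labelled_texts):
    """Step 3: training. Count documents per class and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_texts:
        class_counts[label] += 1
        for word in extract_features(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def classify(text, model):
    """Steps 4 and 5: extract features of a new string and return
    the class with the highest (log) probability."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in extract_features(text):
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING_DATA)
print(classify("cheap money", model))
print(classify("meeting tomorrow", model))
```

Working in log-probabilities avoids numeric underflow when many small probabilities are multiplied together, which is why the sketch sums logarithms instead of multiplying raw probabilities.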

Additionally, we will measure the accuracy of the classifier using a hand-classified test set. For this reason, we will split the data into two sets: the training data and the test data.
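As a rough sketch of that idea, the following snippet splits a labelled data set and computes accuracy as the fraction of test items classified correctly. The helper names `split_data` and `accuracy`, the 25 percent test fraction, the sample data, and the keyword rule standing in for a trained classifier are all assumptions made for illustration.

```python
import random

# Hypothetical hand-classified data: (text, label) pairs.
LABELLED_DATA = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("free money today", "spam"),
    ("claim your prize", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
    ("project status report", "ham"),
    ("see you at home", "ham"),
]

def split_data(labelled, test_fraction=0.25, seed=0):
    """Shuffle the data and return (training_set, test_set)."""
    items = list(labelled)
    random.Random(seed).shuffle(items)  # seeded for a repeatable split
    n_test = max(1, int(len(items) * test_fraction))
    return items[n_test:], items[:n_test]

def accuracy(classifier, test_set):
    """Fraction of test items whose predicted label matches the hand label."""
    correct = sum(1 for text, label in test_set if classifier(text) == label)
    return correct / len(test_set)

def keyword_rule(text):
    """Stand-in for a trained classifier: flags any text containing 'money'."""
    return "spam" if "money" in text.split() else "ham"

training_set, test_set = split_data(LABELLED_DATA)
print(len(training_set), len(test_set))
print(accuracy(keyword_rule, test_set))
```

Keeping the test set separate from the training set matters because accuracy measured on data the classifier has already seen tends to overstate how well it will handle new input.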
