Learning and classification

When we want to automatically identify the category to which a specific value belongs (a categorical value), we need an algorithm that predicts the most likely category for that value, based on previously seen data. This is called classification. In the words of Tom Mitchell:

"How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?"

The keyword here is learning (supervised learning in this case), and also how to train an algorithm to identify categorical elements. Common examples are spam classification, speech recognition, search engines, computer vision, and language detection, but there are many other applications for a classifier. We can find two kinds of problems in classification. Binary classification is where we have only two categories (spam or not spam), and multiclass classification is where more than two categories are involved (for example, opinions can be positive, neutral, negative, and so on). There are several algorithms for classification; the most frequently used are support vector machines, neural networks, decision trees, Naïve Bayes, and hidden Markov models. In this chapter, we will implement a probabilistic classifier using the Naïve Bayes algorithm, and in the following chapters we will implement several other classification algorithms for a variety of problems.

The general steps involved in supervised classification are shown in the following figure. First, we collect training data (previously classified); then we perform feature extraction (selecting the features relevant to the categorization). Next, we train the algorithm with the feature vectors. Once we have a trained classifier, we can take new strings, extract their features, and send them to the classifier. Finally, the classifier gives us the most likely class (category) for each new string.
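The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation developed in this chapter: the toy training strings, the labels `spam` and `ham`, and the function names `extract_features`, `train`, and `classify` are all assumptions made for the example, and the scoring uses a simple Naïve Bayes word-count model with add-one smoothing.

```python
import math
from collections import Counter, defaultdict

# Step 1: previously classified training data (toy examples invented for illustration).
TRAINING_DATA = [
    ("win cash prize now", "spam"),
    ("cheap prize offer win", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch meeting on monday", "ham"),
]

def extract_features(text):
    """Step 2: turn a string into a list of lowercase word features."""
    return text.lower().split()

def train(labeled_texts):
    """Step 3: count word occurrences per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_texts:
        class_counts[label] += 1
        word_counts[label].update(extract_features(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary

def classify(text, model):
    """Steps 4 and 5: extract features from a new string and return the most likely class."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: fraction of training documents with this label.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in extract_features(text):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAINING_DATA)
print(classify("win a cheap prize", model))
print(classify("agenda for lunch meeting", model))
```

Working with log probabilities rather than raw products avoids numerical underflow when a string contains many words.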

Additionally, we will test the accuracy of the classifier by using a hand-classified test set. To do this, we will split the data into two sets: the training data and the test data.
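A minimal sketch of the split and the accuracy measurement follows. The helper names `split_data` and `accuracy`, the 25 percent test fraction, the fixed seed, and the stand-in `keyword_classifier` are assumptions for illustration only; any trained classifier that maps a string to a label could be passed in instead.

```python
import random

def split_data(labeled_items, test_fraction=0.25, seed=0):
    """Shuffle the labeled data and split it into (training set, test set)."""
    items = list(labeled_items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

def accuracy(classifier, test_set):
    """Fraction of test items whose predicted label matches the hand-assigned label."""
    if not test_set:
        raise ValueError("test set is empty")
    correct = sum(1 for item, label in test_set if classifier(item) == label)
    return correct / len(test_set)

def keyword_classifier(text):
    """Hypothetical stand-in classifier: flags any text containing 'prize' as spam."""
    return "spam" if "prize" in text.lower() else "ham"

# Toy hand-classified data invented for illustration.
LABELED = [
    ("win a prize now", "spam"),
    ("claim your prize", "spam"),
    ("cheap offer today", "spam"),
    ("prize draw winner", "spam"),
    ("meeting on monday", "ham"),
    ("lunch at noon", "ham"),
    ("project status report", "ham"),
    ("see you tomorrow", "ham"),
]

train_set, test_set = split_data(LABELED)
print(len(train_set), len(test_set))
print(accuracy(keyword_classifier, test_set))
```

Keeping the test items out of training is what makes the accuracy figure meaningful: a classifier scored on data it has already seen will look better than it really is.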
