官术网_书友最值得收藏!

Loading and preparing the dataset

The dataset we are going to use for this example is the famous Iris database of plant classification. In this dataset, we have 150 plant samples and four measurements of each: sepal length, sepal width, petal length, and petal width (all in centimeters). This classic dataset (first used in 1936!) is one of the classic datasets for data mining. There are three classes: Iris Setosa, Iris Versicolour, and Iris Virginica. The aim is to determine which type of plant a sample is, by examining its measurements.

The scikit-learn library contains this dataset built-in, making the loading of the dataset straightforward:

from sklearn.datasets import load_iris 
dataset = load_iris()
X = dataset.data
y = dataset.target

You can also print(dataset.DESCR) to see an outline of the dataset, including some details about the features.

The features in this dataset are continuous values, meaning they can take any range of values. Measurements are a good example of this type of feature, where a measurement can take the value of 1, 1.2, or 1.25 and so on. Another aspect of continuous features is that feature values that are close to each other indicate similarity. A plant with a sepal length of 1.2 cm is like a plant with a Sepal width of 1.25 cm.

In contrast are categorical features. These features, while often represented as numbers, cannot be compared in the same way. In the Iris dataset, the class values are an example of a categorical feature. The class 0 represents Iris Setosa; class 1 represents Iris Versicolour, and class 2 represents Iris Virginica. The numbering doesn't mean that Iris Setosa is more similar to Iris Versicolour than it is to Iris Virginica-despite the class value being more similar. The numbers here represent categories. All we can say is whether categories are the same or different.

There are other types of features too, which we will cover in later chapters. These include pixel intensity, word frequency and n-gram analysis.

While the features in this dataset are continuous, the algorithm we will use in this example requires categorical features. Turning a continuous feature into a categorical feature is a process called discretization.

A simple discretization algorithm is to choose some threshold, and any values below this threshold are given a value 0. Meanwhile, any above this are given the value 1. For our threshold, we will compute the mean (average) value for that feature. To start with, we compute the mean for each feature:

attribute_means = X.mean(axis=0)

The result from this code will be an array of length 4, which is the number of features we have. The first value is the mean of the values for the first feature and so on. Next, we use this to transform our dataset from one with continuous features to one with discrete categorical features:

assert attribute_means.shape == (n_features,)
X_d = np.array(X >= attribute_means, dtype='int')

We will use this new X_d dataset (for X discretized) for our training and testing, rather than the original dataset (X).

主站蜘蛛池模板: 三原县| 余庆县| 吉木萨尔县| 永胜县| 鹿邑县| 农安县| 墨玉县| 耒阳市| 涞源县| 吉水县| 衡水市| 和平县| 杭锦旗| 崇义县| 延川县| 信丰县| 遂宁市| 张北县| 宝山区| 祁阳县| 博白县| 通海县| 余庆县| 桐柏县| 永丰县| 青浦区| 马边| 尖扎县| 建湖县| 诸城市| 锦州市| 北川| 鹤山市| 丰镇市| 奉节县| 兴宁市| 南和县| 大厂| 潞西市| 漠河县| 湟源县|