
Loading and preparing the dataset

The dataset we are going to use for this example is the famous Iris dataset for plant classification. In this dataset, we have 150 plant samples and four measurements of each: sepal length, sepal width, petal length, and petal width (all in centimeters). This dataset (first used in 1936!) is one of the classics of data mining. There are three classes: Iris Setosa, Iris Versicolour, and Iris Virginica. The aim is to determine which type of plant a sample is by examining its measurements.

The scikit-learn library contains this dataset built in, making it straightforward to load:

from sklearn.datasets import load_iris

# Load the built-in Iris dataset
dataset = load_iris()
X = dataset.data    # 150 x 4 array of measurements
y = dataset.target  # class label (0, 1, or 2) for each sample

You can also print(dataset.DESCR) to see an outline of the dataset, including some details about the features.
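As a quick check, here is a minimal sketch that inspects what we just loaded; the shapes follow from the dataset description above:

print(X.shape)  # (150, 4): 150 samples, four measurements each
print(y.shape)  # (150,): one class label per sample
print(X[:3])    # the first three samples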

The features in this dataset are continuous values, meaning they can take any value within a range. Measurements are a good example of this type of feature: a measurement can take the value 1, 1.2, 1.25, and so on. Another aspect of continuous features is that values close to each other indicate similarity. A plant with a sepal length of 1.2 cm is much like a plant with a sepal length of 1.25 cm.

In contrast are categorical features. These features, while often represented as numbers, cannot be compared in the same way. In the Iris dataset, the class values are an example of a categorical feature. The class 0 represents Iris Setosa, class 1 represents Iris Versicolour, and class 2 represents Iris Virginica. The numbering doesn't mean that Iris Setosa is more similar to Iris Versicolour than it is to Iris Virginica, despite the class values being numerically closer. The numbers here represent categories; all we can say is whether two categories are the same or different.
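The mapping from class numbers to names is stored in the dataset itself, and this short sketch prints it:

print(dataset.target_names)  # ['setosa' 'versicolor' 'virginica']
print(y[:5])                 # the first samples in the dataset all belong to class 0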

There are other types of features too, which we will cover in later chapters. These include pixel intensity, word frequency, and n-gram analysis.

While the features in this dataset are continuous, the algorithm we will use in this example requires categorical features. Turning a continuous feature into a categorical feature is a process called discretization.

A simple discretization algorithm is to choose a threshold: any value below the threshold becomes 0, while any value at or above it becomes 1 (matching the >= comparison in the code below). For our threshold, we will use the mean (average) value of each feature. To start with, we compute the mean for each feature:

# Compute the mean of each feature, averaging over all samples (axis 0)
attribute_means = X.mean(axis=0)

The result from this code will be an array of length 4, which is the number of features we have. The first value is the mean of the values for the first feature and so on. Next, we use this to transform our dataset from one with continuous features to one with discrete categorical features:

import numpy as np

n_features = X.shape[1]  # the number of features (4 for Iris)
assert attribute_means.shape == (n_features,)
X_d = np.array(X >= attribute_means, dtype='int')  # 1 if at or above the mean, else 0

We will use this new X_d dataset (for X discretized) for our training and testing, rather than the original dataset (X).
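To see the effect, a minimal sketch comparing the first sample before and after discretization:

print(attribute_means)  # the per-feature thresholds
print(X[0])    # the first sample's original continuous measurements
print(X_d[0])  # the same sample as 0/1 categorical features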
