官术网_书友最值得收藏!

Preprocessing the data

Now let's open up the Jupyter Notebook and get started on our first program, using the methods that we discussed in the previous section:

  1. The first thing we need to do is load the various libraries that we need. We will also load the iris dataset from the scikit-learn library, using the following code:

  1. After importing all the required libraries and the dataset, we will go ahead and create an object called iris_obj, which loads the iris dataset into an object. Then, we will go ahead and use the data method to preview the dataset; and this results in the following output:

Notice that it's a NumPy array. This contains a lot of the data that we want, and each of these columns corresponds to a feature.

  1. We will now see what those feature names are in the following output:

As you can see here, the first column shows the sepal length, the next column shows the sepal width, the third column shows the petal length, and the final column shows the petal width.

  1. Now, there is a fifth column that is not displayed hereit's referred to as the target column. This is stored in a separate array; we will now look at this column as follows:

This displays the target column in an array.

  1. Now, if you want to see the labels of the array header, we can use the following code:

As you can see, the target column consists of data with three different labels. The flowers come from either the setosa, the versicolor, or the virginica species.

  1. Our next step is to take this dataset and turn it into a pandas DataFrame, using the following code:

This results in the following output:

As you can see, we have successfully loaded the data into a DataFrame.

  1. We can see that the species column still shows the various species using numeric values. So, we will replace the final column, which indicates the various species, with strings that indicate the values, rather than numbers, using the following code block:

The following screenshot shows the result:

As you can see, the species column now has the actual species namesthis makes it much easier to work with the data.

Now, for this dataset, the fact that each flower comes from a different species suggests that we may want to actually group the data when we're doing statistical summariestherefore, we can try grouping by species.

  1. So, we will now group the dataset values using the species column as the anchor, and then print out the details of each group to make sure that everything is working. We will use the following lines of code to do so:

This results in the following output:

Now that the data has been loaded and set up, we will use it to perform some basic statistical operations in the next section.

主站蜘蛛池模板: 洛南县| 宝应县| 武穴市| 辽阳市| 固安县| 海晏县| 类乌齐县| 兴业县| 陆川县| 新绛县| 泸定县| 胶州市| 新昌县| 阳原县| 咸丰县| 巴林左旗| 施甸县| 靖宇县| 渝北区| 青铜峡市| 绩溪县| 宁海县| 祁阳县| 白玉县| 涟源市| 邵阳县| 德兴市| 邹城市| 澄迈县| 萝北县| 岱山县| 阿鲁科尔沁旗| 郁南县| 砀山县| 葫芦岛市| 高平市| 灵石县| 咸阳市| 姚安县| 南乐县| 石楼县|