- Learning Data Mining with Python (Second Edition)
- Robert Layton
Loading the dataset
The dataset we will use is called Ionosphere, and it consists of readings recorded by high-frequency antennas. The aim of the antennas is to determine whether there is a structure in the ionosphere, a region in the upper atmosphere. We consider readings with a structure to be good, while those without a structure are deemed bad. The aim of this application is to build a data mining classifier that can determine whether a reading is good or bad.

(Image Credit: https://www.flickr.com/photos/geckzilla/16149273389/)
You can download this dataset for different data mining applications. Go to http://archive.ics.uci.edu/ml/datasets/Ionosphere and click on Data Folder. Download the ionosphere.data and ionosphere.names files to a folder on your computer. For this example, I'll assume that you have put the dataset in a directory called Data in your home folder. You can place the data in another folder; just be sure to update the data filename accordingly (here, and in all other chapters).
The location of your home folder depends on your operating system. For Windows, it is usually at C:\Documents and Settings\username. For Mac or Linux machines, it is usually at /home/username. You can find your home folder by running this Python code inside a Jupyter Notebook:
import os
print(os.path.expanduser("~"))
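Combining these ideas, you can also fetch the file programmatically instead of through the browser. The following is a minimal sketch using Python's standard urllib module; the direct file URL is an assumption based on the usual layout of the UCI archive, so verify it in your browser if the download fails:
import os
from urllib.request import urlretrieve

# Assumed direct URL, following the usual UCI archive layout
url = ("http://archive.ics.uci.edu/ml/machine-learning-databases/"
       "ionosphere/ionosphere.data")
# Save into the Data directory in your home folder, as used in this chapter
data_folder = os.path.join(os.path.expanduser("~"), "Data")
os.makedirs(data_folder, exist_ok=True)
urlretrieve(url, os.path.join(data_folder, "ionosphere.data"))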
For each row in the dataset, there are 35 values. The first 34 are measurements taken from the 17 antennas (two values per antenna). The last is either 'g' or 'b', which stand for good and bad, respectively.
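Before loading everything, you can peek at the first row to confirm this layout. This is just an illustrative check, assuming the file is in the Data directory described above:
import os
import csv

data_filename = os.path.join(os.path.expanduser("~"), "Data", "ionosphere.data")
with open(data_filename, 'r') as input_file:
    first_row = next(csv.reader(input_file))
print(len(first_row))  # should print 35
print(first_row[-1])   # should print 'g' or 'b'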
Start the Jupyter Notebook server and create a new notebook called Ionosphere Nearest Neighbors. To start with, we load the NumPy, csv, and os libraries that we will need for our code, and set the filename of the dataset:
import numpy as np
import csv
import os
data_filename = os.path.join(os.path.expanduser("~"), "Data", "ionosphere.data")
We then create the X and y NumPy arrays to store the dataset in. The sizes of these arrays are known from the dataset. Don't worry if you don't know a dataset's size in advance; we will use other methods to load datasets in future chapters, and you won't need to know the size beforehand:
X = np.zeros((351, 34), dtype='float')
y = np.zeros((351,), dtype='bool')
The dataset is in a Comma-Separated Values (CSV) format, which is a commonly used format for datasets. We are going to use the csv module to load this file. Import it and set up a csv reader object, then loop through the file, setting the appropriate row in X and class value in y for every line in our dataset:
with open(data_filename, 'r') as input_file:
    reader = csv.reader(input_file)
    for i, row in enumerate(reader):
        # Get the data, converting each item to a float
        data = [float(datum) for datum in row[:-1]]
        # Set the appropriate row in our dataset
        X[i] = data
        # 1 if the class is 'g', 0 otherwise
        y[i] = row[-1] == 'g'
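A quick check confirms that the arrays were filled as expected. The shape comes straight from the array we created, and summing a boolean array counts the True entries:
print(X.shape)  # should print (351, 34)
print(y.sum())  # the number of 'g' (good) readings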
We now have a dataset of samples and features in X as well as the corresponding classes in y, as we did in the classification example in Chapter 1, Getting Started with Data Mining.
To begin with, try applying the OneR algorithm from Chapter 1, Getting Started with Data Mining, to this dataset. It won't work very well, as the information in this dataset is spread across the correlations of certain features. OneR only looks at the values of a single feature and cannot pick up information in more complex datasets very well. Other algorithms, including Nearest Neighbor, combine information from multiple features, making them applicable in more scenarios. The downside is that they are often computationally more expensive.
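To preview where this chapter is heading, here is a minimal sketch of a Nearest Neighbor classifier applied to this data using scikit-learn. The estimator and the train/test helper come from scikit-learn rather than from the code above, and the following sections develop this approach properly:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold out part of the data to estimate how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

# Fit a nearest neighbor classifier with its default settings
estimator = KNeighborsClassifier()
estimator.fit(X_train, y_train)
print(estimator.score(X_test, y_test))  # accuracy on the held-out data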