
Exploratory analysis

Before starting with data analysis through the classification algorithm, we will conduct an exploratory analysis to understand how the data is distributed and extract preliminary knowledge. To display the first twenty rows of the DataFrame that's been imported, we can use the head() function, as follows:

print(Data.head(20))

The following results are returned:

The first 20 rows are displayed. The head() function returns the first n rows of the object, based on position (five rows if n is not given). This is useful for quickly checking that the object contains the right type of data. With the dataset now available in our Python environment, we can extract some further information by invoking the info() function, as follows:

print(Data.info())

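This method prints a concise summary of a DataFrame, including the index dtype and column dtypes, the non-null counts, and the memory usage. Because info() prints the summary itself and returns None, wrapping it in print() adds a final None line. The following results are returned: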

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302 entries, 0 to 301
Data columns (total 14 columns):
age 302 non-null int64
sex 302 non-null int64
cp 302 non-null int64
trestbps 302 non-null int64
chol 302 non-null int64
fbs 302 non-null int64
restecg 302 non-null int64
thalach 302 non-null int64
exang 302 non-null int64
oldpeak 302 non-null float64
slope 302 non-null int64
ca 302 non-null object
hal 302 non-null object
HeartDisease 302 non-null int64
dtypes: float64(1), int64(11), object(2)
memory usage: 33.1+ KB
None

The summary is informative. The DataFrame has 302 entries and 14 columns, and for each feature the name, the number of non-null values, and the data type are listed. This already gives us an idea of the kinds of variables we are about to analyze. Three types have been identified: float64(1), int64(11), and object(2). The first two raise no doubts: they are real and integer numbers, respectively. The anomaly lies in the two columns labeled as object. To understand what happened, it is useful to check the types of data provided by the pandas library, as shown in the following table:

pandas dtype       Usage
object             Text or mixed numeric and non-numeric values
int64              Integer numbers
float64            Floating-point numbers
bool               True/False values
datetime64[ns]     Date and time values
timedelta64[ns]    Differences between two datetimes
category           Finite list of text values

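Now, everything is clear: the two columns, ca and hal, have been labeled as containing text. Why did this happen? The problem is due to the presence of missing values. Note that info() reports 302 non-null values for both columns, so the missing entries are not stored as NaN; presumably they are encoded with a non-numeric placeholder, which forces pandas to read the whole column as text. Keep this in mind, as we will have to deal with this problem before proceeding with the construction of the model.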
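One way to confirm this diagnosis is to look for entries in the non-numeric columns that cannot be parsed as numbers. The following is a minimal sketch rather than the book's own procedure: it uses a small stand-in DataFrame in place of Data (the values and the "?" placeholder are assumptions made for illustration), and the helper name non_numeric_tokens is hypothetical.

```python
import pandas as pd

# Stand-in for the imported DataFrame. The values and the "?" marker are
# illustrative assumptions, not rows taken from the real dataset.
sample = pd.DataFrame({
    "age": [63, 67, 41, 56],
    "ca": ["0.0", "3.0", "?", "1.0"],
    "hal": ["6.0", "?", "3.0", "7.0"],
})


def non_numeric_tokens(frame):
    """Map each non-numeric column to its distinct unparseable entries."""
    tokens = {}
    for name in frame.columns:
        column = frame[name]
        if pd.api.types.is_numeric_dtype(column):
            continue  # already numeric, nothing to check
        # errors="coerce" turns anything that is not a number into NaN
        parsed = pd.to_numeric(column, errors="coerce")
        # Keep entries that were present but failed to parse
        bad = column[parsed.isna() & column.notna()]
        tokens[name] = sorted(str(value) for value in bad.unique())
    return tokens


found = non_numeric_tokens(sample)
print(found)
```

Run against the real DataFrame, a call such as non_numeric_tokens(Data) would list whatever placeholder the file actually uses, which is the value to treat as missing in the cleaning step.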

To get a first overview of the values contained in the DataFrame, we can calculate a series of basic statistics. To do so, we will use the describe() function in the following way:

summary = Data.describe()
print(summary)

The following results are returned:

The describe() function generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values. It can analyze both numeric and object series, as well as DataFrame column sets of mixed data types, and the output varies depending on what is provided. By default, only the numeric columns of a mixed DataFrame are summarized, so the two object columns, ca and hal, do not appear in the results unless include='all' is passed. To analyze all 14 features, it is therefore necessary to address the problem of missing values first.
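The difference between the default output and a full summary can be seen on a small example. This is only a sketch: the stand-in DataFrame and its values are assumptions made for illustration, while include="all" is a documented parameter of describe().

```python
import pandas as pd

# Stand-in data with one numeric and one text column (assumed values).
sample = pd.DataFrame({
    "chol": [233, 286, 229, 250],
    "ca": ["0.0", "3.0", "?", "0.0"],
})

# Default behavior: only the numeric column is summarized.
numeric_summary = sample.describe()

# include="all": every column is summarized; statistics that do not
# apply to a column (for example, the mean of a text column) are NaN.
full_summary = sample.describe(include="all")

print(numeric_summary)
print(full_summary)
```

For the text column, the full summary reports count, unique, top, and freq instead of the mean and quartiles, which is a quick way to spot a column that should have been numeric.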
