
Applying descriptive statistics

Having preprocessed the dataset, let's do some sanity checking using descriptive statistics techniques. 

We can implement this as shown here: 

dfs.info()

The output of the preceding code is as follows:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37554 entries, 1 to 78442
Data columns (total 6 columns):
subject    37367 non-null object
from       37554 non-null object
date       37554 non-null datetime64[ns, UTC]
to         36882 non-null object
label      36962 non-null object
thread     37554 non-null object
dtypes: datetime64[ns, UTC](1), object(5)
memory usage: 2.0+ MB
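
The non-null counts in this output also show where data is missing: out of 37,554 rows, subject lacks 187 values, to lacks 672, and label lacks 592. As a minimal sketch, the same counts can be computed directly with isna().sum(). The small sample dataframe below is a hypothetical stand-in for dfs with invented values, used only so that the snippet runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for dfs: same six column names, invented values.
sample = pd.DataFrame({
    "subject": ["Hello", None, "Re: Hello"],
    "from": ["Alice <alice@example.com>", "bob@example.com",
             "Alice <alice@example.com>"],
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"], utc=True),
    "to": ["bob@example.com", None, None],
    "label": ["inbox", "sent", None],
    "thread": ["t1", "t2", "t1"],
})

# isna() flags missing cells; sum() counts them column by column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Running the same two calls on dfs itself should reproduce the gaps that can be read off the info() output.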

We will learn more about descriptive statistics in Chapter 5, Descriptive Statistics. Note that there are 37,554 emails, each described by six columns: subject, from, date, to, label, and thread. Let's check the first few entries of the email dataset:

dfs.head(10)

The output of the preceding code is as follows:

Note that our dataframe so far contains six different columns. Take a look at the from field: it contains both the sender's name and the email address. For our analysis, we only need the email address, so we can use a regular expression to extract it from the column.
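
One way to do this, shown here as a rough sketch rather than the exact code used later, is to pass a regular expression to the pandas str.extract method. The pattern EMAIL_PATTERN and the sample rows are assumptions made for illustration; the pattern is deliberately simple and does not cover every valid address format:

```python
import pandas as pd

# Assumed, simplified pattern: local part, "@", then a dotted domain name.
EMAIL_PATTERN = r"([\w.+-]+@[\w-]+(?:\.[\w-]+)+)"

# Hypothetical rows imitating the mixed "Name <address>" format of the from field.
sample = pd.DataFrame({
    "from": [
        "Alice Smith <alice@example.com>",
        "bob@example.org",
        "no address here",
    ]
})

# str.extract keeps the first match of the capture group;
# expand=False returns a Series, and rows without a match become NaN.
sample["from"] = sample["from"].str.extract(EMAIL_PATTERN, expand=False)
print(sample)
```

Rows in which no address is found end up as missing values, so it is worth checking how many there are before continuing with the analysis.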
