
Applying descriptive statistics

Having preprocessed the dataset, let's do some sanity checking using descriptive statistics techniques. 

We can implement this as shown here: 

dfs.info()

The output of the preceding code is as follows:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37554 entries, 1 to 78442
Data columns (total 6 columns):
subject    37367 non-null object
from       37554 non-null object
date       37554 non-null datetime64[ns, UTC]
to         36882 non-null object
label      36962 non-null object
thread     37554 non-null object
dtypes: datetime64[ns, UTC](1), object(5)
memory usage: 2.0+ MB
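
The non-null counts in this output fall short of the 37,554 entries for three columns, which implies 187 missing values in subject, 672 in to, and 592 in label. As a minimal sketch of how such gaps can be counted directly, the following uses a small hypothetical dataframe named sample (not the real dfs) together with the pandas isna method:

```python
import pandas as pd

# Hypothetical stand-in for dfs, with a few deliberately missing values.
sample = pd.DataFrame({
    "subject": ["Hello", None, "Invoice"],
    "from": ["a@example.com", "b@example.com", "c@example.com"],
    "to": ["x@example.com", None, None],
})

# isna() flags each missing cell; sum() counts the flags per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Running the same call on dfs should report the per-column gaps implied by the info() output.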

We will learn more about descriptive statistics in Chapter 5, Descriptive Statistics. Note that there are 37,554 emails, each described by six columns: subject, from, date, to, label, and thread. Let's check the first few entries of the email dataset:

dfs.head(10)

The output of the preceding code is as follows:

Note that our dataframe so far contains six different columns. Take a look at the from field: it contains both the sender's name and the email address. For our analysis, we only need the email address, so we can use a regular expression to extract it from the column.
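
As a rough sketch of what such an extraction could look like, the following uses the pandas str.extract method on a small hypothetical dataframe. The sample values, the from_email column name, and the EMAIL_PATTERN expression are all assumptions made for illustration; the pattern matches common address shapes and is not a complete email validator.

```python
import pandas as pd

# Hypothetical values standing in for the real "from" column.
sample = pd.DataFrame({
    "from": [
        "Jane Doe <jane.doe@example.com>",
        "john@example.org",
        "no address here",
    ]
})

# Illustrative pattern for a typical address; not a full RFC 5322 validator.
EMAIL_PATTERN = r"([\w.+-]+@[\w-]+(?:\.[\w-]+)+)"

# str.extract returns the first capture group, or NaN when nothing matches.
sample["from_email"] = sample["from"].str.extract(EMAIL_PATTERN, expand=False)
print(sample)
```

Writing the result to a new column keeps the original from values available for comparison; assigning back to the from column instead would overwrite them in place.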
