Applying descriptive statistics

Having preprocessed the dataset, let's do some sanity checking using descriptive statistics techniques. 

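We can start by calling the info() method of the pandas dataframe, which summarizes its columns, their data types, and their non-null counts: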

dfs.info()

The output of the preceding code is as follows:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37554 entries, 1 to 78442
Data columns (total 6 columns):
subject 37367 non-null object
from 37554 non-null object
date 37554 non-null datetime64[ns, UTC]
to 36882 non-null object
label 36962 non-null object
thread 37554 non-null object
dtypes: datetime64[ns, UTC](1), object(5)
memory usage: 2.0+ MB
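
The non-null counts above are lower than the number of entries for subject, to, and label, which means those columns contain missing values. As a minimal sketch of how such gaps could be counted per column, the following uses a small made-up dataframe named sample (an assumption for illustration, not the real dfs) together with the pandas isna() method:

```python
import pandas as pd

# Hypothetical stand-in for dfs; the values are invented for illustration only.
sample = pd.DataFrame({
    "subject": ["Hello", None, "Re: Hello"],
    "to": ["a@example.com", "b@example.com", None],
    "label": ["inbox", None, None],
})

# isna() marks missing cells as True; summing counts them per column.
missing_counts = sample.isna().sum()

# The mean of the boolean mask gives the fraction of missing cells per column.
missing_share = sample.isna().mean().round(3)

summary = pd.DataFrame({"missing": missing_counts, "share": missing_share})
print(summary)
```

Running the same two calls on dfs should report the gaps that info() only hints at.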

We will learn more about descriptive statistics in Chapter 5, Descriptive Statistics. Note that there are 37,554 emails, each described by six columns: subject, from, date, to, label, and thread. Let's check the first few entries of the email dataset:

dfs.head(10)

The output of the preceding code is as follows:

Note that our dataframe so far contains six columns. Take a look at the from field: it contains both the sender's name and the email address. For our analysis, we only need the email address, so we can use a regular expression to extract it from the column.
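
As a rough sketch of that idea, the following pulls an address out of a from-style field with the pandas str.extract() method. The sample rows, the from_email column name, and the EMAIL_PATTERN expression are assumptions made for illustration; the pattern is deliberately simplified and is not a complete email address validator:

```python
import pandas as pd

# Hypothetical sample rows; the real values live in the from column of dfs.
sample = pd.DataFrame({
    "from": [
        "Jane Doe <jane.doe@example.com>",
        "john@example.org",
        "No Address Here",
    ]
})

# Simplified pattern: local part, an @ sign, then a dotted domain name.
EMAIL_PATTERN = r"([\w.+-]+@[\w-]+(?:\.[\w-]+)+)"

# With one capture group and expand=False, str.extract returns a Series.
# Rows without a match become NaN instead of raising an error.
sample["from_email"] = sample["from"].str.extract(EMAIL_PATTERN, expand=False)
print(sample)
```

Writing the result to a separate column keeps the original field available for comparison until the extraction has been checked.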
