- Python Data Science Essentials
- Alberto Boschetti Luca Massaron
- 675字
- 2021-08-13 15:19:39
Dealing with problematic data
Now, you should be more confident with the basics of the process and be ready to face datasets that are more problematic, since it is very common to have messy data in reality. Consequently, let's see what happens if the CSV file contains a header and some missing values and dates. For example, to make our example realistic, let's imagine the situation of a travel agency:
- According to the temperature of three popular destinations, they record whether the user picks the first, second, or third destination:
Date,Temperature_city_1,Temperature_city_2,Temperature_city_3,Which_destination
20140910,80,32,40,1
20140911,100,50,36,2
20140912,102,55,46,1
20140912,60,20,35,3
20140914,60,,32,3
20140914,,57,42,2
- In this case, all the numbers are integers and the header is in the file. In our first attempt to load this dataset, we can provide the following command:
In: import pandas as pd
In: fake_dataset = pd.read_csv('a_loading_example_1.csv', sep=',')
fake_dataset
The top rows of the fake_dataset are printed:

Pandas automatically gave the columns their actual name after picking them from the first data row. We first detect a problem: all of the data, even the dates, have been parsed as integers (or, in other cases, as strings). If the format of the dates is not very strange, you can try the auto-detection routines that specify the column that contains the date data. In the following example, it works well when using the following arguments:
In: fake_dataset = pd.read_csv('a_loading_example_1.csv',
parse_dates=[0])
fake_dataset
Here is the fake_dataset whose date column is now correctly interpreted by the read_csv:

Now, in order to get rid of the missing values that are indicated by NaN, replace them with a more meaningful number (let's say, 50 Fahrenheit). We can execute our command in the following way:
In: fake_dataset.fillna(50)
At this point you will notice that there no more missing variables:

After that, all of the missing data has disappeared and it has been replaced by the constant 50.0. Treating missing data can also require different approaches. As an alternative to the previous command, values can be replaced by a negative constant value to mark the fact that they are different from others (and leave the guess for the learning algorithm):
In: fake_dataset.fillna(-1)
NaN values can also be replaced by the column's mean or median value as a way to minimize the guessing error:
In: fake_dataset.fillna(fake_dataset.mean(axis=0))
The .mean method calculates the mean of the specified axis.
The .median method is analogous to .mean, but it computes the median value, which is useful if the mean is not a very good representation of the central value in the data, given a too skewed distribution (for instance, when there are many extreme values in your feature).
Another possible problem when handling real-world datasets is when loading a dataset containing errors or bad lines. In this case, the default behavior of the read_csv method is to stop and raise an exception. A possible workaround, which is feasible when erroneous examples are not the majority, is to ignore the lines causing exceptions. In many cases, such a choice has the sole implication of training the machine learning algorithm without the erroneous observations. As an example, let's say that you have a badly formatted dataset and you want to load just all the good lines and ignore the badly formatted ones.
This is now your a_loading_example_2.csv file:
Val1,Val2,Val3
0,0,0
1,1,1
2,2,2,2
3,3,3
And here is what you can do with the error_bad_lines option:
In: bad_dataset = pd.read_csv('a_loading_example_2.csv',
error_bad_lines=False)
bad_dataset
Out: Skipping line 4: expected 3 fields, saw 4
The resulting output has the fourth line skipped because it has four values instead of three:

- Dreamweaver CS3 Ajax網頁設計入門與實例詳解
- 程序設計缺陷分析與實踐
- 21天學通C語言
- 大數據驅動的機械裝備智能運維理論及應用
- 單片機C語言應用100例
- SQL Server數據庫應用基礎(第2版)
- Hadoop Beginner's Guide
- 計算機應用基礎實訓·職業模塊
- 手把手教你學Photoshop CS3
- 納米集成電路制造工藝(第2版)
- ROS Robotics By Example(Second Edition)
- 人工智能:重塑個人、商業與社會
- 大話數據科學:大數據與機器學習實戰(基于R語言)
- 實戰大數據(Hadoop+Spark+Flink):從平臺構建到交互式數據分析(離線/實時)
- Cloud Native Development Patterns and Best Practices