
Dealing with problematic data

Now that you should be more confident with the basics of the process, you are ready to face more problematic datasets, since messy data is very common in reality. Consequently, let's see what happens if the CSV file contains a header, as well as some missing values and dates. For example, to make things realistic, let's imagine the situation of a travel agency:

  1. The agency records the temperature of three popular destinations, along with whether the user picks the first, second, or third destination:
Date,Temperature_city_1,Temperature_city_2,Temperature_city_3,Which_destination
20140910,80,32,40,1
20140911,100,50,36,2
20140912,102,55,46,1
20140912,60,20,35,3
20140914,60,,32,3
20140914,,57,42,2
  2. In this case, all the numbers are integers and the header is in the file. In our first attempt to load this dataset, we can issue the following command:
In: import pandas as pd

In: fake_dataset = pd.read_csv('a_loading_example_1.csv', sep=',')
fake_dataset

The top rows of the fake_dataset are printed:
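
Besides printing the top rows, you can verify how each column was parsed by inspecting the DataFrame's dtypes attribute:

In: fake_dataset.dtypes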

pandas automatically gave the columns their actual names, picking them from the first row of the file. We immediately detect a problem: all of the data, even the dates, has been parsed as integers (or, in other cases, as strings). If the format of the dates is not too strange, you can try pandas' auto-detection routine by specifying which column contains the date data. In the following example, it works well with the following arguments:

In: fake_dataset = pd.read_csv('a_loading_example_1.csv',
                               parse_dates=[0])
fake_dataset

Here is the fake_dataset, whose Date column is now correctly interpreted by read_csv:

Now, in order to get rid of the missing values, which are indicated by NaN, we can replace them with a more meaningful number (let's say, 50 degrees Fahrenheit). We can execute the command in the following way:

In: fake_dataset.fillna(50)

At this point, you will notice that there are no more missing values:

After that, all of the missing data has disappeared, replaced by the constant 50.0. Treating missing data can also require different approaches. As an alternative to the previous command, values can be replaced by a negative constant value to mark the fact that they are different from the others (and leave the guessing to the learning algorithm):

In: fake_dataset.fillna(-1)  
Note that this method returns a copy of the data with the missing values filled in (that is, it doesn't modify the original DataFrame). In order to actually change the DataFrame, assign the result back to it or pass the inplace=True argument.
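
As a quick sketch of both options (the fake_dataset_filled name is introduced here just for illustration):

In: # option 1: reassign the result (fillna returns a new DataFrame)
    fake_dataset_filled = fake_dataset.fillna(-1)
    # option 2 (equivalent): modify fake_dataset itself in place
    # fake_dataset.fillna(-1, inplace=True)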

NaN values can also be replaced by the column's mean or median value as a way to minimize the guessing error:

In: fake_dataset.fillna(fake_dataset.mean(axis=0))  

The .mean method calculates the mean of the specified axis.

Please note that axis=0 implies a calculation that spans the rows, so the resulting means are computed column by column. Instead, axis=1 spans the columns, and therefore row-wise results are obtained. This works in the same way for all the other methods that require the axis parameter, both in pandas and NumPy.
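
To see the difference concretely, here is a minimal sketch with a toy DataFrame (the toy name and its values are made up for illustration):

In: import pandas as pd
    toy = pd.DataFrame({'a': [1.0, 3.0], 'b': [2.0, 4.0]})
    toy.mean(axis=0)  # spans the rows: column-wise means (a: 2.0, b: 3.0)
    toy.mean(axis=1)  # spans the columns: row-wise means (1.5 and 3.5)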

The .median method is analogous to .mean, but it computes the median, which is useful when the mean is not a good representation of the central value because the distribution is too skewed (for instance, when there are many extreme values in your feature).
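
By analogy with the mean-based command above, a median-based replacement might look like the following sketch (numeric_only=True is an assumption added here so that non-numeric columns, such as the parsed Date, are skipped on recent pandas versions):

In: fake_dataset.fillna(fake_dataset.median(axis=0, numeric_only=True))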

Another possible problem when handling real-world datasets arises when loading a dataset that contains errors or bad lines. In this case, the default behavior of the read_csv method is to stop and raise an exception. A possible workaround, which is feasible when the erroneous examples are not the majority, is to ignore the lines that cause exceptions. In many cases, such a choice has the sole implication of training the machine learning algorithm without the erroneous observations. As an example, let's say that you have a badly formatted dataset and you want to load all the good lines and ignore the badly formatted ones.

This is now your a_loading_example_2.csv file:

Val1,Val2,Val3
0,0,0
1,1,1
2,2,2,2
3,3,3

And here is what you can do with the error_bad_lines option:

In: bad_dataset = pd.read_csv('a_loading_example_2.csv',
                              error_bad_lines=False)

bad_dataset

Out: Skipping line 4: expected 3 fields, saw 4

The resulting output has the fourth line skipped because it has four values instead of three:
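
Note that the error_bad_lines argument was deprecated in pandas 1.3 and removed in pandas 2.0; on recent versions, the same result is obtained with the on_bad_lines argument:

In: bad_dataset = pd.read_csv('a_loading_example_2.csv',
                              on_bad_lines='skip')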
