
Analyzing the effect of outliers

Just to prove a point, let's add an outlier. We'll take Donald Trump; I think he qualifies as an outlier. I'm going to manually add his income to the incomes data using np.append, and let's say it's a billion dollars (which is obviously not Donald Trump's actual income):

incomes = np.append(incomes, [1000000000]) 

What we're going to see is that this outlier barely changes the median. It is still going to be around the same value, $26,911, because a single added value doesn't really move the middle point of the data, as shown in the following example:

np.median(incomes) 

This will output the following:

Out[5]: 26911.948365056276 

The mean, on the other hand, changes dramatically:

np.mean(incomes) 

The following is the output of the preceding code:

Out[5]: 127160.38252311043 

Aha, so there you have it! This is a great example of how the median and the mean, although people tend to equate them in everyday language, can be very different and tell very different stories. That one outlier pushed the average income in this dataset to over $127,160 a year, but the more accurate picture is closer to $27,000 a year for the typical person in this dataset. The mean was skewed by one big outlier.
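To see why one value moves the mean so much while leaving the median almost untouched, here is a minimal, self-contained sketch. It does not use the dataset from the preceding steps; it assumes a synthetic sample of 10,000 incomes drawn from a normal distribution centered on $27,000 with a standard deviation of $15,000, and the variable names other than incomes are made up for illustration. Under that assumption, a single billion-dollar entry should raise the mean by roughly 1,000,000,000 / 10,001, or about $100,000, while the median can only slide between neighboring values near the middle of the sorted data.

```python
import numpy as np

# Assumption: synthetic data standing in for the incomes array used above.
# 10,000 samples, normally distributed around $27,000 (std dev $15,000).
rng = np.random.default_rng(0)
incomes = rng.normal(27000, 15000, 10000)

# Record both statistics before adding the outlier.
mean_before = np.mean(incomes)
median_before = np.median(incomes)

# Add a single billion-dollar outlier, as in the text.
incomes_with_outlier = np.append(incomes, [1000000000])

mean_after = np.mean(incomes_with_outlier)
median_after = np.median(incomes_with_outlier)

# How far did each statistic move?
mean_shift = mean_after - mean_before
median_shift = median_after - median_before

print("Mean before/after:  ", mean_before, mean_after)
print("Median before/after:", median_before, median_after)
print("Mean shift:  ", mean_shift)
print("Median shift:", median_shift)
```

The exact numbers depend on the random sample, but the mean shift should land close to $100,000, while the median shift should be at most a few dollars.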

The moral of the story is: take anyone who talks about means or averages with a grain of salt if you suspect there might be outliers involved, and income distribution is definitely a case of that.
