
Upsampling time series data

Upsampling increases the frequency of a time series, so the new index has more sample points than there are measurements. The main question is how to fill the entries for which no measurement exists.

Let's start with hourly data for a single day:

>>> import numpy as np
>>> import pandas as pd
>>> rng = pd.date_range('4/29/2015 8:00', periods=10, freq='H')
>>> ts = pd.Series(np.random.randint(0, 100, len(rng)), index=rng)
>>> ts.head()
2015-04-29 08:00:00 30
2015-04-29 09:00:00 27
2015-04-29 10:00:00 54
2015-04-29 11:00:00 9
2015-04-29 12:00:00 48
Freq: H, dtype: int64

If we upsample to data points taken every 15 minutes, our time series will be extended with NaN values:

>>> ts.resample('15min').head()
2015-04-29 08:00:00 30
2015-04-29 08:15:00 NaN
2015-04-29 08:30:00 NaN
2015-04-29 08:45:00 NaN
2015-04-29 09:00:00 27
Freq: 15T, dtype: float64
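
On pandas 0.18 and later, resample returns a deferred Resampler object rather than a Series, so the NaN-extended series has to be requested explicitly with asfreq. The following is a minimal sketch under that assumption; it uses the first five values printed above as fixed data so that the run is repeatable:

```python
import pandas as pd

# First five hourly values from the example above, fixed for repeatability.
# Older pandas releases spell the hourly alias "H".
rng = pd.date_range("2015-04-29 08:00", periods=5, freq="h")
ts = pd.Series([30, 27, 54, 9, 48], index=rng)

# resample() only groups the data; asfreq() materializes the 15-minute grid
# and leaves every newly created slot as NaN.
upsampled = ts.resample("15min").asfreq()
print(upsampled.head())
```

Because NaN is a floating-point value, the result is a float64 series even though the input holds integers.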

There are various ways to deal with missing values, which can be controlled by the fill_method keyword argument to resample. Values can be filled either forward or backward:

>>> ts.resample('15min', fill_method='ffill').head()
2015-04-29 08:00:00 30
2015-04-29 08:15:00 30
2015-04-29 08:30:00 30
2015-04-29 08:45:00 30
2015-04-29 09:00:00 27
Freq: 15T, dtype: int64
>>> ts.resample('15min', fill_method='bfill').head()
2015-04-29 08:00:00 30
2015-04-29 08:15:00 27
2015-04-29 08:30:00 27
2015-04-29 08:45:00 27
2015-04-29 09:00:00 27
Freq: 15T, dtype: int64
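
On pandas 0.18 and later, the same two fills are spelled as methods on the Resampler object instead of a fill_method argument. A small sketch, assuming such a version and the same fixed five-value series:

```python
import pandas as pd

rng = pd.date_range("2015-04-29 08:00", periods=5, freq="h")
ts = pd.Series([30, 27, 54, 9, 48], index=rng)

# Forward fill: each new slot repeats the last known measurement.
forward = ts.resample("15min").ffill()
# Backward fill: each new slot takes the next known measurement.
backward = ts.resample("15min").bfill()

print(forward.head())
print(backward.head())
```

Forward filling never uses a future value to fill an earlier slot, which matters when the series feeds a model that must not see the future; backward filling does exactly that.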

With the limit parameter, it is possible to control the number of missing values to be filled:

>>> ts.resample('15min', fill_method='ffill', limit=2).head()
2015-04-29 08:00:00 30
2015-04-29 08:15:00 30
2015-04-29 08:30:00 30
2015-04-29 08:45:00 NaN
2015-04-29 09:00:00 27
Freq: 15T, dtype: float64
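
With the method-based API of pandas 0.18 and later, limit becomes an argument of the fill method itself. A sketch under the same assumptions as before (fixed five-value series, recent pandas):

```python
import pandas as pd

rng = pd.date_range("2015-04-29 08:00", periods=5, freq="h")
ts = pd.Series([30, 27, 54, 9, 48], index=rng)

# Fill at most two consecutive gaps after each measurement;
# the third gap in every hour stays NaN.
limited = ts.resample("15min").ffill(limit=2)
print(limited.head())
```

The remaining NaN values make it visible where a measurement was considered too old to carry forward.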

If you want to adjust the labels during resampling, you can use the loffset keyword argument:

>>> ts.resample('15min', fill_method='ffill', limit=2, loffset='5min').head()
2015-04-29 08:05:00 30
2015-04-29 08:20:00 30
2015-04-29 08:35:00 30
2015-04-29 08:50:00 NaN
2015-04-29 09:05:00 27
Freq: 15T, dtype: float64
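
Recent pandas releases no longer accept loffset. One way to get the same labels, shown here as a sketch rather than the only option, is to add a fixed offset to the index after resampling:

```python
import pandas as pd

rng = pd.date_range("2015-04-29 08:00", periods=5, freq="h")
ts = pd.Series([30, 27, 54, 9, 48], index=rng)

limited = ts.resample("15min").ffill(limit=2)

# Move every label five minutes later; the values themselves are untouched.
shifted = limited.copy()
shifted.index = shifted.index + pd.Timedelta(minutes=5)
print(shifted.head())
```

Only the labels move: the value recorded at 08:00 is now reported under 08:05.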

There is another way to fill in missing values. We could employ an algorithm to construct new data points that would somehow fit the existing points, for some definition of somehow. This process is called interpolation.

We can ask Pandas to interpolate a time series for us:

>>> tsx = ts.resample('15min')
>>> tsx.interpolate().head()
2015-04-29 08:00:00 30.00
2015-04-29 08:15:00 29.25
2015-04-29 08:30:00 28.50
2015-04-29 08:45:00 27.75
2015-04-29 09:00:00 27.00
Freq: 15T, dtype: float64

This is the default interpolation method, linear interpolation, in action: Pandas assumes a linear relationship between two existing points and places the new values on the straight line between them.
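
To see that the interpolated values really lie on a straight line, we can compare them with a hand computation for the first hour. This sketch assumes a recent pandas, where interpolate can be called directly on the Resampler object, and again uses the first five values as fixed data:

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2015-04-29 08:00", periods=5, freq="h")
ts = pd.Series([30, 27, 54, 9, 48], index=rng)

# Upsample to 15 minutes and fill the new slots by linear interpolation.
interpolated = ts.resample("15min").interpolate()

# Hand computation for the first hour: four equal steps from 30 down to 27,
# that is, a step of (27 - 30) / 4 = -0.75 per quarter hour.
manual = np.linspace(30, 27, num=5)

print(interpolated.head().to_numpy())
print(manual)
```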

Pandas supports over a dozen interpolation functions, some of which require the scipy library to be installed. We will not cover interpolation methods in this chapter, but we encourage you to explore the various methods yourself. The right interpolation method will depend on the requirements of your application.
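
As a small taste of how the choice of method can matter, the following sketch compares method='linear' with method='time', neither of which needs scipy. The three-point series is hypothetical and deliberately unevenly spaced, because on a regular grid the two methods agree:

```python
import numpy as np
import pandas as pd

# Hypothetical, unevenly spaced data: the gap sits 15 minutes after the
# first point and 45 minutes before the last one.
idx = pd.to_datetime(["2015-04-29 08:00", "2015-04-29 08:15", "2015-04-29 09:00"])
s = pd.Series([0.0, np.nan, 4.0], index=idx)

# 'linear' ignores the index and treats the points as equally spaced.
by_position = s.interpolate(method="linear")
# 'time' weights the neighbours by the elapsed time between timestamps.
by_time = s.interpolate(method="time")

print(by_position.iloc[1])  # halfway between 0 and 4
print(by_time.iloc[1])      # a quarter of the way from 0 to 4
```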
