Python Linear Interpolation Of Values In Dataframe
Solution 1:
Use df.asfreq
to expand the DataFrame to an hourly frequency. NaN is inserted for the missing values:
df = df.asfreq('H')
then use df.interpolate
to replace the NaNs with (linearly) interpolated values based on the DatetimeIndex and the nearest non-NaN values:
df = df.interpolate(method='time')
For example,
import numpy as np
import pandas as pd
N, M = 744, 734
index = pd.date_range('2015-01-01', periods=N, freq='H')
idx = np.random.choice(np.arange(N), M, replace=False)
idx.sort()
index = index[idx]
# This creates a toy DataFrame with 734 non-null rows:
df = pd.DataFrame({'values': np.random.randint(10, size=(M,))}, index=index)
# This expands the DataFrame to 744 rows (10 null rows):
df = df.asfreq('H')
# This makes `df` have 744 non-null rows:
df = df.interpolate(method='time')
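As a quick check after running the snippet above, every row of df should now be non-null:
print(df['values'].isna().sum())  # expect 0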
Solution 2:
What you want requires a combination of this technique: Add missing dates to pandas dataframe,
and the pandas function pandas.Series.interpolate
. From what you've described, the method 'linear' is what you want.
EDIT: Interpolate will not work in the case where you have data points missing at the very start of the time series. One idea is to use pandas.Series.fillna with 'backfill' after the interpolation. Also, do not set fill_value to 0 when you call reindex.
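A minimal sketch of that combination, assuming an hourly series with a gap in the middle and a missing value at the very start (the timestamps and values here are made up for illustration):
import pandas as pd

# Toy series: 02:00 is missing in the middle, and 00:00 is missing at the start.
index = pd.to_datetime(['2015-01-01 01:00', '2015-01-01 03:00', '2015-01-01 04:00'])
s = pd.Series([2.0, 6.0, 8.0], index=index)

# Add the missing dates; leave fill_value unset so the new rows are NaN, not 0.
full_index = pd.date_range('2015-01-01 00:00', '2015-01-01 04:00', freq='H')
s = s.reindex(full_index)

# Linear interpolation fills the interior gap at 02:00 ...
s = s.interpolate(method='linear')

# ... but not the leading NaN at 00:00, so backfill it afterwards.
s = s.fillna(method='backfill')
print(s)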
Solution 3:
A general interpolation is the following:
If the key exists:
- Return the value
else:
- Find the first key before and the first key after the required key, compute the distance to each (using whatever metric suits your data), and take a weighted average of their values, weighted by those distances (closer keys get a higher weight).
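Here is a minimal sketch of that idea for a plain dict with numeric keys, using absolute distance as the metric; the function name interpolate_at is made up for illustration:
def interpolate_at(data, key):
    # Return the stored value if the key exists.
    if key in data:
        return data[key]
    # Otherwise find the nearest keys before and after the required key.
    before = max(k for k in data if k < key)
    after = min(k for k in data if k > key)
    # Weight each neighbour's value by its distance to the other key,
    # so the closer key gets the higher weight.
    d_before = key - before
    d_after = after - key
    return (data[before] * d_after + data[after] * d_before) / (d_before + d_after)

# The value at 2.5 lies halfway between the values at 2 and 3:
print(interpolate_at({2: 10.0, 3: 20.0}, 2.5))  # 15.0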