Resampling Multiple CSV Files and Automatically Saving Resampled Files with New Names
Solution 1:
I think you can re-use the code from our previous work (here). Using the original code, once the NIGHT and DAY dataframes are created, you can resample them on an hourly, daily, and monthly basis and save the new (resampled) dataframes as .csv files wherever you like.
I am going to use a sample dataframe (first 3 rows shown here):
dates                 PRp       PRe       Norm_Eff  SR_Gen       SR_All
2016-01-01 00:00:00   0.269389  0.517720  0.858603  8123.746453  8770.560467
2016-01-01 00:15:00   0.283316  0.553203  0.862253  7868.675481  8130.974409
2016-01-01 00:30:00   0.286590  0.693997  0.948463  8106.217144  8314.584848
Full Code
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
import random
import calendar

# I defined a sample dataframe with dummy data
start = datetime.datetime(2016,1,1,0,0)
r = range(0,10000)
dates = [start + relativedelta(minutes=15*i) for i in r]
PRp = [random.uniform(.2, .3) for i in r]
PRe = [random.uniform(0.5, .7) for i in r]
Norm_Eff = [random.uniform(0.7, 1) for i in r]
SR_Gen = [random.uniform(7500, 8500) for i in r]
SR_All = [random.uniform(8000, 9500) for i in r]
DF = pd.DataFrame({
    'dates': dates,
    'PRp': PRp,
    'PRe': PRe,
    'Norm_Eff': Norm_Eff,
    'SR_Gen': SR_Gen,
    'SR_All': SR_All,
})

# define when day starts and ends (MUST USE 24 CLOCK)
day = {
    'start': datetime.time(6,0),  # start at 6am (6:00)
    'end': datetime.time(18,0)    # ends at 6pm (18:00)
}

# capture years that appear in dataframe
min_year = DF.dates.min().year
max_year = DF.dates.max().year
if min_year == max_year:
    yearRange = [min_year]
else:
    yearRange = range(min_year, max_year+1)

# time of day of every row, computed once and reused in the filters below
times = DF.dates.dt.time

# iterate over each year and each month within each year
for year in yearRange:
    for month in range(1,13):
        # first instant of this month and of the next month; comparing with
        # "< month_end" keeps the whole last day of the month in the selection
        month_start = datetime.datetime(year, month, 1)
        month_end = month_start + relativedelta(months=1)
        in_month = (DF.dates >= month_start) & (DF.dates < month_end)

        # filter to show NIGHT and DAY dataframe for given month within given year
        NIGHT = DF[in_month & ((times <= day['start']) | (times >= day['end']))]
        DAY = DF[in_month & ((times > day['start']) & (times < day['end']))]

        # resampled column must be placed in index
        NIGHT = NIGHT.set_index('dates')
        DAY = DAY.set_index('dates')

        # Create resampled dataframes on Hourly, Daily, Monthly basis
        # (note: pandas 2.2 and later prefer the aliases 'h' and 'ME' over 'H' and 'M')
        for resample_freq, freq_tag in zip(['H','D','M'], ['Hourly','Daily','Monthly']):
            NIGHT_R = pd.DataFrame(data={
                'PRp': NIGHT.PRp.resample(rule=resample_freq).mean(),  # averaging data
                'PRe': NIGHT.PRe.resample(rule=resample_freq).mean(),
                'Norm_Eff': NIGHT.Norm_Eff.resample(rule=resample_freq).mean(),
                'SR_Gen': NIGHT.SR_Gen.resample(rule=resample_freq).sum(),  # summing data
                'SR_All': NIGHT.SR_All.resample(rule=resample_freq).sum()
            })
            NIGHT_R.dropna(inplace=True)  # removes the times during 'day' (which show as NA)

            DAY_R = pd.DataFrame(data={
                'PRp': DAY.PRp.resample(rule=resample_freq).mean(),
                'PRe': DAY.PRe.resample(rule=resample_freq).mean(),
                'Norm_Eff': DAY.Norm_Eff.resample(rule=resample_freq).mean(),
                'SR_Gen': DAY.SR_Gen.resample(rule=resample_freq).sum(),
                'SR_All': DAY.SR_All.resample(rule=resample_freq).sum()
            })
            DAY_R.dropna(inplace=True)  # removes the times during 'night' (which show as NA)

            # save to .csv with date and time in file name
            # specify the save path of your choice
            path_night = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}{1}_NIGHT_{2}.csv'.format(year, calendar.month_name[month], freq_tag)
            path_day = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}{1}_DAY_{2}.csv'.format(year, calendar.month_name[month], freq_tag)

            # some of the above NIGHT_R / DAY_R filtering will return no rows.
            # Check for this, and only save if the dataframe contains rows
            if NIGHT_R.shape[0] > 0:
                NIGHT_R.to_csv(path_night, index=True)
            if DAY_R.shape[0] > 0:
                DAY_R.to_csv(path_day, index=True)
The above will result in a total of SIX .csv files per month:
- Hourly basis for daytime
- Daily basis for daytime
- Monthly basis for daytime
- Hourly basis for nighttime
- Daily basis for nighttime
- Monthly basis for nighttime
Each file will have a file name as follows: (Year)(Month_Name)_(DAY/NIGHT)_(Frequency).csv. For example: 2016August_NIGHT_Daily.csv
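As a rough sketch of that naming scheme in isolation (the helper name build_output_name is my own and is not part of the code above; it only builds the file name, not the full save path):

```python
import calendar


def build_output_name(year, month, period, freq_tag):
    """Return a file name such as '2016August_NIGHT_Daily.csv'.

    period is 'DAY' or 'NIGHT'; freq_tag is 'Hourly', 'Daily' or 'Monthly'.
    month is the month number (1-12), converted to its English name.
    """
    return '{0}{1}_{2}_{3}.csv'.format(
        year, calendar.month_name[month], period, freq_tag)


print(build_output_name(2016, 8, 'NIGHT', 'Daily'))
```

Keeping the name in one place like this should make it easier to change the pattern later without touching the resampling loop.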
Let me know if the above achieves the goal or not.
Also, here is a list of available resample frequencies you can choose from: pandas resample documentation
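To get a feel for how the frequency rule changes the result, here is a minimal sketch on a tiny made-up series (eight 15-minute readings, values 1 to 8). It uses the lowercase 'h' alias, which I believe recent pandas versions prefer over the 'H' used in the code above; treat the exact aliases as an assumption to check against your pandas version.

```python
import pandas as pd

# Eight made-up readings at 15-minute intervals, covering two full hours
index = pd.date_range('2016-01-01 00:00', periods=8, freq='15min')
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=index)

# Same data, different rules and aggregations
hourly_sum = values.resample('h').sum()    # one row per hour, values added up
hourly_mean = values.resample('h').mean()  # one row per hour, values averaged
daily_sum = values.resample('D').sum()     # one row per day

print(hourly_sum)
print(hourly_mean)
print(daily_sum)
```

Columns such as PRp are averaged and columns such as SR_Gen are summed in the code above, so it is worth checking on a small sample like this that each column gets the aggregation you intend.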
Solution 2:
@NickBraunagel sincere thanks for the time you've put into this question. I apologise for my tardiness in replying; I was on vacation and have only just returned. Your code looks very good and is potentially more efficient than my own. I will run it as soon as work quietens down to see if this is the case. However, while I was waiting for a response, I managed to solve the issue myself. I have uploaded the code below.
To avoid writing out each column name and whether to 'mean' or 'sum' the data over the re-sampling time period, I have manually created another file, saved as a .csv, that lists the column headers in row 1 and "mean" or "sum" below each header (n columns x 2 rows). I then convert this csv to a dictionary and refer to it in the re-sampling code. See below.
Also, I import data that has already been split into 24Hour, Daytime and Nighttime files, and then re-sample it.
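To illustrate the mapping-file idea on its own, here is a minimal sketch. The two-column mapping and the small dataset are made up for the example (an in-memory string stands in for the manually created csv), and the column names PRp and SR_Gen are borrowed from Solution 1 purely for illustration.

```python
import io
import pandas as pd

# Stand-in for the manually created mapping file:
# row 1 = column headers, row 2 = aggregation to apply to each column
mapping_csv = io.StringIO("PRp,SR_Gen\nmean,sum\n")
mapping = pd.read_csv(mapping_csv)

# First data row of the mapping becomes {'column name': 'aggregation'}
what_to_do = mapping.iloc[0].to_dict()

# Small made-up dataset at 15-minute resolution (two full hours)
index = pd.date_range("2016-01-01 00:00", periods=8, freq="15min")
df = pd.DataFrame(
    {"PRp": [0.2, 0.4, 0.2, 0.4, 0.3, 0.3, 0.3, 0.3],
     "SR_Gen": [1, 1, 1, 1, 2, 2, 2, 2]},
    index=index,
)

# One call aggregates each column according to the mapping
hourly = df.resample("h").agg(what_to_do)
print(what_to_do)
print(hourly)
```

The benefit of this design is that adding a column only means adding it to the mapping file; the re-sampling code itself stays unchanged.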
import pandas as pd
import glob

#project specific paths - comment (#) all paths not relevant

#read in manually created re-sampling csv file to reference later as a dictionary in the re-sampling code
#the file below consists of n columns x 2 rows, where row 1 is the column headers and row 2 specifies
#whether that column is to be averaged ('mean') or summed ('sum') over the re-sampling time period
f = pd.read_csv('C:/Users/cp_vm/Documents/ResampleData/AllData.csv')

#convert manually created resampling csv to dictionary ({'columnname': resample, 'columnname2': resample2})
recordcol = list(f.columns.values)
recordrow = f.iloc[0,:]
how_map = dict(zip(recordcol,recordrow))
what_to_do = dict(zip(f.columns, [how_map[x] for x in recordcol]))

#this is not very efficient, but for the time being, comment (#) all paths not relevant
#meaning run the script multiple times, each time changing the inpath and outpath
#read in datafiles via their specific paths: order - AllData 24Hour, AllData DayTime, AllData NightTime
inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/24Hour/'
outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/24Hour/{0}_{1}_{2}_AllData_24Hour.csv'
#inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/Daytime/'
#outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/Daytime/{0}_{1}_{2}_AllData_Daytime.csv'
#inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/Nighttime/'
#outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/Nighttime/{0}_{1}_{2}_AllData_Nighttime.csv'
allFiles = glob.glob(inpath + "/*.csv")

#resample all incoming files to be hourly-h, daily-d, or monthly-m and export with automatic naming of files
for files_ in allFiles:
    #read in all files_
    df = pd.read_csv(files_, index_col = None, parse_dates = ['Datetime'])
    df.index = pd.to_datetime(df.Datetime)
    #change Datetime column to be numeric, so it can be resampled without being removed
    df['Datetime'] = pd.to_numeric(df['Datetime'])
    #specify year and month for automatic naming of files
    year = df.index.year[1]
    month = df.index.month[1]
    #comment (#) irrelevant resampling, so run it three times, changing h, d and m
    #(note: pandas 2.2 and later use 'ME' as the month-end alias)
    resample = "h"
    #resample = "d"
    #resample = "m"
    #resample df based on the dictionary defined by what_to_do and resample - please note that 'Datetime'
    #has the resampling 'min' associated to it in the manually created re-sampling csv file
    df = df.resample(resample).agg(what_to_do)
    #drop rows where all column values are non existent
    df = df.dropna(how='all')
    #change Datetime column back to datetime.datetime format
    df.Datetime = pd.to_datetime(df.Datetime)
    #make datetime column the index
    df.index = df.Datetime
    #move datetime column to the front of dataframe
    cols = list(df.columns.values)
    cols.pop(cols.index('Datetime'))
    df = df[['Datetime'] + cols]
    #export all files automating their names dependent on their datetime
    #if the dataframe has any rows, then export it
    if df.shape[0] > 0:
        df.to_csv(outpath.format(year,month,resample), index=False)
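One possible way to avoid re-running the script with different lines commented out is to loop over the frequencies inside a function. The sketch below is only an illustration under a few assumptions of mine: the function name resample_all is hypothetical, 'Datetime' is read straight into the index (so it should not appear in what_to_do), and only the 'h' and 'D' rules are used by default because the month-end alias differs between pandas versions.

```python
import glob
import os
import pandas as pd


def resample_all(inpath, outpattern, what_to_do, freqs=("h", "D")):
    """Resample every csv in inpath at each frequency in freqs.

    outpattern is a format string with three slots: year, month, frequency.
    Returns the list of files that were written.
    """
    written = []
    for file_ in glob.glob(os.path.join(inpath, "*.csv")):
        # 'Datetime' becomes the index, so it survives resampling on its own
        df = pd.read_csv(file_, parse_dates=["Datetime"], index_col="Datetime")
        if df.empty:
            continue
        # year and month of the first row are used for the file name
        year, month = df.index[0].year, df.index[0].month
        for freq in freqs:
            out = df.resample(freq).agg(what_to_do).dropna(how="all")
            # only export if the resampled dataframe has any rows
            if out.shape[0] > 0:
                target = outpattern.format(year, month, freq)
                out.to_csv(target)
                written.append(target)
    return written
```

The function could then be called once per input folder (24Hour, Daytime, Nighttime), each time with the matching output pattern.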