Resampling Multiple Csv Files And Automatically Saving Resampled Files With New Names

June 29, 2023

I've really tried hard now for over a week to solve this problem and I cannot seem to find a solution. Some coders have been excellent in helping but unfortunately no one is yet to

Solution 1:

I think you can re-use the code from our previous work (here). Using the original code, when the NIGHT and DAY dataframes are created, you can then resample them on an hourly, daily, and monthly basis and save the new (resampled) dataframes as .csv files wherever you like.
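As a compact illustration of the idea (split into day and night, then resample each part), here is a minimal sketch. It is not the original code: the dummy data, the variable names (`is_day`, `how`, `day_daily`, `night_daily`) and the use of a single `agg` call with a dictionary are assumptions made for the example.

```python
import datetime

import numpy as np
import pandas as pd

# Dummy 15-minute data covering two full days (values are made up).
idx = pd.date_range("2016-01-01", periods=192, freq="15min")
df = pd.DataFrame({"PRp": np.linspace(0.2, 0.3, len(idx)),
                   "SR_Gen": np.full(len(idx), 100.0)}, index=idx)

day_start, day_end = datetime.time(6, 0), datetime.time(18, 0)

# True where the timestamp falls strictly between day_start and day_end.
is_day = np.array([day_start < t < day_end for t in df.index.time])
day, night = df[is_day], df[~is_day]

# Average PRp and sum SR_Gen per calendar day, for each part separately.
how = {"PRp": "mean", "SR_Gen": "sum"}
day_daily = day.resample("D").agg(how).dropna()
night_daily = night.resample("D").agg(how).dropna()

print(day_daily)
print(night_daily)
```

Each resampled frame could then be written out with `to_csv`, as the full code below does.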

I am going to use a sample dataframe (first 3 rows shown here):

dates                 PRp       PRe       Norm_Eff  SR_Gen       SR_All
2016-01-01 00:00:00   0.269389  0.517720  0.858603  8123.746453  8770.560467
2016-01-01 00:15:00   0.283316  0.553203  0.862253  7868.675481  8130.974409
2016-01-01 00:30:00   0.286590  0.693997  0.948463  8106.217144  8314.584848

Full Code

import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
from random import randint
import random
import calendar

# I defined a sample dataframe with dummy data
start = datetime.datetime(2016,1,1,0,0)
r = range(0,10000)

dates = [start + relativedelta(minutes=15*i) for i in r]
PRp = [random.uniform(.2, .3) for i in r]
PRe = [random.uniform(0.5, .7) for i in r]
Norm_Eff = [random.uniform(0.7, 1) for i in r]
SR_Gen = [random.uniform(7500, 8500) for i in r]
SR_All = [random.uniform(8000, 9500) for i in r]

DF = pd.DataFrame({
        'dates': dates,
        'PRp': PRp,
        'PRe': PRe,
        'Norm_Eff': Norm_Eff,
        'SR_Gen': SR_Gen,
        'SR_All': SR_All,
    })



# define when day starts and ends (MUST USE 24 CLOCK)
day = {
        'start': datetime.time(6,0),  # start at 6am (6:00)
        'end': datetime.time(18,0)  # ends at 6pm (18:00)
      }


# capture years that appear in dataframe
min_year = DF.dates.min().year
max_year = DF.dates.max().year

if min_year == max_year:
    yearRange = [min_year]
else:
    yearRange = range(min_year, max_year+1)

# iterate over each year and each month within each year
for year in yearRange:
    for month in range(1,13):

        # filter to show NIGHT and DAY dataframe for given month within given year
        # (upper bound is the start of the next month, so the whole last day of the month is included)
        NIGHT = DF[(DF.dates >= datetime.datetime(year, month, 1)) & 
           (DF.dates < datetime.datetime(year, month, 1) + relativedelta(months=1)) & 
           ((DF.dates.apply(lambda x: x.time()) <= day['start']) | (DF.dates.apply(lambda x: x.time()) >= day['end']))]

        DAY = DF[(DF.dates >= datetime.datetime(year, month, 1)) & 
           (DF.dates < datetime.datetime(year, month, 1) + relativedelta(months=1)) & 
           ((DF.dates.apply(lambda x: x.time()) > day['start']) & (DF.dates.apply(lambda x: x.time()) < day['end']))]

        # Create resampled dataframes on Hourly, Daily, Monthly basis
        for resample_freq, freq_tag in zip(['H','D','M'], ['Hourly','Daily','Monthly']):

            NIGHT.index = NIGHT.dates                           # resampled column must be placed in index
            NIGHT_R = pd.DataFrame(data={
                    'PRp': NIGHT.PRp.resample(rule=resample_freq).mean(),            # averaging data
                    'PRe': NIGHT.PRe.resample(rule=resample_freq).mean(),
                    'Norm_Eff': NIGHT.Norm_Eff.resample(rule=resample_freq).mean(),
                    'SR_Gen': NIGHT.SR_Gen.resample(rule=resample_freq).sum(),        # summing data
                    'SR_All': NIGHT.SR_All.resample(rule=resample_freq).sum()
                })
            NIGHT_R.dropna(inplace=True)  # removes the times during 'day' (which show as NA)

            DAY.index = DAY.dates
            DAY_R = pd.DataFrame(data={
                    'PRp': DAY.PRp.resample(rule=resample_freq).mean(),
                    'PRe': DAY.PRe.resample(rule=resample_freq).mean(),
                    'Norm_Eff': DAY.Norm_Eff.resample(rule=resample_freq).mean(),
                    'SR_Gen': DAY.SR_Gen.resample(rule=resample_freq).sum(),        
                    'SR_All': DAY.SR_All.resample(rule=resample_freq).sum()  
                })
            DAY_R.dropna(inplace=True)  # removes the times during 'night' (which show as NA)

            # save to .csv with date and time in file name
            # specify the save path of your choice
            path_night = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}{1}_NIGHT_{2}.csv'.format(year, calendar.month_name[month], freq_tag)
            path_day = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}{1}_DAY_{2}.csv'.format(year, calendar.month_name[month], freq_tag)

            # some of the above NIGHT_R / DAY_R filtering will return no rows.
            # Check for this, and only save if the dataframe contains rows
            if NIGHT_R.shape[0] > 0:
                NIGHT_R.to_csv(path_night, index=True)
            if DAY_R.shape[0] > 0:
                DAY_R.to_csv(path_day, index=True)

The above will result in a total of SIX .csv files per month:

Hourly basis for daytime
Daily basis for daytime
Monthly basis for daytime
Hourly basis for nighttime
Daily basis for nighttime
Monthly basis for nighttime

Each file will have a file name as follows: (Year)(Month_Name)_(DAY/NIGHT)_(Frequency).csv. For example: 2016August_NIGHT_Daily.csv
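The naming scheme can be isolated in a small helper, which makes it easier to change the output folder in one place. This is only a sketch: `build_csv_name` and the `resampled` folder are hypothetical names, not part of the original code.

```python
import calendar
from pathlib import PurePath


def build_csv_name(year, month, period, freq_tag):
    """Build a name such as '2016August_NIGHT_Daily.csv' (same pattern as the code above)."""
    return "{0}{1}_{2}_{3}.csv".format(year, calendar.month_name[month], period, freq_tag)


out_dir = PurePath("resampled")  # hypothetical output folder, adjust to your own path
path_night = out_dir / build_csv_name(2016, 8, "NIGHT", "Daily")
print(path_night)
```

Note that `calendar.month_name` follows the current locale, so the month part of the name may differ on a non-English system.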

Let me know if the above achieves the goal or not.

Also, here is a list of available resample frequencies you can choose from: pandas resample documentation
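To get a feel for what the frequency string does, the sketch below resamples three days of dummy 15-minute data at three frequencies and counts the resulting rows. It uses the aliases "h", "D" and "MS" (month start) rather than the 'H' and 'M' used above, because, to my knowledge, pandas 2.2 and later warn on the upper-case 'H' and on 'M'; treat the alias choice as an assumption and check the documentation for your pandas version.

```python
import numpy as np
import pandas as pd

# Three full days of dummy 15-minute samples (288 rows).
idx = pd.date_range("2016-01-01", periods=288, freq="15min")
s = pd.Series(np.ones(len(idx)), index=idx)

# Number of rows left after resampling at each frequency.
counts = {freq: len(s.resample(freq).sum()) for freq in ("h", "D", "MS")}
print(counts)
```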

Solution 2:

@NickBraunagel sincere thanks for the time you've put in to this question. I apologise for my tardiness in replying. I was also on vacation and I only just returned. Your code looks very good and is potentially more efficient than my own. I will run it as soon as work quietens down to see if this is the case. However, while I was waiting for a response, I managed to solve the issue myself. I have uploaded the code below.

To avoid writing out each column name and whether to 'mean' or 'sum' the data over the re-sampling time period, I manually created a separate CSV file that lists the column headers in row 1 and "mean" or "sum" below each header (n columns x 2 rows). I then convert this CSV to a dictionary and refer to it in the re-sampling code below.

Also, I import the data already split into 24Hour, Daytime and Nighttime files, and then re-sample.
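The two-row rules file and its conversion to a dictionary can be sketched as follows. An in-memory string stands in for the real CSV, and the column names and values are dummy data chosen for the example.

```python
import io

import pandas as pd

# Stand-in for the manually created file: row 1 = headers, row 2 = 'mean' or 'sum'.
rules_csv = io.StringIO("PRp,SR_Gen\nmean,sum\n")
how_map = pd.read_csv(rules_csv).iloc[0].to_dict()
print(how_map)

# Two hours of dummy 15-minute data.
idx = pd.date_range("2016-01-01", periods=8, freq="15min")
df = pd.DataFrame({"PRp": [0.2, 0.4] * 4, "SR_Gen": [10.0] * 8}, index=idx)

# Each column is aggregated with the function named in the dictionary.
hourly = df.resample("h").agg(how_map)
print(hourly)
```

With this layout, adding a column to the data only requires adding one entry to the rules file.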

import pandas as pd
import glob

#project specific paths - comment (#) all paths not relevant
#read in manually created re-sampling csv file to reference later as a dictionary in the re-sampling code
#the file below consists of n*columns x 2 rows, where row 1 is the column headers and row 2 specifies whether that column is to be averaged ('mean') or summed ('sum') over the re-sampling time period
f = pd.read_csv('C:/Users/cp_vm/Documents/ResampleData/AllData.csv')

#convert manually created resampling csv to dictionary ({'columnname': resample, 'columnname2': resample2})
recordcol = list(f.columns.values)
recordrow = f.iloc[0,:]
how_map=dict(zip(recordcol,recordrow))
what_to_do = dict(zip(f.columns, [how_map[x] for x in recordcol]))

#this is not very efficient, but for the time being, comment (#) all paths not relevant
#meaning run the script multiple times, each time changing the in- and outpaths
#read in datafiles via their specific paths: order - AllData 24Hour, AllData DayTime, AllData NightTime
inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/24Hour/'
outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/24Hour/{0}_{1}_{2}_AllData_24Hour.csv'
#inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/Daytime/'
#outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/Daytime/{0}_{1}_{2}_AllData_Daytime.csv'
#inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/Nighttime/'
#outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/Nighttime/{0}_{1}_{2}_AllData_Nighttime.csv'

allFiles = glob.glob(inpath + "/*.csv")

#resample all incoming files to be hourly-h, daily-d, or monthly-m and export with automatic naming of files
for files_ in allFiles:
    #read in all files_
    df = pd.read_csv(files_,index_col = None, parse_dates = ['Datetime'])
    df.index = pd.to_datetime(df.Datetime)
    #change Datetime column to be numeric, so it can be resampled without being removed
    df['Datetime'] = pd.to_numeric(df['Datetime'])
    #specify year and month for automatic naming of files
    year = df.index.year[1]
    month = df.index.month[1]
    #comment (#) irrelevant resampling, so run it three times, changing h, d and m
    resample = "h"
    #resample = "d"
    #resample = "m"
    #resample df based on the dictionary defined by what_to_do and resample - please note that 'Datetime' has the resampling 'min' associated to it in the manually created re-sampling csv file
    df = df.resample(resample).agg(what_to_do)
    #drop rows where all column values are non existent
    df = df.dropna(how='all')
    #change Datetime column back to datetime.datetime format
    df.Datetime = pd.to_datetime(df.Datetime)
    #make datetime column the index
    df.index = df.Datetime
    #move datetime column to the front of dataframe
    cols = list(df.columns.values)
    cols.pop(cols.index('Datetime'))
    df = df[['Datetime'] + cols]
    #export all files automating their names dependent on their datetime
    #if the dataframe has any rows, then export it
    if df.shape[0] > 0:
        df.to_csv(outpath.format(year,month,resample), index=False)
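The script above is run three times, once per frequency. One possible way to avoid that is to loop over the frequencies inside the file loop. The sketch below is an assumption-laden variant, not the code above: `resample_folder`, the temporary folders, the dummy file and the `{year}_{month}_{freq}_AllData.csv` name are all hypothetical, and it keeps `Datetime` as the index instead of converting it to a numeric column.

```python
import glob
import os
import tempfile

import pandas as pd


def resample_folder(in_dir, out_dir, how_map, freqs=("h", "D", "MS")):
    """Resample every CSV in in_dir at each frequency and save the results."""
    written = []
    for path in sorted(glob.glob(os.path.join(in_dir, "*.csv"))):
        df = pd.read_csv(path, parse_dates=["Datetime"], index_col="Datetime")
        if df.empty:
            continue
        # Year and month of the first row are used in the output file name.
        year, month = df.index[0].year, df.index[0].month
        for freq in freqs:
            out = df.resample(freq).agg(how_map).dropna(how="all")
            if out.shape[0] > 0:
                name = "{0}_{1}_{2}_AllData.csv".format(year, month, freq)
                target = os.path.join(out_dir, name)
                out.to_csv(target, index=True)
                written.append(target)
    return written


# Demo with throwaway folders and dummy data.
with tempfile.TemporaryDirectory() as tmp:
    in_dir = os.path.join(tmp, "in")
    out_dir = os.path.join(tmp, "out")
    os.makedirs(in_dir)
    os.makedirs(out_dir)
    idx = pd.date_range("2016-01-01", periods=192, freq="15min", name="Datetime")
    pd.DataFrame({"PRp": 0.25, "SR_Gen": 100.0}, index=idx).to_csv(
        os.path.join(in_dir, "january.csv"))
    files = resample_folder(in_dir, out_dir, {"PRp": "mean", "SR_Gen": "sum"})
    names = [os.path.basename(f) for f in files]
    print(names)
```

One thing to watch: a summed column is 0 rather than NaN for a bin with no rows, so `dropna(how='all')` may keep empty bins when the input has gaps (for example daytime-only files).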
