pandas - filter hourly data in python


I have different sets of data in 5 minute, 15 minute or 30 minute intervals. There are hundreds of such files (in different formats - .dat, .txt, .csv etc.) and I want to filter out hourly data from these files using Python. I am new to pandas and still learning the library, so any help would be appreciated.

date        time      point_1
27/3/2017   0:00:00   13.08
27/3/2017   0:05:00   12.96
27/3/2017   0:10:00   13.3
27/3/2017   0:15:00   13.27
27/3/2017   0:20:00   13.15
27/3/2017   0:25:00   13.14
27/3/2017   0:30:00   13.25
27/3/2017   0:35:00   13.26
27/3/2017   0:40:00   13.24
27/3/2017   0:45:00   13.43
27/3/2017   0:50:00   13.23
27/3/2017   0:55:00   13.27
27/3/2017   1:00:00   13.19
27/3/2017   1:05:00   13.17
27/3/2017   1:10:00   13.1
27/3/2017   1:15:00   13.06
27/3/2017   1:20:00   12.99
27/3/2017   1:25:00   13.08
27/3/2017   1:30:00   13.04
27/3/2017   1:35:00   13.06
27/3/2017   1:40:00   13.07
27/3/2017   1:45:00   13.07
27/3/2017   1:50:00   13.02
27/3/2017   1:55:00   13.13
27/3/2017   2:00:00   12.99

You can use the read_csv parameter parse_dates to first combine the date and time columns into a single datetime column:

import pandas as pd
from io import StringIO  # io.StringIO replaces the older pandas.compat import

temp = u"""date        time    point_1
27/3/2017   0:00:00 13.08
27/3/2017   0:05:00 12.96
27/3/2017   0:10:00 13.3
27/3/2017   0:15:00 13.27
27/3/2017   0:20:00 13.15
27/3/2017   0:25:00 13.14
27/3/2017   0:30:00 13.25
27/3/2017   0:35:00 13.26
27/3/2017   0:40:00 13.24
27/3/2017   0:45:00 13.43
27/3/2017   0:50:00 13.23
27/3/2017   0:55:00 13.27
27/3/2017   1:00:00 13.19
27/3/2017   1:05:00 13.17
27/3/2017   1:10:00 13.1
27/3/2017   1:15:00 13.06
27/3/2017   1:20:00 12.99
27/3/2017   1:25:00 13.08
27/3/2017   1:30:00 13.04
27/3/2017   1:35:00 13.06
27/3/2017   1:40:00 13.07
27/3/2017   1:45:00 13.07
27/3/2017   1:50:00 13.02
27/3/2017   1:55:00 13.13
27/3/2017   2:00:00 12.99"""

# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp),
                 sep=r"\s+",  # alternatively delim_whitespace=True
                 index_col=[0],
                 parse_dates={'dates': ['date', 'time']})

Then resample and aggregate by first, or by sum, mean, etc.:

df1 = df.resample('1h')['point_1'].first().reset_index()
print (df1)
                dates  point_1
0 2017-03-27 00:00:00    13.08
1 2017-03-27 01:00:00    13.19
2 2017-03-27 02:00:00    12.99

df1 = df.resample('1h')['point_1'].sum().reset_index()
print (df1)
                dates  point_1
0 2017-03-27 00:00:00   158.58
1 2017-03-27 01:00:00   156.98
2 2017-03-27 02:00:00    12.99
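Any other aggregation works the same way - a mean per hour, for example. The sketch below is illustrative (not part of the original answer); it uses an abbreviated version of the sample data and builds the datetime index explicitly with pd.to_datetime for portability across pandas versions:

```python
import io
import pandas as pd

# abbreviated version of the sample data above
temp = u"""date        time    point_1
27/3/2017   0:00:00 13.08
27/3/2017   0:30:00 13.25
27/3/2017   1:00:00 13.19
27/3/2017   1:30:00 13.04
27/3/2017   2:00:00 12.99"""

df = pd.read_csv(io.StringIO(temp), sep=r"\s+")
# combine the two columns; the format matches the sample files
df['dates'] = pd.to_datetime(df['date'] + ' ' + df['time'],
                             format='%d/%m/%Y %H:%M:%S')
df = df.set_index('dates')[['point_1']]

# mean of all readings that fall inside each hour
df1 = df.resample('1h')['point_1'].mean().reset_index()
print(df1)
```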

Another solution uses groupby with Grouper:

df1 = df.groupby(pd.Grouper(freq='1h')).first().reset_index()
print (df1)
                dates  point_1
0 2017-03-27 00:00:00    13.08
1 2017-03-27 01:00:00    13.19
2 2017-03-27 02:00:00    12.99
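Grouper also lets you compute several aggregations in one pass via agg. This is an illustrative sketch (not from the original answer), again using abbreviated sample data and an explicitly built datetime index:

```python
import io
import pandas as pd

temp = u"""date        time    point_1
27/3/2017   0:00:00 13.08
27/3/2017   0:30:00 13.25
27/3/2017   1:00:00 13.19
27/3/2017   1:30:00 13.04
27/3/2017   2:00:00 12.99"""

df = pd.read_csv(io.StringIO(temp), sep=r"\s+")
df['dates'] = pd.to_datetime(df['date'] + ' ' + df['time'],
                             format='%d/%m/%Y %H:%M:%S')
df = df.set_index('dates')[['point_1']]

# first and mean reading for every hour, side by side
df1 = df.groupby(pd.Grouper(freq='1h'))['point_1'].agg(['first', 'mean'])
print(df1)
```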

Or maybe you need the opposite - keeping only the readings that do not fall exactly on the hour:

df = pd.read_csv(StringIO(temp), delim_whitespace=True, parse_dates={'dates': ['date', 'time']})

mask = df.dates.dt.round('h').ne(df.dates)
df1 = df[mask]
print (df1)
                 dates  point_1
1  2017-03-27 00:05:00    12.96
2  2017-03-27 00:10:00    13.30
3  2017-03-27 00:15:00    13.27
4  2017-03-27 00:20:00    13.15
5  2017-03-27 00:25:00    13.14
6  2017-03-27 00:30:00    13.25
7  2017-03-27 00:35:00    13.26
8  2017-03-27 00:40:00    13.24
9  2017-03-27 00:45:00    13.43
10 2017-03-27 00:50:00    13.23
11 2017-03-27 00:55:00    13.27
13 2017-03-27 01:05:00    13.17
14 2017-03-27 01:10:00    13.10
15 2017-03-27 01:15:00    13.06
16 2017-03-27 01:20:00    12.99
17 2017-03-27 01:25:00    13.08
18 2017-03-27 01:30:00    13.04
19 2017-03-27 01:35:00    13.06
20 2017-03-27 01:40:00    13.07
21 2017-03-27 01:45:00    13.07
22 2017-03-27 01:50:00    13.02
23 2017-03-27 01:55:00    13.13
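Since the question mentions hundreds of files in mixed formats (.dat, .txt, .csv), the same recipe can be wrapped in a loop. The sketch below assumes whitespace-delimited files with the date/time layout shown above; the 'data' directory, the '.hourly.csv' output suffix and the hourly_first helper name are placeholders, not part of the original answer:

```python
import glob
import os
import pandas as pd

def hourly_first(path):
    """Read one whitespace-delimited file and keep the first reading per hour."""
    df = pd.read_csv(path, sep=r"\s+")
    df['dates'] = pd.to_datetime(df['date'] + ' ' + df['time'],
                                 format='%d/%m/%Y %H:%M:%S')
    df = df.set_index('dates')
    return df.resample('1h').first().reset_index()

# process every supported file in a (hypothetical) data directory
for path in glob.glob(os.path.join('data', '*.*')):
    if path.endswith(('.dat', '.txt', '.csv')):
        hourly_first(path).to_csv(path + '.hourly.csv', index=False)
```

Swap .first() for .mean(), .sum(), etc. if a different hourly aggregation is wanted.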
