I have different sets of data at 5-minute, 15-minute or 30-minute intervals. There are hundreds of such files (in different formats: .dat, .txt, .csv, etc.). How can I filter out the hourly data from these files using Python? I am new to pandas and still learning the library, so any help is appreciated.
date time point_1
27/3/2017 0:00:00 13.08
27/3/2017 0:05:00 12.96
27/3/2017 0:10:00 13.3
27/3/2017 0:15:00 13.27
27/3/2017 0:20:00 13.15
27/3/2017 0:25:00 13.14
27/3/2017 0:30:00 13.25
27/3/2017 0:35:00 13.26
27/3/2017 0:40:00 13.24
27/3/2017 0:45:00 13.43
27/3/2017 0:50:00 13.23
27/3/2017 0:55:00 13.27
27/3/2017 1:00:00 13.19
27/3/2017 1:05:00 13.17
27/3/2017 1:10:00 13.1
27/3/2017 1:15:00 13.06
27/3/2017 1:20:00 12.99
27/3/2017 1:25:00 13.08
27/3/2017 1:30:00 13.04
27/3/2017 1:35:00 13.06
27/3/2017 1:40:00 13.07
27/3/2017 1:45:00 13.07
27/3/2017 1:50:00 13.02
27/3/2017 1:55:00 13.13
27/3/2017 2:00:00 12.99
You can use read_csv with the parameter parse_dates to combine the date and time columns into a single datetime column first:
import pandas as pd
from io import StringIO

temp = u"""date time point_1
27/3/2017 0:00:00 13.08
27/3/2017 0:05:00 12.96
27/3/2017 0:10:00 13.3
27/3/2017 0:15:00 13.27
27/3/2017 0:20:00 13.15
27/3/2017 0:25:00 13.14
27/3/2017 0:30:00 13.25
27/3/2017 0:35:00 13.26
27/3/2017 0:40:00 13.24
27/3/2017 0:45:00 13.43
27/3/2017 0:50:00 13.23
27/3/2017 0:55:00 13.27
27/3/2017 1:00:00 13.19
27/3/2017 1:05:00 13.17
27/3/2017 1:10:00 13.1
27/3/2017 1:15:00 13.06
27/3/2017 1:20:00 12.99
27/3/2017 1:25:00 13.08
27/3/2017 1:30:00 13.04
27/3/2017 1:35:00 13.06
27/3/2017 1:40:00 13.07
27/3/2017 1:45:00 13.07
27/3/2017 1:50:00 13.02
27/3/2017 1:55:00 13.13
27/3/2017 2:00:00 12.99"""

# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp),
                 sep=r"\s+",  # alternatively delim_whitespace=True
                 index_col=[0],
                 parse_dates={'dates': ['date', 'time']})
Then resample and aggregate with first, or with sum, mean, ...:
df1 = df.resample('1h')['point_1'].first().reset_index()
print (df1)
                dates  point_1
0 2017-03-27 00:00:00    13.08
1 2017-03-27 01:00:00    13.19
2 2017-03-27 02:00:00    12.99
df1 = df.resample('1h')['point_1'].sum().reset_index()
print (df1)
                dates  point_1
0 2017-03-27 00:00:00   158.58
1 2017-03-27 01:00:00   156.98
2 2017-03-27 02:00:00    12.99
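The mean aggregation mentioned above works the same way. A small sketch with illustrative values (not the original data):

```python
import pandas as pd

# Hypothetical 5-minute series covering two hours
idx = pd.date_range('2017-03-27', periods=24, freq='5min')
df = pd.DataFrame({'point_1': range(24)}, index=idx)

# Hourly mean instead of first/sum
hourly_mean = df.resample('1h')['point_1'].mean().reset_index()
print(hourly_mean)
```

Any reduction available on a resampler (max, min, median, ...) can be swapped in the same position.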
Another solution uses groupby with Grouper:
df1 = df.groupby(pd.Grouper(freq='1h')).first().reset_index()
print (df1)
                dates  point_1
0 2017-03-27 00:00:00    13.08
1 2017-03-27 01:00:00    13.19
2 2017-03-27 02:00:00    12.99
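If "filter out hourly data" means keeping only the on-the-hour readings rather than aggregating, boolean indexing on the DatetimeIndex is another option. A minimal sketch with illustrative values (not the original data):

```python
import pandas as pd

# Hypothetical 5-minute index; the real one would come from read_csv
idx = pd.date_range('2017-03-27', periods=25, freq='5min')
df = pd.DataFrame({'point_1': [13.0] * 25}, index=idx)

# Keep only the readings taken exactly on the hour
hourly = df[(df.index.minute == 0) & (df.index.second == 0)]
print(hourly)
```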
Or maybe you need the opposite: removing the rows that fall exactly on the hour:
df = pd.read_csv(StringIO(temp),
                 delim_whitespace=True,
                 parse_dates={'dates': ['date', 'time']})
mask = df.dates.dt.round('h').ne(df.dates)
df1 = df[mask]
print (df1)
                 dates  point_1
1  2017-03-27 00:05:00    12.96
2  2017-03-27 00:10:00    13.30
3  2017-03-27 00:15:00    13.27
4  2017-03-27 00:20:00    13.15
5  2017-03-27 00:25:00    13.14
6  2017-03-27 00:30:00    13.25
7  2017-03-27 00:35:00    13.26
8  2017-03-27 00:40:00    13.24
9  2017-03-27 00:45:00    13.43
10 2017-03-27 00:50:00    13.23
11 2017-03-27 00:55:00    13.27
13 2017-03-27 01:05:00    13.17
14 2017-03-27 01:10:00    13.10
15 2017-03-27 01:15:00    13.06
16 2017-03-27 01:20:00    12.99
17 2017-03-27 01:25:00    13.08
18 2017-03-27 01:30:00    13.04
19 2017-03-27 01:35:00    13.06
20 2017-03-27 01:40:00    13.07
21 2017-03-27 01:45:00    13.07
22 2017-03-27 01:50:00    13.02
23 2017-03-27 01:55:00    13.13
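For the "hundreds of files" part of the question, the same parsing can be wrapped in a small loop. This is only a sketch under assumptions: the files are whitespace-delimited with date and time columns as above, they live in a hypothetical data/ directory, and the output naming is made up.

```python
import glob
import pandas as pd

def extract_hourly(path):
    # Parse one whitespace-delimited file and keep only on-the-hour rows
    df = pd.read_csv(path, sep=r'\s+')
    # dayfirst=True because dates are written day/month/year (e.g. 27/3/2017)
    df['dates'] = pd.to_datetime(df['date'] + ' ' + df['time'], dayfirst=True)
    return df[df['dates'].dt.minute.eq(0) & df['dates'].dt.second.eq(0)]

# 'data' directory and the '_hourly.csv' suffix are assumptions
for path in glob.glob('data/*.dat') + glob.glob('data/*.txt') + glob.glob('data/*.csv'):
    extract_hourly(path).drop(columns=['date', 'time']).to_csv(path + '_hourly.csv', index=False)
```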