python - How can I optimize this dataframe filtering?


I have a dataframe of weather data:

                 id        date element  data_value
    0   usw00094889  2014-11-12    tmax          22
    1   usc00208972  2009-04-29    tmin          56
    2   usc00200032  2008-05-26    tmax         278
    3   usc00205563  2005-11-11    tmax         139
    4   usc00200230  2014-02-27    tmax        -106
    5   usw00014833  2010-10-01    tmax         194
    6   usc00207308  2010-06-29    tmin         144
    7   usc00203712  2005-10-04    tmax         289
    8   usw00004848  2007-12-14    tmin         -16
    9   usc00200220  2011-04-21    tmax          72
    10  usc00205822  2013-01-16    tmax          11
    11  usc00205822  2008-05-29    tmin          28
    12  usc00203712  2008-10-17    tmin          17
    13  usc00205563  2006-05-14    tmax         183
    14  usc00200842  2006-05-14    tmax         122
    ....
    165083  usc00200230  2006-11-29    tmin         117

I'd like to make 2 lists: the min and the max temperature for each day. The way I tried doing it was making a list of dates, dates = df['date'].unique(), and then looping through the data and appending the values to the lists:

    mint, maxt = [], []
    for i in dates:
        mint.append(df[(df['date'] == i) & (df['element'] == 'tmin')]['data_value'].min())
        maxt.append(df[(df['date'] == i) & (df['element'] == 'tmax')]['data_value'].max())

I also tried sorting the dataframe by dates and data_values, and then picking out the first value in the list as the max and the last as the min:

    df = df.sort_values(['date', 'data_value'], ascending=False)

    for i in dates:
        # values are sorted descending, so the last one is the min and the first the max
        mint.append(df[df['date'] == i]['data_value'].values[-1])
        maxt.append(df[df['date'] == i]['data_value'].values[0])

But it still takes reeeeeeeally long :( ... Could you please help me make it faster?

You may want to try the pandas.DataFrame.groupby method:

    import io

    import pandas

    # generate test data
    data = """\
    id,date,element,data_value
    usw00094889,2014-11-12,tmax,22
    usc00208972,2014-11-12,tmin,56
    usc00200032,2008-05-26,tmax,278
    usc00205563,2005-11-11,tmax,139
    usc00200230,2014-02-27,tmax,-106
    usw00014833,2010-10-01,tmax,194
    usc00207308,2010-06-29,tmin,144
    usc00203712,2012-06-29,tmax,289
    usw00004848,2007-12-14,tmin,-16
    usc00200220,2011-04-21,tmax,72
    usc00205822,2013-01-16,tmax,11
    usc00205822,2008-05-29,tmin,28
    usc00203712,2006-05-14,tmin,17
    usc00205563,2006-05-14,tmax,183
    usc00200842,2006-05-14,tmax,122
    """

    buffer = io.StringIO(data)
    df = pandas.read_csv(buffer)

    # here is the magic sauce for iteration
    grouper = df.groupby('date')
    df_min_max = pandas.DataFrame(columns=['min', 'max'])

    # you can use the grouper for iteration: it yields (date, sub-dataframe) pairs
    for date, group in grouper:
        df_min_max.loc[date, 'min'] = min(group['data_value'])
        df_min_max.loc[date, 'max'] = max(group['data_value'])

Note: you can add other fields to the output dataframe if you like. Be aware that appending to a dataframe becomes more expensive the larger the dataframe becomes. You may want to append the max and min values to a list instead, depending on how much data you are analyzing.
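If the per-date Python loop is still too slow, the aggregation can probably be pushed into pandas entirely. The following is only a sketch, assuming the column names from the question (`date`, `element`, `data_value`) and a small made-up sample. It takes the minimum of the `tmin` rows and the maximum of the `tmax` rows per date, which is what the original loop in the question computes:

```python
import io

import pandas as pd

# Hypothetical sample in the same shape as the question's data.
csv_text = """id,date,element,data_value
usw00094889,2014-11-12,tmax,22
usc00208972,2014-11-12,tmin,56
usc00203712,2006-05-14,tmin,17
usc00205563,2006-05-14,tmax,183
usc00200842,2006-05-14,tmax,122
"""
df = pd.read_csv(io.StringIO(csv_text))

# Filter once per element, then let groupby compute one value per date.
tmin = df[df["element"] == "tmin"].groupby("date")["data_value"].min()
tmax = df[df["element"] == "tmax"].groupby("date")["data_value"].max()

# Align both results on date; a date missing one element gets NaN there.
daily = pd.concat([tmin, tmax], axis=1, keys=["min", "max"])

# Plain lists, ordered by date, if that is what the rest of the code expects.
mint = daily["min"].tolist()
maxt = daily["max"].tolist()
```

Each filter and each groupby here is a single pass over the data, so the cost should no longer grow with the number of unique dates the way the repeated boolean filtering in the loop does.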

