Spark moving average


I am trying to implement a moving average over a dataset containing a number of time series. Each column represents one parameter being measured, while one row contains all of the parameters measured in a given second. A row looks like:

timestamp, parameter1, parameter2, ..., parametern 

I found a way to do this using window functions, but the following (from the documentation) bugs me:

Partitioning specification: controls which rows will be in the same partition with a given row. Also, the user might want to make sure all rows having the same value for the category column are collected to the same machine before ordering and calculating the frame. If no partitioning specification is given, then all data must be collected to a single machine.

The thing is, I don't have anything to partition by. Can I use this method to calculate the moving average without the risk of collecting all the data on a single machine? If not, is there a better way to do it?
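
For concreteness, the window-function version I found looks roughly like this (a sketch assuming a 60-second average over parameter1; the input path is made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.avg

    val spark = SparkSession.builder.appName("movingAverage").getOrCreate()
    val df = spark.read.option("header", "true").option("inferSchema", "true").csv("timeseries.csv") // hypothetical input

    // One row per second, so 60 rows = 60 seconds. There is no partitionBy,
    // which is exactly what worries me: Spark warns about
    // "No Partition Defined for Window operation" and moves all rows
    // into a single partition.
    val w = Window.orderBy("timestamp").rowsBetween(-59, 0)
    val ma = df.withColumn("parameter1_ma", avg("parameter1").over(w))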

Every nontrivial Spark job demands partitioning. There is no way around it if you want your jobs to finish before the apocalypse. The question is simple: when it comes time for the inevitable aggregation (in your case, an average), how can you partition your data in such a way as to minimize the shuffle by grouping as much related data as possible on the same machine?

My experience with moving averages is with stocks. In that case it's easy; you partition on the stock ticker symbol. After all, the calculation of the 50-day moving average for stock A has nothing to do with stock B, so their data don't need to be on the same machine. The obvious partition makes this a simpler situation than yours. It also helps that the stock case requires only one data point (probably) per day (the closing price of the stock at the end of trading), while you have one per second.
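
For comparison, the stock case looks roughly like this (a sketch; the dataframe, column names, and input file are all made up, and spark is the session as in spark-shell):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.avg

    // Hypothetical input: one closing price per ticker per day.
    val stocks = spark.read.option("header", "true").option("inferSchema", "true").csv("stocks.csv")

    // All rows for a given ticker are collected to the same partition, so
    // the 50-day moving average of stock A never needs stock B's data.
    val w = Window.partitionBy("ticker").orderBy("date").rowsBetween(-49, 0)
    val withMa = stocks.withColumn("ma50", avg("close").over(w))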

So, you may need to consider adding a feature to your data set whose sole purpose is to serve as a partition key, even if it is irrelevant to what you're measuring. I would be surprised if there isn't one, but if not, then consider a time-based partition on days, for example, as sketched below.
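
A minimal sketch of that idea, assuming your dataframe is the df from your question and its timestamp column can be converted to a date:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{avg, col, to_date}

    // Derive a day column from the timestamp and use it as the partition key.
    // Each day's rows are shuffled to one machine, but no single machine
    // ever has to hold the whole dataset.
    val byDay = df.withColumn("day", to_date(col("timestamp")))

    val w = Window.partitionBy("day").orderBy("timestamp").rowsBetween(-59, 0)
    val result = byDay.withColumn("parameter1_ma", avg("parameter1").over(w))

The trade-off is that a window frame cannot cross partition boundaries, so the first rows of each day are averaged over fewer than 60 preceding rows.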

