I'm new to Scala and Spark, and I don't know how to explode the "path" field and find the max and min of the "event_dttm" field in one pass. I have this data:
    val weblog = sc.parallelize(Seq(
      ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424954530, "so.01"),
      ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424964830, "so.01"),
      ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424978445, "so.01")
    )).toDF("id", "domain", "path", "event_dttm", "load_src")
I need the following result:
    "id"         | "domain"       | "newpath" | "max_time"  | "min_time"  | "load_src"
    39f0412b4c91 | staticnavi.com | panel     | 1424978445  | 1424954530  | so.01
    39f0412b4c91 | staticnavi.com | cm.html   | 1424978445  | 1424954530  | so.01
I think it's possible to do this via a row function, but I don't know how.
You are looking for explode(), followed by a groupBy aggregation:
    import org.apache.spark.sql.functions.{explode, min, max}

    val result = weblog
      .withColumn("path", explode($"path"))
      .groupBy("id", "domain", "path", "load_src")
      .agg(min($"event_dttm").as("min_time"),
           max($"event_dttm").as("max_time"))

    result.show()
    +------------+--------------+-------+--------+----------+----------+
    |          id|        domain|   path|load_src|  min_time|  max_time|
    +------------+--------------+-------+--------+----------+----------+
    |39f0412b4c91|staticnavi.com|  panel|   so.01|1424954530|1424978445|
    |39f0412b4c91|staticnavi.com|cm.html|   so.01|1424954530|1424978445|
    +------------+--------------+-------+--------+----------+----------+
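If it helps to see what explode plus groupBy computes, here is a plain-Scala sketch of the same logic on the sample rows, using only standard collections (the `Hit` case class and all names are mine, not part of any Spark API):

```scala
// Hypothetical case class mirroring the DataFrame columns.
case class Hit(id: String, domain: String, path: Seq[String],
               eventDttm: Long, loadSrc: String)

val rows = Seq(
  Hit("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424954530L, "so.01"),
  Hit("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424964830L, "so.01"),
  Hit("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424978445L, "so.01")
)

// "explode": one output row per element of the path array.
val exploded = rows.flatMap { h =>
  h.path.map(p => ((h.id, h.domain, p, h.loadSrc), h.eventDttm))
}

// "groupBy + agg": min/max of event_dttm per (id, domain, path, load_src).
val result = exploded
  .groupBy(_._1)
  .map { case (key, grp) =>
    val times = grp.map(_._2)
    key -> (times.min, times.max)
  }

result.foreach(println)
```

Each key `(id, domain, path, load_src)` maps to `(min_time, max_time)`, matching the Spark output above.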