One of the JSON fields (age below) is meant to be a number, but because it is represented as null it comes out as a string in the DataFrame's printSchema.
Input JSON file:

{"age":null,"name":"abc","batch":190}
{"age":null,"name":"abc","batch":190}
Spark code and output:
val df = spark.read.json("/home/white/tmp/a.json")
df.printSchema()
df.show()

********************* output *********************

root
 |-- batch: long (nullable = true)
 |-- age: string (nullable = true)
 |-- name: string (nullable = true)

+-----+----+----+
|batch| age|name|
+-----+----+----+
|  190|null| abc|
|  190|null| abc|
+-----+----+----+
I want age to be long. I am currently achieving this by creating a new StructType with the age field as long and recreating the DataFrame with df.sqlContext.createDataFrame(df.rdd, newSchema); a sketch of this workaround follows. Can the same thing be done with the spark.read.json API directly?
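For reference, a minimal sketch of that workaround, assuming the field order shown in the printSchema output above (the names newSchema and fixedDf are illustrative):

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Rebuild the DataFrame from the same rows with age declared as long.
// The field order must match the layout of the existing rows (batch, age, name above).
val newSchema = StructType(Seq(
  StructField("batch", LongType, nullable = true),
  StructField("age", LongType, nullable = true),
  StructField("name", StringType, nullable = true)
))
val fixedDf = df.sqlContext.createDataFrame(df.rdd, newSchema)
fixedDf.printSchema()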
I think the easiest way is as follows:
spark.read.json("/home/white/tmp/a.json").withcolumn("age", 'age.cast(longtype))
This produces the following schema:
root
 |-- age: long (nullable = true)
 |-- batch: long (nullable = true)
 |-- name: string (nullable = true)
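A self-contained version of that one-liner, assuming the same file path and a SparkSession bound to spark (as in the snippets above); the implicits import is what enables the 'age symbol-to-Column syntax outside the spark-shell:

import org.apache.spark.sql.types.LongType
import spark.implicits._  // enables 'age as a Column reference

// Read the file as before, then cast the misinferred string column to long.
val castDf = spark.read
  .json("/home/white/tmp/a.json")
  .withColumn("age", 'age.cast(LongType))

castDf.printSchema()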
Spark makes a best guess on types, and it makes sense that when it sees null in the JSON it infers "string", since String lies on the nullable AnyRef side of the Scala object hierarchy while Long lies on the non-nullable AnyVal side. You need to cast the column to make Spark treat the data as you see fit.
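As a side note on the read-time option asked about in the question: one can also hand the reader an explicit schema so no cast is needed afterwards. A sketch, assuming the same field names and file path (explicitSchema and typedDf are illustrative names):

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Declare the schema up front so the null age values are read as long from the start.
val explicitSchema = StructType(Seq(
  StructField("age", LongType, nullable = true),
  StructField("batch", LongType, nullable = true),
  StructField("name", StringType, nullable = true)
))
val typedDf = spark.read.schema(explicitSchema).json("/home/white/tmp/a.json")
typedDf.printSchema()  // age comes out as long without a withColumn cast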
Incidentally, why are you using Long rather than Int for ages? Those people must eat very healthy.