apache spark - How to load a field with a null value from JSON as a number in a DataFrame


One of the JSON fields (age below) is meant to be a number, but because it is always null it comes out as a string in the DataFrame's printSchema output.

Input JSON file:

{"age":null,"name":"abc","batch":190}
{"age":null,"name":"abc","batch":190}

Spark code and output:

val df = spark.read.json("/home/white/tmp/a.json")
df.printSchema()
df.show()

********************* output *********************
root
 |-- batch: long (nullable = true)
 |-- age: string (nullable = true)
 |-- name: string (nullable = true)

+-----+----+----+
|batch| age|name|
+-----+----+----+
|  190|null| abc|
|  190|null| abc|
+-----+----+----+

I want age to be a long. Currently I achieve this by creating a new StructType with the age field as long and recreating the DataFrame with df.sqlContext.createDataFrame(df.rdd, newSchema). Can this be done directly with the spark.read.json API?

I think the easiest way is as follows:

spark.read.json("/home/white/tmp/a.json").withColumn("age", 'age.cast(LongType))

This produces the following schema:

root
 |-- age: long (nullable = true)
 |-- batch: long (nullable = true)
 |-- name: string (nullable = true)
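The one-liner above relies on imports that are not shown. A minimal self-contained sketch of the same cast, suitable for pasting into spark-shell, might look like this. The session setup and app name are assumptions for illustration (in spark-shell a `spark` session already exists), and `col("age")` is used in place of the `'age` symbol syntax so that `spark.implicits._` is not required.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// Assumed local session for illustration; getOrCreate() reuses an existing one.
val spark = SparkSession.builder()
  .appName("age-cast-example") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Read with inferred types, then replace the all-null "age" column
// (inferred as string) with a copy cast to long.
val df = spark.read
  .json("/home/white/tmp/a.json")
  .withColumn("age", col("age").cast(LongType))

df.printSchema()
```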

Spark makes a best guess at the types, and it makes sense that it would see null in the JSON and think "string", since String lies on the nullable AnyRef side of the Scala object hierarchy while Long lies on the non-nullable AnyVal side. You just need to cast the column to make Spark treat the data as you see fit.
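As for doing this directly at read time, which the question asks about: `DataFrameReader` accepts an explicit schema via `.schema(...)`, which skips inference altogether. This is not the approach given in the answer above, only a sketch of an alternative. The field names and types are taken from the question's sample data, and the session setup is again an assumption.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Assumed local session for illustration; getOrCreate() reuses an existing one.
val spark = SparkSession.builder()
  .appName("age-schema-example") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Explicit schema: "age" is declared as a nullable long, so null values
// no longer cause the column to be inferred as string.
val schema = StructType(Seq(
  StructField("age", LongType, nullable = true),
  StructField("batch", LongType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// With a schema supplied, Spark does not need to scan the file to infer types.
val df = spark.read.schema(schema).json("/home/white/tmp/a.json")

df.printSchema()
```

The trade-off is that every field has to be listed: with an explicit schema, fields present in the JSON but missing from the schema are not loaded.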

Incidentally, why are you using long rather than int for ages? Those people must eat very healthily.

