amazon web services - Hive cannot find file from distributed cache on EMR


I'm trying to run a UDF in Hive that should scan through an external CSV file, using a value from the table as an argument. The query I use:

add jar s3://bucket_name/udf/hiveudf.jar;
add file hdfs:///myfile/myfile.csv;
create temporary function myfunc as '....udf.myudf';
select mydate, record_id, value, myfunc('myfile.csv', value) from my_table;

The results are unstable: in some cases the exact same query works fine, but in 80% of cases it returns an exception:

java.io.FileNotFoundException: myfile.csv (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:146)
    at java.io.FileInputStream.<init>(FileInputStream.java:101)
    at java.io.FileReader.<init>(FileReader.java:58)

...

The file seems to have been added to the distributed cache:

hive> list files;
/mnt/tmp/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx_resources/myfile.csv

I tried various releases of EMR and various instance types, and couldn't find any pattern or trigger for the issue. Any advice is highly appreciated.

You might enable debug logging to find more info. In general, I've seen similar issues when there is a resize (shrink) of the EMR cluster, which causes blocks of the expected HDFS distributed cache file to be removed from the cluster because of insufficient replication.
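
One way to get more information out of the failing tasks is to have the UDF check for the file before opening it and report what the task's working directory actually contains. The UDF's source is not shown in the question, so the following is only a sketch under that assumption: `CacheFileProbe` and `open` are hypothetical names, and the code uses plain JDK classes rather than any Hive API.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical helper (not part of Hive): opens a file that is expected to be
// present in the task's working directory and, if it is missing, fails with a
// message that lists what the directory really contains.
final class CacheFileProbe {

    static BufferedReader open(String name) throws IOException {
        File file = new File(name);
        if (!file.isFile()) {
            File cwd = new File(".").getAbsoluteFile();
            // list() returns null if the directory cannot be read;
            // Arrays.toString prints "null" in that case.
            String[] listing = cwd.list();
            throw new FileNotFoundException(
                name + " not found; working directory " + cwd
                + " contains " + Arrays.toString(listing));
        }
        return new BufferedReader(new FileReader(file));
    }
}
```

If the UDF called something like `CacheFileProbe.open("myfile.csv")` instead of constructing a `FileReader` directly, the exception in the task logs would show whether the file was never localized on that node or simply ended up under a different path.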

