Spark Worker not joining Master after Master dies and comes back


I am wondering how liveness is checked: does the worker ping the master to check that the master is alive, or does the master (resource manager) ping the workers to check that they are alive and respawn them if they are dead? Or both?

Some info: standalone cluster.
1 master: 8 cores, 12 GB.
32 workers: each 8 cores, 8 GB.

My main problem is this. Here is what happened:

Master M was running with 32 workers.
Workers 1 and 2 died at 03:55:00, leaving the cluster with 30 workers.

Worker 1' came up at 03:55:12.000 and connected to M.
Worker 2' came up at 03:55:16.000 and connected to M.

Master M died at 03:56:00.
A new master NM came up at 03:56:30.
Workers 1' and 2' did not connect to NM.
The remaining 30 workers connected to NM.

So NM has 30 workers.

I am wondering why those 2 won't connect to the new master NM, even though master M is definitely dead.

PS: I have a load balancer (LB) set up in front of the master, which means that whenever a new master comes up, the LB starts pointing to the new one.

A load balancer won't resolve the problem here. For Spark workers to recognize a new master, you have to configure Spark in high availability mode. Spark standalone supports 2 HA configurations:

  • Standby masters with ZooKeeper.
  • Single-node recovery using the file system.

The latter solution is simpler, but it requires a reliable, distributed file system to store spark.deploy.recoveryDirectory, unless you recover the master on the same node, of course.

The recovery mode can be configured using the spark.deploy.recoveryMode property (NONE by default), which should be set to ZOOKEEPER or FILESYSTEM for standby masters and single-node recovery respectively.
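As a rough sketch, these properties are usually passed to the master daemon through SPARK_DAEMON_JAVA_OPTS in conf/spark-env.sh. The ZooKeeper hostnames, the znode directory and the recovery path below are placeholders made up for illustration, not values from the question:

```shell
# conf/spark-env.sh on each master node.
# All hostnames and paths here are hypothetical placeholders.

# Option 1: standby masters coordinated through ZooKeeper.
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1.example.com:2181,zk2.example.com:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"

# Option 2: single-node recovery from a directory (use instead of option 1).
# export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
#   -Dspark.deploy.recoveryDirectory=/mnt/shared/spark-recovery"
```

In ZooKeeper mode, workers and applications are normally started with the full list of masters, for example spark://master1:7077,master2:7077 (again hypothetical hosts), rather than a single address behind a load balancer, so that they can re-register with whichever master becomes the leader.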

More details can be found in the high availability documentation.

Related: What happens when Spark master fails?

