I am trying to understand why the following occurred:
1. I have a Docker container with Yarn and Spark running fine, except that the container's clock was X hours behind what I wanted it to be. Running `date` returned a timestamp X hours behind the current time.
2. I managed to fix the above by passing a `TZ` environment variable in the `docker run` command, so when I type `date` I now get the correct timestamp.
3. However, when I run `spark-submit` applications in YARN (cluster mode), the timestamp in the AM logs is still the wrong one (X hours behind).
4. I managed to fix the above by passing a timezone setting for the JVM in `spark-submit`: `--conf 'spark.executor.extraJavaOptions=-Duser.timezone'` and `--conf 'spark.driver.extraJavaOptions=-Duser.timezone'`.
5. This tells me that there was an issue with the JVM that YARN uses. However, when I tried to get the datetime from the Spark Scala shell (using `System.currentTimeMillis()`), it returned the correct time without specifying any of the JVM settings from step 4.
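As a sketch of the mechanisms behind the fixes above (the zone `Europe/Athens` and the `app.jar` path are placeholders I've chosen for illustration; the original post does not name them):

```shell
# TZ controls what `date` (and most libc-based tools) report; this is the
# mechanism behind the step 2 fix via `docker run -e TZ=<zone> ...`.
TZ=UTC date +%Z            # prints: UTC
TZ=Europe/Athens date      # same instant, rendered in the placeholder zone

# A JVM resolves its default zone from the user.timezone property first,
# then from the platform environment. A JVM launched by YARN may not
# inherit the container's TZ, which would be consistent with step 3.
# If a JDK is installed, this shows the JVM's view:
java -XshowSettings:properties -version 2>&1 | grep -i 'timezone' || true

# Step 4's fix, spelled out (zone value is a placeholder; the original
# post passes -Duser.timezone without showing one):
# spark-submit \
#   --conf 'spark.driver.extraJavaOptions=-Duser.timezone=Europe/Athens' \
#   --conf 'spark.executor.extraJavaOptions=-Duser.timezone=Europe/Athens' \
#   app.jar

# Note: System.currentTimeMillis() returns epoch milliseconds, which are
# timezone-independent; only *formatting* a timestamp applies a zone.
```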
Questions
- How can I tell which JVM is used at container launch by the YARN Application Master, and which JVM the Spark Scala shell uses?
- Why are the timestamps different when running in shell/bash versus via `spark-submit`?
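One way to approach the first question is to compare the java binary and launch flags in each context. These commands are a diagnostic sketch; the config path `/etc/hadoop/conf` is a typical default, not something from the post:

```shell
# JVM that a shell-launched process (bash, spark-shell) would pick up:
command -v java && readlink -f "$(command -v java)"

# JVM that YARN uses: NodeManagers resolve JAVA_HOME from their own
# environment / *-env.sh files, which can differ from your shell's:
grep -Hs 'JAVA_HOME' "${HADOOP_CONF_DIR:-/etc/hadoop/conf}"/*-env.sh || true

# For a running Application Master, inspect the actual command line to see
# which java binary launched it and which -D flags (e.g. -Duser.timezone)
# it received:
ps -ef | grep '[A]pplicationMaster' || echo "no AM process found"
```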