org.apache.spark.shuffle.FetchFailedException in Spark job in Docker environment

I am running a custom Spark job in a Docker environment on C3 8xLarge machines. I have 6 nodes. The specifications of the cluster are: 32 cores per node and 47 GB RAM per node.

The main idea of the job is as follows:
It is a data-filling job. The input data consists of three major files: primary data, secondary data, and a temporary data file. The secondary data is ingested into Aerospike for filling at runtime, while the temporary data is placed in HDFS. The job checks a flag in MongoDB against a date; if the flag is true, it downloads the primary data from an S3 bucket. It then loads the data from the downloaded compressed file into HDFS and creates an RDD. Next, it filters the data on the primary key from the temporary file, builds a hashmap, and checks for the required columns in the primary data; if they are not found there, it looks them up in Aerospike, and if they are not in the secondary data either, it fills them from the temporary file. Finally, it stores the filled data in HDFS.
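Roughly, the core of the pipeline looks like the simplified sketch below (Spark 1.6 Java API, Java 7 style, no lambdas). This is not the actual code: the paths, the assumption that the primary key is the first CSV column, and the fill logic are placeholders, and the driver-side steps (MongoDB flag check, S3 download) are omitted; the real logic lives in DataProcessor and AeroSpikeFillerMapper.

  // Simplified sketch of the fill pipeline described above; placeholder paths and logic.
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.api.java.function.Function;
  import org.apache.spark.api.java.function.PairFunction;
  import scala.Tuple2;

  public class FillJobSketch {
    public static void main(String[] args) {
      JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("fill-job-sketch"));

      // Compressed primary data downloaded from S3 and placed in HDFS, plus the
      // temporary file that is already sitting in HDFS (placeholder paths).
      JavaRDD<String> primary = sc.textFile("hdfs:///data/primary");
      JavaRDD<String> temporary = sc.textFile("hdfs:///data/temporary");

      // Key both datasets by the primary key (assumed to be the first CSV column).
      PairFunction<String, String, String> byKey = new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String line) {
          return new Tuple2<String, String>(line.split(",", -1)[0], line);
        }
      };
      JavaPairRDD<String, String> primaryByKey = primary.mapToPair(byKey);
      JavaPairRDD<String, String> tempByKey = temporary.mapToPair(byKey);

      // Keep only the primary rows whose key appears in the temporary file, then fill
      // the required columns (primary -> Aerospike -> temporary fallback, omitted here).
      JavaRDD<String> filled = primaryByKey.join(tempByKey).map(
        new Function<Tuple2<String, Tuple2<String, String>>, String>() {
          public String call(Tuple2<String, Tuple2<String, String>> kv) {
            String primaryRow = kv._2()._1();
            String tempRow = kv._2()._2();
            return primaryRow.isEmpty() ? tempRow : primaryRow; // placeholder fill logic
          }
        });

      filled.saveAsTextFile("hdfs:///data/filled");
      sc.stop();
    }
  }

The mapToPair / join / saveAsTextFile shape above corresponds to the operations named in the stack traces further down (mapToPair at DataProcessor.java:409, saveAsTextFile at DataProcessor.java:424).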

I am using the following versions of the services:

  • Spark: 1.6.0
  • Hadoop: Hadoop 2.7.1, Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a,
    Compiled by jenkins on 2015-06-29T06:04Z,
    Compiled with protoc 2.5.0
  • Aerospike: Aerospike Community Edition build 3.8.2.3

The Spark default configurations are:
spark.master spark://spark-hadoop-master-svc:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://spark-hadoop-master-svc:9000/user/spark/logs
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.worker.cleanup.enabled true
spark.worker.cleanup.appDataTtl 86400
spark.driver.memory 2G
spark.sql.tungsten.enabled false
spark.locality.wait 30s
spark.driver.extraJavaOptions -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -Xms2g -Xmx2g

The spark-env configurations are:
export SCALA_HOME=$SCALA_HOME
export SPARK_HOME=$SPARK_HOME
export HADOOP_HOME=$HADOOP_PREFIX
export SPARK_LOCAL_DIR=/tmp/spark
export SPARK_PUBLIC_DNS=54.244.180.183
export SPARK_MASTER_IP=spark-hadoop-master-svc
export SPARK_LOCAL_IP=10.244.89.3
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export SPARK_WORKER_MEMORY=SPARK_WORKER_MEMORY
export SPARK_WORKER_CORES=SPARK_WORKER_CORES
export SPARK_MASTER_WEBUI_PORT=8181

HDFS block size: 128 MB

The resources I give to the job are as follows:
  • Cores provided: 20 per node
  • RAM provided: 38 GB per node
  • Executor RAM (GB):
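For reference, this is roughly how these resources would be requested at submit time against the standalone master. My actual spark-submit command is not shown here; the class and jar names are placeholders, 120 total cores corresponds to 20 cores x 6 workers, and the 38g executor memory is assumed to be the per-node RAM figure above.

  spark-submit \
    --master spark://spark-hadoop-master-svc:7077 \
    --class com.example.DataProcessorJob \
    --driver-memory 2g \
    --executor-memory 38g \
    --total-executor-cores 120 \
    data-processor.jar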

The job specifications are:
  • Stages: 4
  • Exception stage: the last stage
  • Locality level: NODE_LOCAL
  • Partitions: Stage 1 = 1, Stage 2 = 17, Stage 3 = 100, Stage 4 = 100
  • GC time: Stage 3 = 4 s out of 26 s (average)

Exception: org.apache.spark.shuffle.FetchFailedException: /tmp/spark-ce0055f6-a7de-4a37-88ec-ec1fba7ac0a8/executor-685cecb2-6f71-4c67-a510-502a1b3ba890/blockmgr-c4d6ec5a-bd42-400f-a5c9-a5d95d515027/32/shuffle_2_0_0.index (No such file or directory)

Here is the last part of the job logs:

  16/09/01 12:36:09 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on ip-10-244-50-5.us-west-2.compute.internal:43054 in memory (size: 3.3 KB, free: 27.1 GB) 
  16/09/01 12:36:09 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on ip-10-244-12-3.us-west-2.compute.internal:42626 in memory (size: 3.3 KB, free: 27.1 GB) 
  16/09/01 12:36:09 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on ip-10-244-60-4.us-west-2.compute.internal:37680 in memory (size: 3.3 KB, free: 27.1 GB) 
  16/09/01 12:36:09 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on ip-10-244-3-4.us-west-2.compute.internal:39092 in memory (size: 3.3 KB, free: 27.1 GB) 
  16/09/01 12:36:09 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on ip-10-244-57-5.us-west-2.compute.internal:41361 in memory (size: 3.3 KB, free: 27.1 GB) 
  16/09/01 12:36:09 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on ip-10-244-89-5.us-west-2.compute.internal:33069 in memory (size: 3.3 KB, free: 27.1 GB) 
  16/09/01 12:51:56 WARN scheduler.TaskSetManager: Lost task 84.0 in stage 3.0 (TID 286, ip-10-244-50-5.us-west-2.compute.internal): java.lang.OutOfMemoryError: GC overhead limit exceeded 
    at java.util.Calendar.<init>(Calendar.java:953) 
    at java.util.GregorianCalendar.<init>(GregorianCalendar.java:619) 
    at java.util.Calendar.createCalendar(Calendar.java:1030) 
    at java.util.Calendar.getInstance(Calendar.java:983) 
    at com.platalytics.processing.sparkClasses.BitMapper.updateMap(BitMapper.java:12) 
    at com.platalytics.processing.sparkClasses.AeroSpikeFillerMapper.call(AeroSpikeFillerMapper.java:276) 
    at com.platalytics.processing.sparkClasses.AeroSpikeFillerMapper.call(AeroSpikeFillerMapper.java:25) 
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:192) 
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:192) 
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) 
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
    at java.lang.Thread.run(Thread.java:745) 

  16/09/01 12:51:56 INFO scheduler.TaskSetManager: Starting task 84.1 in stage 3.0 (TID 302, ip-10-244-50-5.us-west-2.compute.internal, partition 84,NODE_LOCAL, 1991 bytes) 
  16/09/01 12:52:03 WARN scheduler.TaskSetManager: Lost task 84.1 in stage 3.0 (TID 302, ip-10-244-50-5.us-west-2.compute.internal): FetchFailed(BlockManagerId(3, ip-10-244-50-5.us-west-2.compute.internal, 43054), shuffleId=2, mapId=0, reduceId=84, message= 
  org.apache.spark.shuffle.FetchFailedException: /tmp/spark-ce0055f6-a7de-4a37-88ec-ec1fba7ac0a8/executor-685cecb2-6f71-4c67-a510-502a1b3ba890/blockmgr-c4d6ec5a-bd42-400f-a5c9-a5d95d515027/32/shuffle_2_0_0.index (No such file or directory) 
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) 
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) 
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) 
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) 
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) 
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) 
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152) 
    at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:45) 
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:89) 
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
    at java.lang.Thread.run(Thread.java:745) 
  Caused by: java.io.FileNotFoundException: /tmp/spark-ce0055f6-a7de-4a37-88ec-ec1fba7ac0a8/executor-685cecb2-6f71-4c67-a510-502a1b3ba890/blockmgr-c4d6ec5a-bd42-400f-a5c9-a5d95d515027/32/shuffle_2_0_0.index (No such file or directory) 
    at java.io.FileInputStream.open(Native Method) 
    at java.io.FileInputStream.<init>(FileInputStream.java:146) 
    at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:191) 
    at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:291) 
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:238) 
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:269) 
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:112) 
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:43) 
    ... 21 more) 
     16/09/01 12:52:03 INFO scheduler.DAGScheduler: Marking ResultStage 3 (saveAsTextFile at DataProcessor.java:424) as failed due to a fetch failure from ShuffleMapStage 2 (mapToPair at DataProcessor.java:409) 
  16/09/01 12:52:03 INFO scheduler.DAGScheduler: ResultStage 3 (saveAsTextFile at DataProcessor.java:424) failed in 1004.165 s 
  16/09/01 12:52:03 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 2 (mapToPair at DataProcessor.java:409) and ResultStage 3 (saveAsTextFile at DataProcessor.java:424) due to fetch failure 
  16/09/01 12:52:03 INFO scheduler.DAGScheduler: Executor lost: 3 (epoch 3) 
  16/09/01 12:52:03 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster. 
  16/09/01 12:52:03 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, ip-10-244-50-5.us-west-2.compute.internal, 43054) 
  16/09/01 12:52:03 INFO storage.BlockManagerMaster: Removed 3 successfully in removeExecutor 
  16/09/01 12:52:03 INFO scheduler.ShuffleMapStage: ShuffleMapStage 2 is now unavailable on executor 3 (83/100, false) 
  16/09/01 12:52:03 INFO scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 3 (83/100, false) 
  16/09/01 12:52:03 INFO scheduler.ShuffleMapStage: ShuffleMapStage 1 is now unavailable on executor 3 (1/2, false) 
  16/09/01 12:52:03 INFO scheduler.DAGScheduler: Resubmitting failed stages 
  16/09/01 12:52:03 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (hdfs://spark-hadoop-master-svc:9000/automated/data//data/rsegtype=cseg/region=americas/year=2014/month=01/day=01/Joined MapPartitionsRDD[6] at mapToPair at DataProcessor.java:391), which has no missing parents 
  16/09/01 12:52:03 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 6.0 KB, free 194.7 MB) 
  16/09/01 12:52:03 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 3.3 KB, free 194.7 MB) 
  16/09/01 12:52:03 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on 10.244.89.3:42082 (size: 3.3 KB, free: 1233.6 MB) 
  16/09/01 12:52:03 INFO spark.SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1006 
  16/09/01 12:52:03 INFO scheduler.DAGScheduler: Submitting 17 missing tasks from ShuffleMapStage 0 (hdfs://spark-hadoop-master-svc:9000/automated/data//data/rsegtype=cseg/region=americas/year=2014/month=01/day=01/Joined MapPartitionsRDD[6] at mapToPair at DataProcessor.java:391) 
  16/09/01 12:52:03 INFO scheduler.TaskSchedulerImpl: Adding task set 0.1 with 17 tasks 
  16/09/01 12:52:03 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 1 (hdfs://spark-hadoop-master-svc:9000/automated/data//data/rsegtype=cseg/region=americas/year=2014/month=01/day=01/static file rdd MapPartitionsRDD[4] at mapToPair at DataProcessor.java:384), which has no missing parents 
  16/09/01 12:52:03 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.1 (TID 303, ip-10-244-57-5.us-west-2.compute.internal, partition 2,NODE_LOCAL, 2336 bytes) 
  16/09/01 12:52:03 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.1 (TID 304, ip-10-244-60-4.us-west-2.compute.internal, partition 9,NODE_LOCAL, 2336 bytes) 
  16/09/01 12:52:03 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.1 (TID 305, ip-10-244-89-5.us-
  16/09/01 12:52:03 INFO scheduler.TaskSchedulerImpl: Adding task set 1.1 with 1 tasks 
  16/09/01 12:52:03 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.1 (TID 320, ip-10-244-3-4.us-west-2.compute.internal, partition 1,NODE_LOCAL, 2246 bytes) 
  16/09/01 12:52:07 WARN server.TransportChannelHandler: Exception in connection from ip-10-244-50-5.us-west-2.compute.internal/10.244.50.5:44794 
  java.io.IOException: Connection reset by peer 
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) 
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) 
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) 
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) 
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384) 
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313) 
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) 
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242) 
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) 
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) 
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) 
    at java.lang.Thread.run(Thread.java:745) 
  16/09/01 12:52:07 ERROR scheduler.TaskSchedulerImpl: Lost executor 3 on ip-10-244-50-5.us-west-2.compute.internal: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
  16/09/01 12:52:07 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160901123250-0010/3 is now EXITED (Command exited with code 52) 
  16/09/01 12:52:07 INFO cluster.SparkDeploySchedulerBackend: Executor app-20160901123250-0010/3 removed: Command exited with code 52 
  16/09/01 12:52:07 WARN scheduler.TaskSetManager: Lost task 60.0 in stage 3.0 (TID 262, ip-10-244-50-5.us-west-2.compute.internal): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
  16/09/01 12:52:07 WARN scheduler.TaskSetManager: Lost task 96.0 in stage 3.0 (TID 298, ip-10-244-50-5.us-west-2.compute.internal): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
  16/09/01 12:52:07 WARN scheduler.TaskSetManager: Lost task 42.0 in stage 3.0 (TID 244, ip-10-244-50-5.us-west-2.compute.internal): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
  16/09/01 12:52:07 WARN scheduler.TaskSetManager: Lost task 24.0 in stage 3.0 (TID 226, ip-10-244-50-5.us-west-2.compute.internal): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
  16/09/01 12:52:07 WARN scheduler.TaskSetManager: Lost task 78.0 in stage 3.0 (TID 280, ip-10-244-50-5.us-west-2.compute.internal): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
  16/09/01 12:52:07 WARN scheduler.TaskSetManager: Lost task 54.0 in stage 3.0 (TID 256, ip-10-244-50-5.us-west-2.compute.internal): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
  16/09/01 12:52:07 WARN scheduler.TaskSetManager: Lost task 90.0 in stage 3.0 (TID 292, ip-10-244-50-5.us-west-2.compute.internal): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
  16/09/01 12:52:07 WARN scheduler.TaskSetManager: Lost task 72.0 in stage 3.0 (TID 274, ip-10-244-50-5.us-west-2.compute.internal): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
  16/09/01 12:52:07 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 202, ip-10-244-50-5.us-west-2.compute.internal): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
  16/09/01 12:52:07 WARN scheduler.TaskSetManager: Lost task 18.0 in stage 3.0 (TID 220, ip-10-244-50-5.us-west-2.compute.internal): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
  16/09/01 12:52:07 WARN scheduler.TaskSetManager: Lost task 36.0 in stage 3.0 (TID 238, ip-10-244-50-5.us-west-2.compute.internal): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
  16/09/01 12:52:07 WARN scheduler.TaskSetManager: Lost task 48.0 in stage 3.0 (TID 250, ip-10-244-50-5.us-west-2.compute.internal): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.  
  16/09/01 12:52:07 WARN scheduler.TaskSetManager: Lost task 9.0 in stage 0.1 (TID 312, ip-10-244-50-5.us-west-2.compute.internal): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
  16/09/01 12:52:07 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 0.1 (TID 306, ip-10-244-50-5.us-west-2.compute.internal): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
  16/09/01 12:52:07 INFO cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 3 
  16/09/01 12:52:07 INFO client.AppClient$ClientEndpoint: Executor added: app-20160901123250-0010/6 on worker-20160831102807-10.244.50.5-40034 (10.244.50.5:40034) with 20 cores 
  16/09/01 12:52:07 INFO scheduler.DAGScheduler: Executor lost: 3 (epoch 7) 
  16/09/01 12:52:07 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster. 
  16/09/01 12:52:07 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160901123250-0010/6 on hostPort 10.244.50.5:40034 with 20 cores, 38.0 GB RAM 
  16/09/01 12:52:07 INFO storage.BlockManagerMaster: Removed 3 successfully in removeExecutor 
  16/09/01 12:52:07 INFO scheduler.TaskSetManager: Starting task 3.1 in stage 0.1 (TID 321, ip-10-244-12-3.us-west-2.compute.internal, partition 21,NODE_LOCAL, 2336 bytes) 
  16/09/01 12:52:07 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160901123250-0010/6 is now RUNNING 
  16/09/01 12:52:07 INFO scheduler.TaskSetManager: Starting task 9.1 in stage 0.1 (TID 322, ip-10-244-89-5.us-west-2.compute.internal, partition 57,NODE_LOCAL, 2336 bytes) 
  16/09/01 12:52:10 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-244-50-5.us-west-2.compute.internal:58246) with ID 6 
  16/09/01 12:52:10 INFO scheduler.TaskSetManager: Starting task 15.1 in stage 0.1 (TID 323, ip-10-244-50-5.us-west-2.compute.internal, partition 93,NODE_LOCAL, 2336 bytes) 
  16/09/01 12:52:10 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-10-244-50-5.us-west-2.compute.internal:36844 with 27.1 GB RAM, BlockManagerId(6, ip-10-244-50-5.us-west-2.compute.internal, 36844) 
  16/09/01 12:52:15 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on ip-10-244-12-3.us-west-2.compute.internal:42626 (size: 3.3 KB, free: 27.1 GB) 
  16/09/01 12:52:39 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on ip-10-244-50-5.us-west-2.compute.internal:36844 (size: 3.3 KB, free: 27.1 GB) 
1223.028: [GC1223.078: [SoftReference, 0 refs, 0.0000760 secs]1223.078: [WeakReference, 267 refs, 0.0000480 secs]1223.078: [FinalReference, 633 refs, 0.0005880 secs]1223.078: [PhantomReference, 0 refs, 57 refs, 0.0000380 secs]1223.078: [JNI Weak Reference, 0.0000210 secs]AdaptiveSizePolicy::compute_survivor_space_size_and_thresh:  survived: 70949016  promoted: 20233616  overflow: falseAdaptiveSizeStart: 1223.080 collection: 7 
  avg_survived_padded_avg: 303680288.000000  avg_promoted_padded_avg: 181112096.000000  avg_pretenured_padded_avg: 0.000000  tenuring_thresh: 4  target_size: 238551040 
  AdaptiveSizePolicy::compute_generation_free_space limits: desired_eden_size: 478150656 old_eden_size: 537919488 eden_limit: 388497408 cur_eden: 239075328 max_eden_size: 388497408 avg_young_live: 75442184PS            AdaptiveSizePolicy::compute_generation_free_space: costs minor_time: 0.036744 major_cost: 0.000000 mutator_cost: 0.963256 throughput_goal: 0.990000 live_space: 397243136 free_space: 1075838976 old_promo_size: 537919488 old_eden_size: 537919488 desired_promo_size: 537919488 desired_eden_size: 388497408 
   AdaptiveSizePolicy::survivor space sizes: collection: 7 (238551040, 89128960) -> (238551040, 238551040) 
  AdaptiveSizeStop: collection: 7 [PSYoungGen: 320499K->69286K(466432K)] 733980K->502526K(1864704K), 0.0519680 secs] [Times: user=0.47 sys=0.02, real=0.05 secs] 

Problem Statement:
The same job runs successfully on C3 8xLarge with the following resources given to it: 16 cores and 118.8 GB RAM per node (6 nodes in total), using Cloudera services. I want to move this job to the Docker environment and run it on C3 4xLarge machines, since the C3 8xLarge machines have a lot of unused resources that cost me too much.
But the job is failing in the Docker environment.
Afterwards, I tried running the job on different machines in separate environments. The job failed every time, so I added a new stage that uncompresses the data downloaded from S3, saves it to HDFS, and then creates an RDD from it. Sometimes the job failed, sometimes it ran successfully. In the end, I tried running it on C3 4xLarge: I started the job with very few resources and gradually increased them. The job ran successfully when I gave it fewer resources but failed when I gave it more. It took me a lot of time to get this job to run successfully in the Docker environment. Afterwards, I tried running it on C3 8xLarge machines, and the job succeeded with up to 80% of the resources allocated (previously it failed just above 65%), but the original job, which creates an RDD directly from the compressed data files, fails every time.
Is this a generic issue that can be fixed, or is it specific to this job? How can I avoid this issue in the future? How can I tune my configuration to utilize the maximum resources?

What I have done so far:

  1. Changed the RAM and cores for the job.
  2. Increased the driver memory for the job.
  3. Changed the permissions of the /tmp/spark folder.
  4. Changed the spark.local.dir parameter in the configuration files, since /tmp is sometimes mounted in memory, which could cause insufficient space (see the sketch after this list).
  5. Checked the logs for any earlier error, such as "too many open files", which could explain the missing shuffle files.
  6. Restarted Spark.
  7. Provided resources in proportion to the existing cluster where the job currently runs.
  8. Tested the other components of the job (HDFS, Aerospike) on the cluster individually.
  9. Checked the space on the drives, which is enough for the job to create temporary files.
  10. Tried running the command with sudo.
  11. Cleaned up the leftover /tmp/spark* directories so they do not interfere with the job.
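As a concrete example of point 4 (the paths and the image name below are placeholders, not my exact setup): point spark.local.dir at a disk-backed directory and back it with a host volume, so the shuffle files do not land on an in-memory /tmp inside the container.

  # spark-defaults.conf (or set SPARK_LOCAL_DIRS in spark-env.sh)
  spark.local.dir /data/spark-local

  # start the worker container with a host directory mounted at that path
  docker run -d -v /mnt/spark-local:/data/spark-local my-spark-worker-image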

What I believe:
After my time on this job, I have come to the conclusion that Docker also needs some resources to run its own services. When I give more resources to the job, Docker does not get enough and the shuffle exception occurs, but when I run the same job with fewer resources it completes successfully (a sketch of what I mean by leaving headroom for Docker is shown after the stats below). Here are the stats of the job I ran on C3 4xLarge in both the Docker and the manual (Ubuntu) environments:

1st Attempt (Docker vs Ubuntu):
Services: Hadoop 2.6.4 vs Hadoop 2.6.4
Data nodes: 6 vs 6
Spark: 1.6 vs 1.6
Workers: 6 vs 6
Resources: 15 GB RAM per node out of 23 (65%) and 18 GB RAM per node out of 28 (65%) vs 70 cores out of 96 (72%) and 70 cores out of 96 (72%)
Max GC time: 25 s vs 10 s
Total time: 1 hr 1 min vs 49 min
Status: Successful vs Successful

2nd Attempt (Docker vs Ubuntu):
Services: Hadoop 2.6.4 vs Hadoop 2.6.4
Data nodes: 6 vs 6
Spark: 1.6 vs 1.6
Workers: 6 vs 6
Resources: 16 GB RAM per node out of 23 (69%) and 20 GB RAM per node out of 28 (72%) vs 70 cores out of 96 (72%) and 70 cores out of 96 (72%)
Max GC time: ~13 s
Total time: ~43 min
Status: Successful with warnings and exceptions vs Successful

3rd Attempt (Docker vs Ubuntu):
Services: Hadoop 2.6.4 vs Hadoop 2.6.4
Data nodes: 6 vs 6
Spark: 1.6 vs 1.6
Workers: 6 vs 6
Resources: 18 GB RAM per node out of 23 (78%) and 25 GB RAM per node out of 28 (90%) vs 70 cores out of 96 (72%) and 70 cores out of 96 (72%)
Status: Failure due to MetadataFetchFailedException vs Successful with exceptions
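To illustrate what I mean by leaving headroom for Docker and the OS (the memory limits, core range, and image name below are placeholders rather than my actual setup): the worker container can be capped explicitly, and the Spark worker kept below that cap.

  # cap the container below the host's total RAM and cores, and keep the
  # Spark worker's allocation below the container cap
  docker run -d \
    -m 26g --memory-swap 26g \
    --cpuset-cpus 0-13 \
    -e SPARK_WORKER_MEMORY=22g \
    -e SPARK_WORKER_CORES=14 \
    my-spark-worker-image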