Spark Processing Internals

0 votes
Hi Team,

I would like to know when a job is submitted to spark what is the process details that follows. I mean how the Driver submits tasks to executors and how the executors send a response that they are alive to the driver and moreover what is the fault tolerance method in case the Executor fails. The overall details of spark processing in depth
Jul 15, 2019 in Apache Spark by John
2,024 views

1 answer to this question.

0 votes

Spark cluster components

Spark uses a master/slave architecture. As you can see in the figure, it has one central coordinator (Driver) that communicates with many distributed workers (executors). The driver and each of the executors run in their own Java processes.

DRIVER

The driver is the process where the Main method runs. First, it converts the user program into tasks and after that, it schedules the tasks on the executors.

EXECUTORS

Executors are worker nodes' processes in charge of running individual tasks in a given Spark job. They are launched at the beginning of a Spark application and typically run for the entire lifetime of an application. Once they have run the task they send the results to the driver. They also provide in-memory storage for RDDs that are cached by user programs through Block Manager.

APPLICATION EXECUTION FLOW

With this in mind, when you submit an application to the cluster with spark-submit this is what happens internally:

=> A standalone application starts and instantiates a SparkContext instance (and it is only then when you can call the application a driver).

=> The driver program ask for resources to the cluster manager to launch executors.

=> The cluster manager launches executors.

=> The driver process runs through the user application. Depending on the actions and transformations over RDDs tasks are sent to executors.

=> Executors run the tasks and save the results.

=> If any worker crashes, its tasks will be sent to different executors to be processed again.

=> With SparkContext.stop() from the driver or if the main method exits/crashes all the executors will be terminated and the cluster resources will be released by the cluster manager.

In case of failures,

Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. For example, if the node running a partition of a map() operation crashes, Spark will rerun it on another node; and even if the node does not crash but is simply much slower than other nodes, Spark can preemptively launch a “speculative” copy of the task on another node, and take its result if that finishes.

Executors report heartbeat and partial metrics for active tasks to HeartbeatReceiver on the driver. By default, Interval after which an executor reports heartbeat and metrics for active tasks to the driver. is 10 seconds.

I hope this helps. Jimmy

answered Jul 15, 2019 by Jimmy

edited Jun 9, 2020 by MD

Related Questions In Apache Spark

0 votes
1 answer

Do real-time data processing is possible with Spark SQL?

Hey, Real-time data processing is not possible directly ...READ MORE

answered Jul 5, 2019 in Apache Spark by Gitika
• 65,770 points
1,453 views
0 votes
1 answer

Spark memory processing on a not temporary table

Temporary table is more like an index ...READ MORE

answered Jul 14, 2019 in Apache Spark by Suri
1,540 views
0 votes
1 answer

Is Spark Sql provides indexing to improve processing speed?

Hi@akhtar, There is no concept of indexing in ...READ MORE

answered Feb 4, 2020 in Apache Spark by MD
• 95,460 points
911 views
0 votes
1 answer
+1 vote
2 answers
+1 vote
1 answer

Hadoop Mapreduce word count Program

Firstly you need to understand the concept ...READ MORE

answered Mar 16, 2018 in Data Analytics by nitinrawat895
• 11,380 points
11,029 views
0 votes
1 answer

hadoop.mapred vs hadoop.mapreduce?

org.apache.hadoop.mapred is the Old API  org.apache.hadoop.mapreduce is the ...READ MORE

answered Mar 16, 2018 in Data Analytics by nitinrawat895
• 11,380 points
2,536 views
+2 votes
11 answers

hadoop fs -put command?

Hi, You can create one directory in HDFS ...READ MORE

answered Mar 16, 2018 in Big Data Hadoop by nitinrawat895
• 11,380 points
108,832 views
0 votes
1 answer

In what kind of use cases has Spark outperformed Hadoop in processing?

I can list some but there can ...READ MORE

answered Sep 19, 2018 in Apache Spark by zombie
• 3,790 points
1,128 views
+5 votes
11 answers

Concatenate columns in apache spark dataframe

its late but this how you can ...READ MORE

answered Mar 21, 2019 in Apache Spark by anonymous
72,378 views
webinar REGISTER FOR FREE WEBINAR X
REGISTER NOW
webinar_success Thank you for registering Join Edureka Meetup community for 100+ Free Webinars each month JOIN MEETUP GROUP