Why do we use sc.parallelize?

0 votes

Could you please clarify: if an RDD is already distributed over the nodes of a cluster and is acted upon in parallel, what is the use of parallelize? Why do we use sc.parallelize?

Jul 11, 2019 in Apache Spark by Sumit
13,474 views

1 answer to this question.

0 votes

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.

Now, this RDD creation can be done in two ways:

The first is to refer to an external dataset present in HDFS or the local file system, i.e.,

sc.textFile("/user/edureka_425640/patient_records.txt")

The second is to parallelize an existing collection using sc.parallelize. This API helps load user-created data that does not necessarily come from a file or directory:

val data = Array(1, 2, 3, 4, 5)      // a local collection living on the driver
val distData = sc.parallelize(data)  // distributed across the cluster as an RDD
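To see that the parallelized collection is then operated on in parallel (a small sketch; the sum variable and the explicit partition count are just for illustration):

// reduce is an action executed in parallel across the RDD's partitions
val sum = distData.reduce((a, b) => a + b)   // 15

// optionally, pass the number of partitions yourself
val distData10 = sc.parallelize(data, 10)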

So, when we use sc.parallelize, we are simply using one of the two ways of creating an RDD: it takes a local collection that exists only on the driver and distributes it across the cluster so it can be processed in parallel.

answered Jul 11, 2019 by Suman
