reduceByKey or reduceByKeyLocally: which should be preferred?

0 votes

Below are the definitions:

def reduceByKey(partitioner: Partitioner, func: (V, V) ⇒ V): RDD[(K, V)]


def reduceByKeyLocally(func: (V, V) ⇒ V): Map[K, V]

The two look almost identical. Which one should I prefer?

Apr 20, 2018 in Apache Spark by Ashish

1 answer to this question.

0 votes
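Yes, both merge the values for each key using an associative and commutative reduce function. The difference is in what they return: reduceByKey returns an RDD, while reduceByKeyLocally immediately returns the result to the driver (the "master" in the Spark docs) as a Map.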

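From a project perspective, reduceByKey keeps the result distributed across the cluster, because the result is itself an RDD. reduceByKeyLocally instead merges the entire output onto a single machine (the driver) as a Map. That defeats the purpose of distributed data, which is essential when working at large scale. So prefer reduceByKey in general, and use reduceByKeyLocally only when the number of distinct keys is small enough for the whole result to fit in the driver's memory.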
answered Apr 20, 2018 by kurt_cobain
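
To make the difference concrete, here is a minimal sketch that runs both calls on the same data. It is not part of the original answer: the sample pairs, the object name, the local[2] master, and the partition count of 4 are all assumptions chosen for illustration, and it assumes Spark core is on the classpath.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical demo object; the name and sample data are illustrative only.
object ReduceByKeyComparison {
  def main(args: Array[String]): Unit = {
    // Local mode with two threads, assumed here so the sketch runs without a cluster.
    val conf = new SparkConf()
      .setAppName("reduce-by-key-comparison")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs: RDD[(String, Int)] =
        sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)))

      // reduceByKey is a transformation: the result is still an RDD,
      // partitioned across the cluster (4 partitions is an arbitrary choice).
      val distributed: RDD[(String, Int)] =
        pairs.reduceByKey(new HashPartitioner(4), _ + _)

      // reduceByKeyLocally is an action: the whole result is brought
      // to the driver as an in-memory Map.
      val local: scala.collection.Map[String, Int] =
        pairs.reduceByKeyLocally(_ + _)

      // Both compute the same per-key sums; only the location of the result differs.
      val expected = Map("a" -> 4, "b" -> 6, "c" -> 5)
      assert(local.toMap == expected)
      assert(distributed.collectAsMap().toMap == expected)
    } finally {
      sc.stop()
    }
  }
}
```

Note that the result of reduceByKey can feed further distributed transformations (join, filter, saveAsTextFile, and so on) without ever being gathered on one machine, whereas the Map from reduceByKeyLocally lives only in the driver's memory.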
