Why is collect in SparkR slow

I have a huge dataframe which has more than 300k rows. The data is in a parquet file. My machine has 4 cores and 8GB of RAM.

R version 3.0 and spark 2.0. To bring dataset into R, I used the collect() function. It took around 3-4 mins for the data to be loaded into R.

Does the collect method usually take this much time?

May 3, 2018 in Apache Spark by shams
• 3,670 points • 4,187 views

1 answer to this question.

It's not the collect() that is slow. Actually, Spark works on the principle of Lazy evaluations, ie. all the transformations are done in a DAG basis and the actions (here it's the collect()) is done at last using the original data, so that's why it might take time.

But having a 300K row data will take some time in loading.

answered May 3, 2018 by Data_Nerd
• 2,390 points

Related Questions In Apache Spark

0 votes

1 answer

How to use yield keyword in scala and why it is used instead of println?

Hi, The yield keyword is used because the ...READ MORE

answered Jul 6, 2019 in Apache Spark by Gitika
• 65,730 points • 2,933 views

0 votes

1 answer

Why is Spark faster than Hadoop Map Reduce

Firstly, it's the In-memory computation, if the file ...READ MORE

answered Apr 30, 2018 in Apache Spark by shams
• 3,670 points • 2,568 views

+1 vote

1 answer

Can anyone explain what is RDD in Spark?

RDD is a fundamental data structure of ...READ MORE

answered May 24, 2018 in Apache Spark by Shubham
• 13,490 points • 4,010 views

0 votes

1 answer

Spark 2.3? What is new in it?

Here are the changes in new version ...READ MORE

answered May 28, 2018 in Apache Spark by kurt_cobain
• 9,350 points • 1,764 views

+1 vote

1 answer

is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [51, 53, 10, 10]

Hi@akhtar, Here you are trying to read a ...READ MORE

answered Feb 3, 2020 in Apache Spark by MD
• 95,460 points • 21,105 views

0 votes

1 answer

What do we exactly mean by “Hadoop” – the definition of Hadoop?

The official definition of Apache Hadoop given ...READ MORE

answered Mar 16, 2018 in Big Data Hadoop by Shubham
• 3,676 views

+1 vote

1 answer

I installed Spark but while executing command, I am getting ‘hadoop’ command not found error?

For accessing Hadoop commands & HDFS, you ...READ MORE

answered Mar 21, 2018 in Big Data Hadoop by Shubham
• 13,490 points • 4,219 views

0 votes

3 answers

Can we run Spark without using Hadoop?

No, you can run spark without hadoop. ...READ MORE

answered May 7, 2019 in Big Data Hadoop by pradeep
• 4,405 views

0 votes

1 answer

cache tables in apache spark sql

Caching the tables puts the whole table ...READ MORE

answered May 4, 2018 in Apache Spark by Data_Nerd
• 2,390 points • 4,778 views

0 votes

1 answer

Is it possible to run Spark and Mesos along with Hadoop?

Yes, it is possible to run Spark ...READ MORE

answered May 29, 2018 in Apache Spark by Data_Nerd
• 2,390 points • 1,946 views

Subscribe to our Newsletter, and get personalized recommendations.

REGISTER FOR FREE WEBINAR

Thank you for registering Join Edureka Meetup community for 100+ Free Webinars each month JOIN MEETUP GROUP