How can I calculate exact median with Apache Spark

0 votes

This page contains some statistics functions (mean, stdev, variance, etc.) but it does not contain the median. How can I calculate exact median?

Oct 8, 2018 in Big Data Hadoop by slayer
• 29,370 points
4,414 views

1 answer to this question.

0 votes

You need to sort RDD and take element in the middle or average of two elements. Here is example with RDD[Int]:

  import org.apache.spark.SparkContext._

  val rdd: RDD[Int] = ???

  val sorted = rdd.sortBy(identity).zipWithIndex().map {
    case (v, idx) => (idx, v)
  }

  val count = sorted.count()

  val median: Double = if (count % 2 == 0) {
    val l = count / 2 - 1
    val r = l + 1
    (sorted.lookup(l).head + sorted.lookup(r).head).toDouble / 2
  } else sorted.lookup(count / 2).head.toDouble
answered Oct 8, 2018 by Omkar
• 69,220 points

Related Questions In Big Data Hadoop

0 votes
0 answers

How I can kill the jobs using jobID running in local mode with Hadoop

I am Running hadoop jobs in local ...READ MORE

Aug 26, 2020 in Big Data Hadoop by kamboj
• 140 points
2,302 views
0 votes
1 answer

How can I import data from mysql to hive tables with incremental data?

Hi@dharmendra, It is common to perform one-time ingestion ...READ MORE

answered Nov 23, 2020 in Big Data Hadoop by MD
• 95,460 points
1,290 views
0 votes
1 answer

How can I download only hdfs and not hadoop?

No, you cannot download HDFS alone because ...READ MORE

answered Mar 16, 2018 in Big Data Hadoop by nitinrawat895
• 11,380 points
1,542 views
0 votes
1 answer

How can I download hadoop documentation for a specific version?

You can go through this SVN link:- ...READ MORE

answered Mar 22, 2018 in Big Data Hadoop by Shubham
• 13,490 points
889 views
0 votes
1 answer

How can I get the respective Bitcoin value for an input in USD when using c#

Simply make call to server and parse ...READ MORE

answered Mar 25, 2018 in Big Data Hadoop by charlie_brown
• 7,720 points
1,048 views
0 votes
1 answer

How do I connect my Spark based HDInsight cluster to my blob storage?

Go through this blog: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage#access-blobs I went through this ...READ MORE

answered Apr 15, 2018 in Big Data Hadoop by Shubham
• 13,490 points
2,097 views
0 votes
1 answer

How can I put file to HDFS directly without copying it local disk?

Can use pipe from wget to hdfs. You ...READ MORE

answered Apr 15, 2018 in Big Data Hadoop by kurt_cobain
• 9,350 points
4,601 views
0 votes
2 answers

How can I list NameNode & DataNodes from any machine in the Hadoop cluster?

You can browse hadoop page from any ...READ MORE

answered Jan 23, 2020 in Big Data Hadoop by MD
• 95,460 points
11,638 views
0 votes
1 answer

Where can I find logs in Spark on YARN?

You can access logs through the command yarn ...READ MORE

answered Nov 8, 2018 in Big Data Hadoop by Omkar
• 69,220 points
1,021 views
0 votes
1 answer

In Hadoop MapReduce, how can i set an Object as the Value for Map output?

Try this and see if it works: public ...READ MORE

answered Nov 21, 2018 in Big Data Hadoop by Omkar
• 69,220 points
975 views
webinar REGISTER FOR FREE WEBINAR X
REGISTER NOW
webinar_success Thank you for registering Join Edureka Meetup community for 100+ Free Webinars each month JOIN MEETUP GROUP