Spark Dataframe vs Dataset

+1 vote
What is the difference between a DataFrame and a Dataset, and which one should we prefer in our project?
Jul 29, 2019 in Apache Spark by Jesse
45,899 views

2 answers to this question.

+1 vote
Best answer

Apache Spark offers two relatively new data abstractions, DataFrames and Datasets. It can be hard to see where each one fits, and just as hard to decide which one to use.

DataFrames

A DataFrame is an abstraction that gives a schema view of the data. The data is organized into columns, each with a name and type information. In other words, the data in a DataFrame looks much like a table in a relational database.

Like RDDs, DataFrames are evaluated lazily: execution is only triggered by an action. A DataFrame is a distributed collection of data, structured so that it can be processed efficiently, and Spark runs DataFrame queries through the Catalyst optimizer.
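
To make the description above concrete, here is a minimal Scala sketch. It assumes a local SparkSession; the object name, the `people` helper, the column names, and the sample rows are all made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameBasics {
  // Hypothetical helper: builds a tiny DataFrame with two named, typed columns.
  def people(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Alice", 34), ("Bob", 19)).toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataframe-basics")
      .getOrCreate()

    val df = people(spark)
    // Prints the column names and types, much like a table definition.
    df.printSchema()
    // A transformation only describes the work; nothing is computed yet.
    val adults = df.filter(df("age") >= 21)
    // An action such as show() triggers the actual execution.
    adults.show()

    spark.stop()
  }
}
```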

DataSets

In Spark, Datasets are an extension of DataFrames. The Dataset API comes in two flavours: a strongly typed API and an untyped API. A Dataset is by default a collection of strongly typed JVM objects, whereas a DataFrame is untyped (since Spark 2.0, a DataFrame is simply a Dataset[Row]). Datasets also benefit from Spark’s Catalyst optimizer by exposing expressions and data fields to the query planner.

Now let's look at the differences between the two, feature by feature:

Spark Release

DataFrame- DataFrames were introduced in Spark 1.3.

whereas,

DataSets- Datasets were introduced in Spark 1.6.

Data Formats

DataFrame- DataFrames organize data into named columns. They can efficiently process structured and semi-structured data, and they let Spark manage the schema.

whereas,

DataSets- Like DataFrames, Datasets efficiently process structured and semi-structured data. They represent data as a collection of Row objects or typed JVM objects, which encoders map to a tabular form.

Data Representation

DataFrame- In a DataFrame, data is organized into named columns, just like a table in a relational database.

whereas,

DataSets- As noted, the Dataset API is an extension of the DataFrame API. It provides the type-safe, object-oriented programming interface of the RDD API together with the performance benefits of the Catalyst query optimizer.

Compile-time type safety

DataFrame- If we try to access a column that does not exist in the table, the DataFrame API cannot report it at compile time; the error only shows up at runtime.

whereas,

DataSets- Datasets offer compile-time type safety.
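
The difference is easiest to see side by side. The sketch below is only an illustration: the `Person` case class, the object name, and the misspelled column `agee` are invented for the example, and it assumes the classic (non-Connect) Spark API, where a DataFrame query is analyzed as soon as it is built.

```scala
import org.apache.spark.sql.{AnalysisException, Dataset, SparkSession}
import scala.util.Try

// Hypothetical domain class used only for this example.
case class Person(name: String, age: Int)

object TypeSafetyDemo {
  def people(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    Seq(Person("Alice", 34), Person("Bob", 19)).toDS()
  }

  // DataFrame: the misspelled column name compiles fine and is only
  // rejected when Spark analyzes the query at runtime.
  def misspelledColumnFailsAtRuntime(spark: SparkSession): Boolean = {
    val df = people(spark).toDF()
    Try(df.select("agee")).failed.toOption
      .exists(_.isInstanceOf[AnalysisException])
  }

  // Dataset: field access goes through the case class, so the Scala
  // compiler checks it.
  def adults(spark: SparkSession): Dataset[Person] = {
    val ds = people(spark)
    // ds.filter(_.agee >= 21)  // would not compile: agee is not a member of Person
    ds.filter(_.age >= 21)
  }
}
```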

Data Sources API

DataFrame- It supports data in different formats, for example Avro, CSV, and JSON, and from different storage systems such as HDFS, Hive tables, and MySQL.

whereas,

DataSets- It also supports data from different sources.

Immutability and Interoperability

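DataFrame- Once data has been transformed into a DataFrame, we cannot regenerate the original domain object. For example, a DataFrame built from an RDD of a domain class only gives back generic Row objects.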

whereas,

DataSets- Datasets overcome this drawback: we can get an RDD of the original domain objects back from a Dataset. We can also convert existing RDDs and DataFrames into Datasets.
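
A rough sketch of those conversions in Scala is shown here. The `Person` class, the object name, and the `roundTrip` helper are assumptions made for the example, not part of any Spark API.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain class used only for this example.
case class Person(name: String, age: Int)

object ConversionDemo {
  def roundTrip(spark: SparkSession): Array[Person] = {
    import spark.implicits._

    val rdd: RDD[Person] =
      spark.sparkContext.parallelize(Seq(Person("Alice", 34), Person("Bob", 19)))

    // RDD -> DataFrame: the rows are now generic Row objects (df.rdd is an RDD[Row]).
    val df: DataFrame = rdd.toDF()

    // DataFrame -> Dataset: a typed view, matched to Person by column name.
    val ds: Dataset[Person] = df.as[Person]

    // Dataset -> RDD: we get the domain objects back, not Rows.
    val back: RDD[Person] = ds.rdd

    back.collect().sortBy(_.name)
  }
}
```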

Efficiency/Memory use

DataFrame- Using off-heap memory for serialization reduces garbage-collection overhead.

whereas,

DataSets- Datasets allow operations to be performed directly on serialized data, which also improves memory usage.

Serialization

DataFrame- A DataFrame can serialize data into off-heap storage in a binary format and then perform many transformations directly on that off-heap memory.

whereas,

DataSets- The Dataset API has the concept of an encoder, which handles conversion between JVM objects and the tabular representation. The tabular representation is stored in Spark's internal Tungsten binary format, which allows operations on serialized data and improves memory usage.
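
As a small illustration of what an encoder knows, the Scala sketch below derives one for a made-up `Person` case class and inspects the tabular schema it maps to. The class and object names are assumptions for the example; `Encoders.product` is the standard way to obtain an encoder for a case class.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical domain class used only for this example.
case class Person(name: String, age: Int)

object EncoderDemo {
  // The encoder describes how Person objects map to columns.
  val personEncoder: Encoder[Person] = Encoders.product[Person]

  def main(args: Array[String]): Unit = {
    // Prints the tabular schema that corresponds to the JVM class.
    personEncoder.schema.printTreeString()
  }
}
```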

Lazy Evaluation

DataFrame- Like RDDs, DataFrames are evaluated lazily.

whereas,

DataSets- Like RDDs and DataFrames, Datasets are also evaluated lazily.
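
One way to observe lazy evaluation is to count how often a function inside a transformation is actually called. This is only a sketch: the object name and the `run` helper are invented for the example, and it uses an accumulator purely as a counter.

```scala
import org.apache.spark.sql.SparkSession

object LazyDemo {
  // Returns the counter value before and after an action is run.
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    val calls = spark.sparkContext.longAccumulator("calls")

    // map() is a transformation: it is recorded in the plan, not executed.
    val doubled = Seq(1, 2, 3).toDS().map { x => calls.add(1L); x * 2 }
    val before = calls.value.longValue()

    // collect() is an action: only now does the map function run.
    doubled.collect()
    val after = calls.value.longValue()

    (before, after)
  }
}
```

The first value should still be zero, because nothing has been executed when it is read; the second becomes positive once the action has run.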

Optimization

DataFrame- DataFrame queries are optimized by Spark's Catalyst optimizer.

whereas,

DataSets- Dataset query plans are optimized by the same Catalyst optimizer that DataFrames use.
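
If you want to see what Catalyst does with a query, `explain(true)` prints the parsed, analyzed, and optimized logical plans along with the physical plan. The query below is a made-up example, and the object and helper names are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExplainDemo {
  // Hypothetical query: a filter followed by a projection.
  def query(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq(("Alice", 34), ("Bob", 19)).toDF("name", "age")
    df.filter($"age" > 21).select("name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explain-demo")
      .getOrCreate()

    // Prints the logical plans and the physical plan chosen for the query.
    query(spark).explain(true)

    spark.stop()
  }
}
```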

Schema Projection

DataFrame- For Hive tables, the schema is discovered automatically through the Hive metastore, so we do not need to specify it manually.

whereas,

DataSets- Because Datasets use the Spark SQL engine, the schema of files is discovered automatically as well.
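
Schema discovery can be tried out without any files or a Hive setup by letting Spark infer the schema from a few JSON strings. The sample records, the object name, and the `inferred` helper are made up for this sketch.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object SchemaInferenceDemo {
  // Reads JSON records from an in-memory Dataset[String] and returns
  // the schema that Spark inferred for them.
  def inferred(spark: SparkSession): StructType = {
    import spark.implicits._
    val jsonLines = Seq(
      """{"name":"Alice","age":34}""",
      """{"name":"Bob","age":19}"""
    ).toDS()
    spark.read.json(jsonLines).schema
  }
}
```

Calling `printTreeString()` on the returned schema shows the column names and the types Spark picked for them.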

Programming Language Support

DataFrame- DataFrames are available in four languages: Java, Python, Scala, and R.

whereas,

DataSets- Only available in Scala and Java.

answered Jul 29, 2019 by Jackie

selected Dec 15, 2020 by MD
0 votes

Hi,

Spark DataFrames are organized into named columns, like a table in a relational database. A DataFrame is an immutable distributed collection of data. It lets developers impose a structure onto a distributed collection of data, allowing a higher-level abstraction.

Datasets in Apache Spark are an extension of DataFrame API which provides a type-safe, object-oriented programming interface. Dataset takes advantage of Spark’s Catalyst optimizer by exposing expressions and data fields to a query planner.

answered Dec 15, 2020 by MD
• 95,460 points
