Dataset vs Dataframe
To perform Big Data processing we often use Dataset and Dataframes a lot. It is important to understand how they work and what is the point of difference between them.
To create a Dataset we import org.apache.spark.sql.Dataset
class.
import org.apache.spark.sql.Dataset;
Now let’s create a spark session
.
SparkSession spark = new SparkSession.Builder()
.appName("Array to Dataset<String>")
.master("local")
.getOrCreate();
The Dataset can take user defined types as well. If we declare Dataset
with Row
then it is referred to as Dataframe
. If we try to declare Dataset
with user-defined data type then it is named as Dataset
.
The below code demonstrates declaring Dataset
and Dataframe
.
/*
----------------------------------------
Dataset with user-defined type Student
----------------------------------------
*/
// array of student names
String[] studentNames = new String[] {
"Shashank J",
"Shashank K L",
"Superman",
"Batman",
"Purushotham"
};
// create a collection List
List<Student> data = Arrays.asList(studentNames);
Dataset<Student> ds = spark.createDataset (data, Encoders.STRING());
Now we try to print what’s stored in ds
and also print its schema
.
ds.printSchema();
ouput
root
|-- value: string (nullable = true)
and the contents of the Dataset are:
+--------+
| value|
+--------+
| Bannana|
| Car|
| Glass|
|Computer|
| Car|
+--------+