Dataset vs Dataframe

Big Data processing in Spark relies heavily on Datasets and DataFrames. It is important to understand how they work and how they differ.

To create a Dataset, we import the org.apache.spark.sql.Dataset class.

import org.apache.spark.sql.Dataset;

Now let's create a SparkSession.

SparkSession spark = SparkSession.builder()
                .appName("Array to Dataset<String>")
                .master("local[*]") // assumption: run locally; set the master for your cluster
                .getOrCreate();

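A Dataset is parameterized by its element type, which can be a built-in type or a user-defined type. When the element type is Row, that is, Dataset<Row>, it is referred to as a DataFrame. When it is declared with any other type, such as a user-defined class, it is simply called a Dataset. In the Java API there is no separate DataFrame class; a DataFrame is written as Dataset<Row>.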
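The distinction can be sketched in Java as follows. This is a minimal illustration, not the original author's code: the class name DatasetVsDataFrame, the helper methods typed and untyped, the sample words, and the local[*] master are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DatasetVsDataFrame {

    // Typed view: every element is a String, checked at compile time.
    static Dataset<String> typed(SparkSession spark, List<String> words) {
        return spark.createDataset(words, Encoders.STRING());
    }

    // Untyped view: the same data as generic Row objects, i.e. a DataFrame.
    static Dataset<Row> untyped(Dataset<String> ds) {
        return ds.toDF();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Dataset vs DataFrame")
                .master("local[*]") // assumption: local run for illustration
                .getOrCreate();

        // Hypothetical sample data.
        Dataset<String> ds = typed(spark, Arrays.asList("alpha", "beta"));
        Dataset<Row> df = untyped(ds);

        // Typed API: the function receives a String, so String::length compiles.
        Dataset<Integer> lengths =
                ds.map((MapFunction<String, Integer>) String::length, Encoders.INT());
        lengths.show();

        // Untyped API: columns are addressed by name and resolved at runtime.
        df.select("value").show();

        spark.stop();
    }
}
```

With the typed Dataset, a mistake such as calling a method that String does not have fails at compile time. With Dataset<Row>, a misspelled column name is only detected when Spark analyzes the query at runtime.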

The code below demonstrates declaring a Dataset and a DataFrame.

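Dataset of student names, encoded as String

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Encoders;

// array of student names
String[] studentNames = new String[] {
    "Shashank J",
    "Shashank K L",
};

// create a List from the array; the elements are plain Strings
List<String> data = Arrays.asList(studentNames);

// Encoders.STRING() matches the String element type, so the result is a Dataset<String>
Dataset<String> ds = spark.createDataset(data, Encoders.STRING());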
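For a genuinely user-defined element type such as Student, a bean encoder can take the place of Encoders.STRING(). The sketch below assumes a hypothetical Student JavaBean with a single name property; the class layout, the StudentDatasetExample name, the toDataset helper, and the local[*] master are illustrative assumptions rather than part of the original code.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class StudentDatasetExample {

    // Hypothetical JavaBean. Encoders.bean relies on a public no-arg
    // constructor plus getter/setter pairs to discover the fields.
    public static class Student implements Serializable {
        private String name;

        public Student() {}

        public Student(String name) {
            this.name = name;
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }
    }

    // Builds a typed Dataset<Student> using a bean encoder.
    static Dataset<Student> toDataset(SparkSession spark, List<Student> students) {
        return spark.createDataset(students, Encoders.bean(Student.class));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Dataset<Student>")
                .master("local[*]") // assumption: local run for illustration
                .getOrCreate();

        List<Student> data = Arrays.asList(
                new Student("Shashank J"),
                new Student("Shashank K L"));

        Dataset<Student> ds = toDataset(spark, data);

        // The schema now has a "name" column derived from the bean property,
        // instead of the single "value" column of a Dataset<String>.
        ds.printSchema();
        ds.show();

        spark.stop();
    }
}
```

The encoder has to match the element type: Encoders.STRING() only fits Dataset<String>, while Encoders.bean(Student.class) produces one column per bean property.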

Now we print the schema of ds and then its contents.

ds.printSchema();
ds.show();

The schema is:

root
 |-- value: string (nullable = true)

and the contents of the Dataset are:

+--------+
|   value|
+--------+
| Bannana|
|     Car|
|   Glass|
|     Car|
+--------+