Dropping the unnecessary columns from the dataframe that was constructed from durham-parks.json:

df = df
    .drop("fields")
    .drop("geometry")
    .drop("record_timestamp")
    .drop("recordid")
    .drop("datasetid");

The dataframe now contains the following data.

Dropping some of the columns in philadelphia_recreations.csv:

.drop("SITE_NAME")
.drop("OBJECTID")
.drop("CHILD_OF")
.drop("TYPE")
.drop("USE_")
.drop("DESCRIPTION")
.drop("SQ_FEET")
.drop("ALLIAS")
.drop("CHRONOLOGY")
.drop("NOTES")
.drop("DATE_EDITED")
.drop("EDITED_BY")
.drop("OCCUPANT")
.drop("TENANT")
.drop("LABEL")

The refined dataframe is shown below.

For clarity, we have changed some of the column names. The Java code that performs these transformations is shown below.

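        // Needs: import static org.apache.spark.sql.functions.concat;
        //        import static org.apache.spark.sql.functions.lit;
        Dataset<Row> df = spark.read().format("csv").option("multiline", true)
                .option("header", true)
                .load("src/main/resources/philadelphia_recreations.csv");

        // Keep only the rows whose USE_ value mentions "park", in any letter case.
        // Column-API equivalent: df.filter(lower(df.col("USE_")).like("%park%"))
        df = df.filter("lower(USE_) like '%park%'");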

        df = df.withColumn("park_id", concat(lit("phil_"), df.col("OBJECTID")))
                .withColumnRenamed("ASSET_NAME", "park_name")
                .withColumn("city", lit("Philadelphia"))
                .withColumnRenamed("ADDRESS", "address")
                .withColumn("has_playground", lit("UNKNOWN"))
                .withColumnRenamed("ZIPCODE", "zipcode")
                .withColumnRenamed("ACREAGE", "land_in_acres")
                .withColumn("geoX", lit("UNKNOWN"))
                .withColumn("geoY", lit("UNKNOWN"))
                .drop("SITE_NAME")
                .drop("OBJECTID")
                .drop("CHILD_OF")
                .drop("TYPE")
                .drop("USE_")
                .drop("DESCRIPTION")
                .drop("SQ_FEET")
                .drop("ALLIAS")
                .drop("CHRONOLOGY")
                .drop("NOTES")
                .drop("DATE_EDITED")
                .drop("EDITED_BY")
                .drop("OCCUPANT")
                .drop("TENANT")
                .drop("LABEL");

        return df;
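
To make the filter step more concrete, here is a minimal plain-Java sketch of what the expression lower(USE_) like '%park%' checks, written without Spark so it can be run on its own. The class name ParkUseFilter, its method names, and the sample values are hypothetical and are not part of the original code. The sketch assumes plain ASCII text in the USE_ column.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original code: it mirrors the SQL
// expression lower(USE_) like '%park%' in plain Java.
final class ParkUseFilter {

    // True when the value contains "park" in any letter case.
    // A null value never matches, just as a row with a NULL USE_ is not kept by the SQL filter.
    static boolean mentionsPark(String use) {
        return use != null && use.toLowerCase(Locale.ROOT).contains("park");
    }

    // Keeps only the values that pass mentionsPark, preserving their order.
    static List<String> keepParks(List<String> uses) {
        return uses.stream()
                .filter(ParkUseFilter::mentionsPark)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        List<String> uses = List.of("Park", "RECREATION CENTER", "Neighborhood PARK");
        System.out.println(keepParks(uses));
    }
}
```

Because this is a substring match, a value such as "Parking Lot" would also pass the filter. If that is not wanted, a stricter pattern would be needed.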
