Dropping the unnecessary columns from the dataframe that was constructed from durham-parks.json:
df = df
    .drop("fields")
    .drop("geometry")
    .drop("record_timestamp")
    .drop("recordid")
    .drop("datasetid");
The dataframe now contains the following data.
Dropping some of the columns in philadelphia_recreations.csv:
df = df
    .drop("SITE_NAME")
    .drop("OBJECTID")
    .drop("CHILD_OF")
    .drop("TYPE")
    .drop("USE_")
    .drop("DESCRIPTION")
    .drop("SQ_FEET")
    .drop("ALLIAS")
    .drop("CHRONOLOGY")
    .drop("NOTES")
    .drop("DATE_EDITED")
    .drop("EDITED_BY")
    .drop("OCCUPANT")
    .drop("TENANT")
    .drop("LABEL");
The refined dataframe is shown below.
For clarity, we have changed some of the column names. The Java code that performs the required data mining is shown below.
// Needs: import static org.apache.spark.sql.functions.concat;
//        import static org.apache.spark.sql.functions.lit;
// Read the CSV; the multiline option lets quoted fields span several lines.
Dataset<Row> df = spark.read().format("csv").option("multiline", true)
    .option("header", true)
    .load("src/main/resources/philadelphia_recreations.csv");
// Keep only rows whose USE_ value contains "park", ignoring case.
// Equivalent Column-API form (also needs functions.lower):
// df = df.filter(lower(df.col("USE_")).like("%park%"));
df = df.filter("lower(USE_) like '%park%' ");
// Add or rename the columns we keep, then drop everything else.
df = df.withColumn("park_id", concat(lit("phil_"), df.col("OBJECTID")))
    .withColumnRenamed("ASSET_NAME", "park_name")
    .withColumn("city", lit("Philadelphia"))
    .withColumnRenamed("ADDRESS", "address")
    .withColumn("has_playground", lit("UNKNOWN"))
    .withColumnRenamed("ZIPCODE", "zipcode")
    .withColumnRenamed("ACREAGE", "land_in_acres")
    .withColumn("geoX", lit("UNKNOWN"))
    .withColumn("geoY", lit("UNKNOWN"))
    .drop("SITE_NAME")
    .drop("OBJECTID")
    .drop("CHILD_OF")
    .drop("TYPE")
    .drop("USE_")
    .drop("DESCRIPTION")
    .drop("SQ_FEET")
    .drop("ALLIAS")
    .drop("CHRONOLOGY")
    .drop("NOTES")
    .drop("DATE_EDITED")
    .drop("EDITED_BY")
    .drop("OCCUPANT")
    .drop("TENANT")
    .drop("LABEL");
return df;
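To make the row-level effect of the filter and the park_id column concrete, here is a minimal plain-Java sketch that uses no Spark at all. The class name ParkRowSketch, the helper names isPark and parkId, and the sample values are hypothetical and serve only as illustration. The sketch assumes that a null USE_ value is filtered out and that concatenating with a null OBJECTID yields null, which is how Spark's SQL predicate and concat function treat nulls.

```java
import java.util.Locale;

class ParkRowSketch {

    // Mirrors the predicate lower(USE_) like '%park%':
    // a case-insensitive substring test; a null value never matches.
    static boolean isPark(String use) {
        return use != null && use.toLowerCase(Locale.ROOT).contains("park");
    }

    // Mirrors concat(lit("phil_"), col("OBJECTID")):
    // the result is null when the input is null.
    static String parkId(String objectId) {
        return objectId == null ? null : "phil_" + objectId;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows {OBJECTID, USE_}, not taken from the CSV.
        String[][] rows = {
            {"17", "Neighborhood PARK"},
            {"18", "Recreation Center"},
            {"19", null}
        };
        for (String[] row : rows) {
            if (isPark(row[1])) {
                System.out.println(parkId(row[0]) + " is kept");
            }
        }
    }
}
```

Only the first sample row passes the filter, so only it receives a prefixed identifier. In the Spark code the same logic runs across the whole dataframe rather than row by row in a loop.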