Using map on dataset with arbitrary rows in Spark SQL - java

I'm trying to use the DataFrame map function on an arbitrary Dataset. However, I don't understand how you would map from Row -> Row. No examples are given for arbitrary data in the Spark SQL documentation:
Dataset<Row> original_data = ...
Dataset<Row> changed_data = original_data.map(new MapFunction<Row, Row>() {
    @Override
    public Row call(Row row) throws Exception {
        Row newRow = RowFactory.create(obj1, obj2);
        return newRow;
    }
}, Encoders.bean(Row.class));
However, this does not work, since there needs to be some sort of Encoder.
How can I map to a generic Row?

If obj1 and obj2 are not primitive types, then represent their schema as a StructType to create a Row encoder. Instead of using the Row type, I would suggest creating a custom bean that stores both obj1 and obj2, and then using that custom bean's encoder in the map transformation.
Row type:
StructType customStructType = new StructType();
customStructType = customStructType.add("obj1", DataTypes.<type>, false);   // replace <type> with the appropriate DataType
customStructType = customStructType.add("obj2", DataTypes.<type>, false);

ExpressionEncoder<Row> customTypeEncoder = RowEncoder.apply(customStructType);

Dataset<Row> changed_data = original_data.map(
        (MapFunction<Row, Row>) row -> RowFactory.create(obj1, obj2),
        customTypeEncoder);
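For example, if obj1 were a String and obj2 a Double (hypothetical types, purely to make the snippet concrete), the schema and map call could look like this:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical schema: obj1 as a String, obj2 as a Double
StructType customStructType = new StructType()
        .add("obj1", DataTypes.StringType, false)
        .add("obj2", DataTypes.DoubleType, false);

Dataset<Row> changed_data = original_data.map(
        // assumes the first two input columns hold obj1 and obj2
        (MapFunction<Row, Row>) row -> RowFactory.create(row.getString(0), row.getDouble(1)),
        RowEncoder.apply(customStructType));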
Custom Bean type:
class CustomBean implements .... {
    Object obj1;
    Object obj2;
    ....
}

Dataset<CustomBean> changed_data = original_data.map(
        (MapFunction<Row, CustomBean>) row -> new CustomBean(obj1, obj2),
        Encoders.bean(CustomBean.class));
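Note that Encoders.bean expects a JavaBean-style class: public, with a public no-arg constructor and getters/setters for each field. A minimal sketch of such a bean, assuming obj1 and obj2 are Strings (hypothetical types):

import java.io.Serializable;

public class CustomBean implements Serializable {   // Serializable is not required by the encoder itself, but helps if instances are captured in closures
    private String obj1;   // hypothetical field types
    private String obj2;

    public CustomBean() { }                          // no-arg constructor needed by Encoders.bean

    public CustomBean(String obj1, String obj2) {
        this.obj1 = obj1;
        this.obj2 = obj2;
    }

    public String getObj1() { return obj1; }
    public void setObj1(String obj1) { this.obj1 = obj1; }
    public String getObj2() { return obj2; }
    public void setObj2(String obj2) { this.obj2 = obj2; }
}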

Related

How to convert ArrayList into Scala Array in Spark

I want to create a StructType dynamically out of a JSON file. I iterate over my fields and I want to find out how (if it is even possible) I can build some list during my iteration, and then create a StructType from it.
The code I've tried:
List<StructField> structFieldList = new ArrayList<>();
for (String field : fields.values()) {
    StructField sf = DataTypes.createStructField(field, DataTypes.StringType, true);
    structFieldList.add(sf);
}
StructType structType = new StructType(structFieldList.toArray());
But this doesn't compile. Is there any way to do this?
Here you don't need to convert an ArrayList to a Scala Array, as the StructType constructor takes a Java StructField[] array as its argument.
Your code can be fixed by passing a typed array to the .toArray() call in the last line of your snippet, so that it returns a StructField[] array instead of an Object[] array:
List<StructField> structFieldList = new ArrayList<>();
for (String field : fields.values()) {
    StructField sf = DataTypes.createStructField(field, DataTypes.StringType, true);
    structFieldList.add(sf);
}
StructType structType = new StructType(structFieldList.toArray(new StructField[0]));
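As a side note, DataTypes.createStructType also accepts a java.util.List<StructField> directly, so the array conversion can be skipped entirely:

// builds the same StructType straight from the list
StructType structType = DataTypes.createStructType(structFieldList);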

Spark Java: How to sort ArrayType(MapType) by key and access the values of the sorted array

So, I have a dataframe with this schema:
StructType(StructField(experimentid,StringType,true), StructField(descinten,ArrayType(MapType(StringType,DoubleType,true),true),true))
and the content is like this:
+----------------+-------------------------------------------------------------+
|experimentid |descinten |
+----------------+-------------------------------------------------------------+
|id1 |[[x1->120.51513], [x2->57.59762], [x3->83028.867]] |
|id2 |[[x2->478.5698], [x3->79.6873], [x1->341.89]] |
+----------------+-------------------------------------------------------------+
I want to sort "descinten" by key in ascending order and then take the sorted values. I tried mapping and sorting each row separately, but I was getting errors like:
ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.collection.Map
or similar. Is there a more straightforward way to do it in Java?
For anyone interested, I managed to solve it with map and a TreeMap for sorting. My aim was to create vectors of the values in ascending order based on their keys.
StructType schema = new StructType(new StructField[] {
        new StructField("expids", DataTypes.StringType, false, Metadata.empty()),
        new StructField("intensity", new VectorUDT(), false, Metadata.empty())
});

Dataset<Row> newdf = olddf.map((MapFunction<Row, Row>) originalrow -> {
    String firstpos = originalrow.get(0).toString();
    List<scala.collection.Map<String, Double>> mplist = originalrow.getList(1);
    int s = mplist.size();

    // TreeMap keeps the entries sorted by key in ascending order.
    // Note: String keys sort lexicographically, so e.g. "x10" would come before "x2".
    Map<String, Double> treemp = new TreeMap<>();
    for (int k = 0; k < s; k++) {
        java.util.Map<String, Double> entry = JavaConversions.mapAsJavaMap(mplist.get(k));
        Object[] keys = entry.keySet().toArray();
        Object[] values = entry.values().toArray();
        // each element of the array is a single-entry map: key -> value
        treemp.put(keys[0].toString(), Double.parseDouble(values[0].toString()));
    }

    // collect the values in key order into a dense vector
    Object[] sortedValues = treemp.values().toArray();
    double[] tmplist = new double[sortedValues.length];
    for (int i = 0; i < sortedValues.length; i++) {
        tmplist[i] = Double.parseDouble(sortedValues[i].toString());
    }

    return RowFactory.create(firstpos, Vectors.dense(tmplist));
}, RowEncoder.apply(schema));

Spring JDBCTemplate : Construct JSON without special characters

I am reading a table from a PostgreSQL DB and populating all columns and their values into a JSON object.
One of the columns in Postgres is of type json, so the output has a lot of escape characters, like below for the key dummykeyname.
{
    "XY": "900144",
    "id": 1,
    "date": 1556167980000,
    "type": "XX50",
    "dummykeyname": {
        "type": "json",
        "value": "{\"XXXX\": 14445.0, \"YYYY\": 94253.0}"
    }
}
I want the output to look like
"value": "{"XXXX": 14445.0, "YYYY": 94253.0}"
The code I used is:
JSONArray entities = new JSONArray();
var rm = (RowMapper<?>) (ResultSet result, int rowNum) -> {
    while (result.next()) {
        JSONObject entity = new JSONObject();
        ResultSetMetaData metadata = result.getMetaData();
        int columnCount = metadata.getColumnCount() + 1;
        IntStream.range(1, columnCount).forEach(nbr -> {
            try {
                entity.put(result.getMetaData().getColumnName(nbr), result.getObject(nbr));
            } catch (SQLException e) {
                LOGGER.error(e.getMessage());
            }
        });
        entities.add(entity);
    }
    return entities;
};
Library used:
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
Please guide me on where I am going wrong.
Take a different approach.
1) First create a POJO of the required columns.
For example, if your table has 4 columns id, name, country, mobile, create a class Employee and populate it using a RowMapper available in Spring JDBC.
2) Create a class EmployeeList, which has a List<Employee>, and add each Employee object created by the RowMapper to that list.
3) Then use
a) ObjectMapper mapper = new ObjectMapper();
b) mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL);
c) mapper.writeValueAsString(employeeListObject);
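A rough sketch of that flow (class, variable, and query names such as Employee, jdbcTemplate, and the SQL text are hypothetical; Spring's BeanPropertyRowMapper stands in for a hand-written RowMapper, and the list is serialized directly instead of going through a separate EmployeeList wrapper):

import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.jdbc.core.BeanPropertyRowMapper;
import java.util.List;

public class Employee {                 // hypothetical POJO matching the table columns
    private Long id;
    private String name;
    private String country;
    private String mobile;

    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getCountry() { return country; }
    public void setCountry(String country) { this.country = country; }
    public String getMobile() { return mobile; }
    public void setMobile(String mobile) { this.mobile = mobile; }
}

// Map each row to an Employee, then serialize the whole list with Jackson.
List<Employee> employees = jdbcTemplate.query(
        "SELECT id, name, country, mobile FROM employee",
        new BeanPropertyRowMapper<>(Employee.class));

ObjectMapper mapper = new ObjectMapper();
mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL);   // drop null fields
String json = mapper.writeValueAsString(employees);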

Unable to understand UDFs in Spark and especially in Java

I am trying to create a new column in a Spark Dataset based on another column's value. The other column's value is looked up as a key in a JSON file, and the value returned for that key is the value to be used for the new column.
Here is the code that I tried, but it doesn't work, and I am not sure how UDFs work either. How do you add a column in this case using withColumn or a udf?
Dataset<Row> df = spark.read().format("csv").option("header", "true").load("file path");
Object obj = new JSONParser().parse(new FileReader("json path"));
JSONObject jo = (JSONObject) obj;
df = df.withColumn("cluster", functions.lit(jo.get(df.col("existing col_name")))));
Any help will be appreciated. Thanks in advance!
Spark allows you to create custom user-defined functions (UDFs) using the udf function.
The following is a Scala snippet showing how to define a UDF.
val obj = new JSONParser().parse(new FileReader("json path"));
val jo = obj.asInstanceOf[JSONObject];
def getJSONObject(key: String) = {
    jo.get(key)
}
Once you have defined your function, you can convert it to a UDF as:
val getObject = udf(getJSONObject _)
There are two approaches for using a UDF.
df.withColumn("cluster", lit(getObject(col("existing_col_name"))))
If you are using Spark SQL, you have to register your UDF with the sqlContext before you use it.
spark.sqlContext.udf.register("get_object", getJSONObject _)
And then you can use it as
spark.sql("select get_object(existing_column) from some_table")
Out of these, which to use is completely subjective.
Thanks @Constantine. I was able to better understand UDFs from your example. Here is my Java code:
Object obj = new JSONParser().parse(new FileReader("json path"));
JSONObject jo = (JSONObject) obj;
spark.udf().register("getJsonVal", new UDF1<String, String>() {
#Override
public String call(String key) {
return (String) jo.get(key.substring(0, 5));
}
}, DataTypes.StringType);
df = df.withColumn("cluster", functions.callUDF("getJsonVal", df.col("existing col_name")));
df.show(); // SHOWS NEW CLUSTER COLUMN

Append a column to Data Frame in Apache Spark 1.4 in Java

I am trying to add a column to my DataFrame that serves as a unique ROW_ID for each row. So, it would be something like this:
1, user1
2, user2
3, user3
...
I could have done this easily using a HashMap with an incrementing integer, but I can't do this in Spark using the map function on the DataFrame, since I can't have an integer increasing inside the map function. Is there any way that I can do this by appending one column to my existing DataFrame, or any other way?
PS: I know there is a very similar post, but that's for Scala and not Java.
Thanks in advance
I did it by adding a new column containing UUIDs to the DataFrame.
StructType objStructType = inputDataFrame.schema();
StructField[] arrStructField = objStructType.fields();
List<StructField> fields = new ArrayList<StructField>();
List<StructField> newfields = new ArrayList<StructField>();
List<StructField> listFields = Arrays.asList(arrStructField);

// leftCol is the name of the new UUID column
StructField a = DataTypes.createStructField(leftCol, DataTypes.StringType, true);
fields.add(a);
newfields.addAll(listFields);
newfields.addAll(fields);

final int size = objStructType.size();
JavaRDD<Row> rowRDD = inputDataFrame.javaRDD().map(new Function<Row, Row>() {
    private static final long serialVersionUID = 3280804931696581264L;

    public Row call(Row tblRow) throws Exception {
        Object[] newRow = new Object[size + 1];
        int rowSize = tblRow.length();
        for (int itr = 0; itr < rowSize; itr++) {
            if (tblRow.apply(itr) != null) {
                newRow[itr] = tblRow.apply(itr);
            }
        }
        // append the UUID as the last column
        newRow[size] = UUID.randomUUID().toString();
        return RowFactory.create(newRow);
    }
});

inputDataFrame = objsqlContext.createDataFrame(rowRDD, DataTypes.createStructType(newfields));
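As a side note, depending on your Spark version, org.apache.spark.sql.functions also offers monotonicallyIncreasingId() (renamed monotonically_increasing_id() in later releases), which appends a unique id column without a custom map, though the ids are unique and increasing rather than consecutive (not 1, 2, 3, ...). The column name row_id is just an example:

// ids are unique and monotonically increasing, but not consecutive
DataFrame withId = inputDataFrame.withColumn("row_id", functions.monotonicallyIncreasingId());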
OK, I found the solution to this problem and I'm posting it in case someone has the same problem:
The way to do this is zipWithIndex on the JavaRDD:
JavaRDD<Row> indexedRdd = df.javaRDD().zipWithIndex().map(new Function<Tuple2<Row, Long>, Row>() {
    @Override
    public Row call(Tuple2<Row, Long> v1) throws Exception {
        return RowFactory.create(v1._1().getString(0), v1._2());
    }
});
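Since zipWithIndex works on the JavaRDD, the indexedRdd above still has to be wrapped back into a DataFrame with a schema that includes the extra index column, roughly like this (a sketch; the column names are hypothetical and assume the original frame had a single string column, as in the snippet above):

StructType newSchema = new StructType(new StructField[] {
        DataTypes.createStructField("user", DataTypes.StringType, true),   // original column
        DataTypes.createStructField("row_id", DataTypes.LongType, true)    // appended index
});
DataFrame withRowId = sqlContext.createDataFrame(indexedRdd, newSchema);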
