How to get column names of Spark Row using java

I am trying to convert a Spark DataFrame to an RDD and apply a function using map.
In PySpark, we can fetch the value of any column by converting the Row to a dictionary (key being the column name, value being that column's value), as below:
row_dict = row.asDict()
val = row_dict['column1']  # I can access the value of any column
Now, in Java, I am trying to do a similar thing. I am getting the Row, and I found that it has APIs to get values based on their index:
JavaRDD<Row> resultRdd = df.javaRDD().map(x -> customFunction(x, customParam1, customParam2));

public static Row customFunction(Row row, Object o1, Object o2) {
    // need to access the "column1" value from the row
    // how do I get the column name of each index if I have to use row.get(index)?
}
How can I access row values based on column names in Java code?
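For reference, Spark's Row API does offer name-based access, so customFunction does not have to rely on raw indexes; a minimal sketch (fieldIndex and getAs are standard org.apache.spark.sql.Row methods, and "column1" is the example column from the question):

import org.apache.spark.sql.Row;

public static Row customFunction(Row row, Object o1, Object o2) {
    // resolve a column name to its positional index, then read by index...
    int idx = row.fieldIndex("column1");
    Object viaIndex = row.get(idx);

    // ...or read the value directly by name
    String viaName = row.getAs("column1");

    // row.schema().fieldNames() lists all column names in index order
    return row;
}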

Related

Read values from Java Map using Spark Column using java

I have tried the code below to look up Map values via a Spark column in Java, but I am getting null values where I expect the exact value from the Map for each key.
The Spark Dataset is named dataset1 and contains a single column named KEY.
Values in the dataset:
KEY
1
2
Java code:
Map<String, String> map1 = new HashMap<>();
map1.put("1", "CUST1");
map1.put("2", "CUST2");
dataset1.withColumn("ABCD", functions.lit(map1.get(col("KEY"))));
Current Output is:
ABCD (Column name)
null
null
Expected Output :
ABCD (Column name)
CUST1
CUST2
Please help me get this expected output.
The reason why you get this output is pretty simple. The get function in Java can take any object as input; if that object is not in the map, the result is null.
The lit function in Spark is used to create a single-value column (all rows have the same value). For example, lit(1) creates a column that takes the value 1 for each row.
Here, map1.get(col("KEY")) (which is executed on the driver) asks map1 for the value corresponding to a Column object (not the value inside the column, but the Java/Scala object representing the column). The map does not contain that object, so the result is null, and you could just as well have written lit(null). This is why you get a null result inside your dataset.
To solve your problem, you could wrap your map access within a UDF for instance. Something like:
UserDefinedFunction map_udf = udf(new UDF1<String, String>() {
    @Override
    public String call(String x) {
        // look up the key's value on the executors, row by row
        return map1.get(x);
    }
}, DataTypes.StringType);

spark.udf().register("map_udf", map_udf);
result.withColumn("ABCD", expr("map_udf(KEY)"));
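Registering the UDF is only required for calling it by name inside expr/SQL; the same UserDefinedFunction can also be applied to a column directly, e.g.:

result.withColumn("ABCD", map_udf.apply(col("KEY")));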

How to Alias a DataSet column before writing to a parquet in Java

I am working with Apache Spark in Java, and what I am trying to do is filter some data, group it by a specific key, and then count the number of elements for each key. At the moment I am doing this:
Dataset<MyBean> rawEvents = readData(spark);
Dataset<MyBean> filtered = rawEvents.filter((FilterFunction<MyBean>) events ->
    //filter function
);
KeyValueGroupedDataset<String, MyBean> grouped = filtered
    .groupByKey((MapFunction<MyBean, String>) event -> {
        return event.getKey();
    }, Encoders.STRING());
grouped.count().write().parquet("output.parquet");
It fails to write because: org.apache.spark.sql.AnalysisException: Attribute name "count(1)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
How can I alias the count column so this does not happen?
grouped.count() returns a Dataset<Tuple2<String, Object>> in your case.
Essentially, renaming a column in the Dataset object will solve your problem.
You can use withColumnRenamed method of Dataset API.
grouped.count().withColumnRenamed("count(1)", "counts").write().parquet("output.parquet");
After grouped.count(), select all the columns, aliasing the count column, and then call the write method.
Example:
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Column;

Column[] colList = { col("column1"), col("column2"), col("count(1)").alias("count") };
grouped.count().select(colList).write().parquet("output.parquet");
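Note that column1 and column2 in colList above are placeholders; if you're unsure what columns grouped.count() actually produces, printing the schema first avoids guessing:

grouped.count().printSchema();  // e.g. shows the "count(1)" column that needs aliasing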

Hive UDF, struct type loses type information. Is there any way to recover type information?

My table has mostly double-type columns and some string columns. I created the table from a text file using row format serde 'org.openx.data.jsonserde.JsonSerDe'.
I first combine these columns using the named_struct function and pass the struct to my UDF, something like this:
select id, my_udf(named_struct("key1", col1, "key2", col2, "key3", col3, "key4", col4), other_udf_param1, other_udf_param2);
So col1, col2 and col3 are of double type and col4 is of type String.
But all of them get converted to String.
This is a snippet from my evaluate function.
List<? extends StructField> fields = this.dataOI.getAllStructFieldRefs();
for (int i = 0; i < fields.size(); i++) {
    System.out.println(fields.get(i).toString());
    String canName = this.featuresOI.getStructFieldData(arguments[2].get(), fields.get(i))
            .getClass().getCanonicalName();
    System.out.println(canName + " can name");
    System.out.println(this.dataOI.getStructFieldData(arguments[2].get(), fields.get(i)));
}
This returns all of them as strings.
Is there a way I could preserve the column types?
Yes, the column types are preserved in the field ObjectInspectors. The same behaviour can be observed on the Hive CLI for named_struct; for map, however, the inputs are all converted to strings.
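A minimal sketch of how those types can be read back from the struct's ObjectInspector, rather than from the runtime class of the (lazy) field data, assuming dataOI is the StructObjectInspector obtained in initialize:

import java.util.List;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;

List<? extends StructField> fields = this.dataOI.getAllStructFieldRefs();
for (StructField field : fields) {
    ObjectInspector fieldOI = field.getFieldObjectInspector();
    if (fieldOI instanceof PrimitiveObjectInspector) {
        // prints DOUBLE for col1..col3 and STRING for col4
        System.out.println(field.getFieldName() + " -> "
                + ((PrimitiveObjectInspector) fieldOI).getPrimitiveCategory());
    }
}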

Retrieve DataFrame Values in a Java Array

I am using Apache Spark. I want to retrieve the values of a DataFrame in a String-type array. I have created a table using the DataFrame:
dataframe.registerTempTable("table_name");
DataFrame d2=sqlContext.sql("Select * from table_name");
Now I want to retrieve this data into a Java array (String type would be fine). How can I do that?
You can use the collect() method to get a Row[]. Each Row contains the column values of your DataFrame. If there is a single value in each row, you can add them to an ArrayList of String. If there is more than one column in each row, use an ArrayList of your custom object type and set its properties. In the code below, instead of printing "Row Data" you can add the values to the ArrayList.
Row[] dataRows = d2.collect();
for (Row row : dataRows) {
    System.out.println("Row : " + row);
    for (int i = 0; i < row.length(); i++) {
        System.out.println("Row Data : " + row.get(i));
    }
}
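As a concrete sketch of the single-column case described above (assuming the table's only column holds strings):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Row;

List<String> values = new ArrayList<>();
for (Row row : d2.collect()) {
    // column 0 is the only column; getString reads it as a String
    values.add(row.getString(0));
}
String[] result = values.toArray(new String[0]);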

how to export dynamic column data to excel

I have to export data to Excel with a dynamic number of columns, using the Java Apache POI workbook.
On every execution, the column details are generated dynamically and saved in a List<Object>:
List<Object> expColName = new ArrayList<Object>();
From the List, I have to obtain the individual values and export them into each column of the Excel sheet:
for (int i = 0; i < expColName.size(); i++) {
    data.put("1", new Object[] { expColName.get(i) });
}
The above code gives only the last column value in the excel sheet
What type is data and how do you read the values from the map?
It seems like you are putting every object into the same key of the Map; that's why you only get the last item from the list.
You could try to give it a test with:
for (int i = 0; i < expColName.size(); i++) {
    data.put(i + "", new Object[] { expColName.get(i) });
}
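For reference, a minimal sketch of writing such a list across the columns of a single sheet row directly with Apache POI (the sheet and file names are placeholders, and expColName is the list from the question):

import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

try (Workbook wb = new XSSFWorkbook()) {
    Sheet sheet = wb.createSheet("export");
    Row header = sheet.createRow(0);
    // one cell per list entry, so every value gets its own column
    for (int i = 0; i < expColName.size(); i++) {
        header.createCell(i).setCellValue(String.valueOf(expColName.get(i)));
    }
    try (FileOutputStream out = new FileOutputStream("export.xlsx")) {
        wb.write(out);
    }
}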
