How to iterate a Dataset<Row> and print each attribute value in Java

I have loaded a Parquet file into a Dataset<Row> in Java, and I want to iterate over it row by row and read the value of every attribute in each row.
I have got this far:
Dataset<Row> df = sparkSession.read().format("parquet").load(location);
df.foreach((ForeachFunction<Row>) row -> {
System.out.println(row);
});
Is there a function in Java to read the attributes of a given row?
PS: I am using Java 11 and Spark 2.4.0.

Well, it's a little tricky in Java: Row.get() looks a column up by index rather than by name, but there is also a function, fieldIndex(), that returns the field index for a given column name. Combining the two, we can read the attributes of a Dataset<Row>.
Sample code:
Dataset<Row> df = sparkSession.read().format("parquet").load(location);
df.foreach((ForeachFunction<Row>) row -> {
System.out.println((String) row.get(row.fieldIndex("attribute_name")));
});
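Building on that, a minimal sketch (making no assumptions about the schema) that walks every attribute of each row by combining schema().fieldNames() with fieldIndex():
Dataset<Row> df = sparkSession.read().format("parquet").load(location);
df.foreach((ForeachFunction<Row>) row -> {
    for (String name : row.schema().fieldNames()) {
        // get(fieldIndex(name)) returns the value as Object; cast as needed
        System.out.println(name + " = " + row.get(row.fieldIndex(name)));
    }
});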

Related

Compare two datasets having columns with NULL/Empty values

I am trying to compare two datasets after loading them using Apache Spark:
final SparkSession sparkSession=SparkSession.builder().appName("Final Project").master("local[3]").getOrCreate();
final DataFrameReader reader = sparkSession.read();
reader.option("header", "true");
Dataset<Row> mainDF = reader.csv(mainFile);
Dataset<Row> compareDF = reader.csv(compareFile);
mainDF.createOrReplaceTempView("main");
compareDF.createOrReplaceTempView("compare");
Dataset<Row> joinDF = sparkSession.sql("SELECT * FROM (SELECT 'main' AS main, main.* FROM main) main NATURAL FULL JOIN (SELECT 'compare' AS compare, compare.* FROM compare) compare WHERE main IS NULL OR compare IS NULL");
joinDF.coalesce(1).write().mode("overwrite")
    .format("csv").option("header", "true")
    .save("src/main/resources/comparePrototype/Test");
Now, my data consists of roughly 200k-300k rows and has columns whose values include, but are not limited to, "null", " ", "NULL", etc., each of which means there is no data in that column. My joinDF query should produce a dataset containing all the rows that differ between the main and compare datasets. For example, take this CSV:
id,first_name,last_name,email,gender,ip_address,test
1,Aurelia,Wayvill,awayvill0#theatlantic.com,Female,132.57.62.243,NULL
2,Carey,Winfrey,cwinfrey1#soundcloud.com,Male,138.31.65.57,
3,La verne,Jeannel,ljeannel2#ftc.gov,Female,5.171.43.17,null
4,Norry,Sammut,nsammut3#ihg.com,Female,177.59.155.91,
And another CSV to compare it against:
id,first_name,last_name,email,gender,ip_address,test
1,Aurelia,Wayvill,awayvill0#theatlantic.com,Female,132.57.62.243,
2,Carey,Winfrey,cwinfrey1#soundcloud.com,Male,138.31.65.57,
3,La verne,Jeannel,ljeannel2#ftc.gov,Female,5.171.43.17,
4,Norry,Sammut,nsammut3#ihg.com,Female,177.59.155.91,
Running the above code snippet gives me a CSV file that should ideally be empty, or at least contain only the three rows with ids 1 to 3, since those have different test column values. But somehow it also contains row 4, and I don't know how to stop it from flagging rows that have empty columns on both sides.
Any idea how to proceed? Or any change I should make to my SQL query?
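One direction to try (a sketch, not a verified fix; the helper name canonicalizeMissing is made up for illustration): collapse every "no data" marker, including the real NULLs the CSV reader produces for empty fields, into one canonical value in both datasets before registering the temp views, so the equality used by the NATURAL JOIN treats them as the same:
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;

private static Dataset<Row> canonicalizeMissing(Dataset<Row> df) {
    Dataset<Row> result = df;
    for (String c : df.columns()) {
        // NULL, blanks, "null" and "NULL" all become the empty string
        Column cleaned = coalesce(trim(col(c)), lit(""));
        result = result.withColumn(c,
                when(cleaned.isin("", "null", "NULL"), lit("")).otherwise(col(c)));
    }
    return result;
}

// applied before createOrReplaceTempView:
// mainDF = canonicalizeMissing(mainDF);
// compareDF = canonicalizeMissing(compareDF);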

How to convert a JSON string to a specific dataframe (Dataset) in Spark?

I want to convert a JSON string to a specific dataframe in Spark.
Spark can easily produce an automatically generated dataframe:
Dataset<Row> df = sparkSession.read().json(javaRdd); // javaRdd is a JavaRDD<String>
but the result is not what I want, because the automatically generated dataframe contains only struct and array types,
while the target Hive table contains fields whose data type is map, so I can't write the Dataset directly to HDFS.
I can provide the right StructType and the JSON, and I want to get a Row based on the StructType I provide, with an API like the following:
Row transJsonStrToSpecificRow(StructType specificStruct,String json)
Does anyone have a solution?
Thanks!
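I'm not aware of a built-in Row transJsonStrToSpecificRow helper, but as a sketch of one possible direction (the schema and column names below are only examples): pass the desired schema to the JSON reader, or to from_json, and fields declared as MapType in that schema are parsed as maps instead of structs:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

// Example schema: "attrs" should end up as map<string,string> for the Hive table.
StructType schema = new StructType()
        .add("id", DataTypes.StringType)
        .add("attrs", DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType));

// Option 1: apply the schema while reading the JSON strings (jsonDataset is a Dataset<String>).
Dataset<Row> typed = sparkSession.read().schema(schema).json(jsonDataset);

// Option 2: keep the JSON in a string column and parse it with from_json.
// Dataset<Row> typed = df.select(from_json(col("json"), schema).as("parsed")).select("parsed.*");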

Get a single column's values as a flat list in Apache Spark using Java

I am new to Java and Apache Spark and am trying to figure out how to get the values of a single column from a Dataset in Spark as a flat list.
Dataset<Row> sampleData = sparkSession.read()
.....
.option("query", "SELECT COLUMN1, column2 from table1")
.load();
List<Row> columnsList = sampleData.select("COLUMN1")
.where(sampleData.col("COLUMN1").isNotNull()).collectAsList();
String result = StringUtils.join(columnsList, ", ");
// Result I am getting is
[15230321], [15306791], [15325784], [15323326], [15288338], [15322001], [15307950], [15298286], [15327223]
// What I want is:
15230321, 15306791......
How do I achieve this in spark using java?
A Spark Row can be converted to a String with Encoders:
List<String> result = sampleData.select("COLUMN1").as(Encoders.STRING()).collectAsList();
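If the goal is the comma-separated string from the question, the collected list can then be joined directly (a small follow-up, not part of the original answer):
String joined = String.join(", ", result);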
I am posting the answer in Scala; you can convert it to Java, as there are online tools available.
Also, I am not building the String result the way you specified, because that would require creating the table and running the query as in your process. Instead I am replicating the problem variable directly using
import org.apache.spark.sql.Row
val a = List(Row("123"),Row("222"),Row("333"))
Printing a gives me
List([123], [222], [333])
So apply a simple map operation along with the mkString method to flatten the List:
a.map(x => x.mkString(","))
which gives
List(123, 222, 333), which I assume is what you expect.
Let me know if this sorts out your issue.
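For reference, the same idea as a Java sketch against the columnsList from the question (Row.mkString is available in the Java API as well):
import java.util.List;
import java.util.stream.Collectors;

List<String> flat = columnsList.stream()
        .map(row -> row.mkString(","))   // [15230321] -> 15230321
        .collect(Collectors.toList());
String result = String.join(", ", flat);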

How to deal with a JSON stored in a row of a very large unpartitioned Hive table

I'm using Spark SQL (Spark 2.1) to read a Hive table.
The schema of the Hive table is the following (simplified to the only field that matters for my problem; the others are not useful):
Body type: Binary
The body is a JSON document with multiple fields, and the one I'm interested in is an array. At each index of this array there is another JSON object that contains a date.
My goal is to obtain a dataset containing all the objects of the array whose date is later than "insert the wanted date".
To do so I use the following code:
SparkConf conf = // set the Kryo serializer and Tungsten to true
SparkSession ss = // set the conf on the SparkSession
Dataset<String> dataset = createMyDatasetWithTheGoodType(ss.sql("select * from mytable"));
Dataset<String> finalds = dataset.flatMap((FlatMapFunction<String, String>) json -> {
    List<String> l = new ArrayList<>();
    List<String> ldate = // I use Jackson to obtain the array of dates; this returns a list
    for (int i = 0; i < ldate.size(); i++) {
        // if the date is OK, add it to l
    }
    return l.iterator();
}, Encoders.STRING());
(My code works on a small dataset; the snippet above is just to give an idea of what I am doing.)
The problem is that this Hive table has about 22 million rows.
The job ran for 14 hours and didn't finish (I killed it, but there was no error or GC overhead).
I'm running it in yarn-client mode with 4 executors of 16 GB of memory each; the driver has 4 GB of memory, and each executor has 1 core.
I ran hdfs dfs -du hiveTableLocationPath and got about 45 GB as a result.
What can I do to tune my job?
I recommend trying this UDTF, which allows working on JSON columns within Hive.
It is then possible to manipulate large JSON documents and fetch the needed data in a distributed and optimized way.
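The UDTF itself isn't reproduced here. As an alternative, Spark-side sketch (the JSON layout below, with an "events" array holding a "date" field, is an assumption about the payload), the built-in from_json and explode functions can do the extraction and filtering on the executors instead of in a hand-written flatMap:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.*;

// Assumed body layout: { "events": [ { "date": "..." }, ... ] } - adjust to the real payload.
StructType eventType = new StructType().add("date", DataTypes.StringType);
StructType bodySchema = new StructType()
        .add("events", DataTypes.createArrayType(eventType));

Dataset<Row> wanted = ss.sql("select body from mytable")
        .select(from_json(col("body").cast("string"), bodySchema).as("parsed"))
        .select(explode(col("parsed.events")).as("event"))
        .filter(col("event.date").gt(lit("2017-01-01"))); // placeholder for "the wanted date"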

Put values from Spark RDD to the same HBase column with default timestamp

I'm using Spark and trying to write an RDD to an HBase table.
Here is the sample code:
public static void main(String[] args) {
// ... code omitted
JavaPairRDD<ImmutableBytesWritable, Put> hBasePutsRDD = rdd
.javaRDD()
.flatMapToPair(new MyFunction());
hBasePutsRDD.saveAsNewAPIHadoopDataset(job.getConfiguration());
}
private class MyFunction implements
PairFlatMapFunction<Row, ImmutableBytesWritable, Put> {
public Iterable<Tuple2<ImmutableBytesWritable, Put>> call(final Row row)
throws Exception {
List<Tuple2<ImmutableBytesWritable, Put>> puts = new ArrayList<>();
Put put = new Put(getRowKey(row));
String value = row.getAs("rddFieldName");
put.addColumn("CF".getBytes(Charset.forName("UTF-8")),
"COLUMN".getBytes(Charset.forName("UTF-8")),
value.getBytes(Charset.forName("UTF-8")));
return Collections.singletonList(
new Tuple2<>(new ImmutableBytesWritable(getRowKey(row)), put));
}
}
If I manually set the timestamp like this:
put.addColumn("CF".getBytes(Charset.forName("UTF-8")),
"COLUMN".getBytes(Charset.forName("UTF-8")),
manualTimestamp,
value.getBytes(Charset.forName("UTF-8")));
everything works fine and I have as many cell versions in the HBase column "COLUMN" as there are different values in the RDD.
But if I do not, there is only one cell version.
In other words, if there are multiple Put objects with the same column family and column, different values, and the default timestamp, only one value is inserted and the others are omitted (possibly overwritten).
Could you please help me understand how this works in this case (saveAsNewAPIHadoopDataset especially) and how I can modify the code so that all values are inserted without setting a timestamp manually.
They are overwritten when you don't set your own timestamp. HBase needs a unique key for every value, so the real key for every value is
rowkey + column family + column key + timestamp => value
When you don't set a timestamp and the values are inserted in bulk, many of them get the same timestamp, since HBase can insert multiple rows within the same millisecond. So you need a custom timestamp for every value that shares the same column key.
I don't understand why you don't want to use a custom timestamp, since you said it already works. If you think it will use extra space in the database: HBase already stores a timestamp even if you don't give one in the Put command, so nothing changes when you set it manually. Please use it.
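As a minimal sketch of what this suggests, reusing the question's own addColumn call (the eventTimeMillis column is hypothetical; any per-record time, or another value that is distinct per version, would do):
// Inside MyFunction.call(row): set an explicit timestamp per Put so that versions
// written within the same millisecond do not overwrite each other.
long ts = row.<Long>getAs("eventTimeMillis"); // hypothetical per-record time column
put.addColumn("CF".getBytes(Charset.forName("UTF-8")),
        "COLUMN".getBytes(Charset.forName("UTF-8")),
        ts,
        value.getBytes(Charset.forName("UTF-8")));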
