I am trying to compare two datasets after loading them using Apache Spark:
final SparkSession sparkSession=SparkSession.builder().appName("Final Project").master("local[3]").getOrCreate();
final DataFrameReader reader = sparkSession.read();
reader.option("header", "true");
Dataset<Row> mainDF = reader.csv(mainFile);
Dataset<Row> compareDF = reader.csv(compareFile);
mainDF.createOrReplaceTempView("main");
compareDF.createOrReplaceTempView("compare");
Dataset<Row> joinDF = sparkSession.sql("SELECT * FROM (SELECT 'main' AS main, main.* FROM main) main NATURAL FULL JOIN (SELECT 'compare' AS compare, compare.* FROM compare) compare WHERE main IS NULL OR compare IS NULL");
joinDF.coalesce(1).write().mode("overwrite")
    .format("csv").option("header", "true")
    .save("src/main/resources/comparePrototype/Test");
Now, my data consists of roughly 200k-300k rows and has columns whose values include, but are not limited to, "null", " ", "NULL", etc., each meaning that there is no data in that column. My joinDF query should result in a dataset containing all the rows that differ between the main and compare datasets. For example, I take a CSV:
id,first_name,last_name,email,gender,ip_address,test
1,Aurelia,Wayvill,awayvill0#theatlantic.com,Female,132.57.62.243,NULL
2,Carey,Winfrey,cwinfrey1#soundcloud.com,Male,138.31.65.57,
3,La verne,Jeannel,ljeannel2#ftc.gov,Female,5.171.43.17,null
4,Norry,Sammut,nsammut3#ihg.com,Female,177.59.155.91,
And another CSV to compare it against:
id,first_name,last_name,email,gender,ip_address,test
1,Aurelia,Wayvill,awayvill0#theatlantic.com,Female,132.57.62.243,
2,Carey,Winfrey,cwinfrey1#soundcloud.com,Male,138.31.65.57,
3,La verne,Jeannel,ljeannel2#ftc.gov,Female,5.171.43.17,
4,Norry,Sammut,nsammut3#ihg.com,Female,177.59.155.91,
Running the above code snippet gives me a CSV file that should ideally be empty; I could also work with the case where it only has the three rows for ids 1 to 3, since they have different test column values. But somehow it also contains row 4, and I don't know how to stop it from flagging rows that have empty columns.
Any idea on how to proceed? Or any changes I should make to my SQL query?
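One possible direction, sketched below under stated assumptions (it relies on org.apache.spark.sql.functions such as col, trim, lower, when, lit and coalesce, plus org.apache.spark.sql.Column), is to normalize the different "empty" markers to a single non-null placeholder in both DataFrames before registering the temp views, so the equality used by the NATURAL FULL JOIN treats them as the same value:
// Sketch: map null, "", " ", "null", "NULL" to a common placeholder ("") in every column.
for (String colName : mainDF.columns()) {
    Column c = lower(trim(coalesce(col(colName), lit(""))));
    Column normalized = when(c.equalTo("").or(c.equalTo("null")), lit("")).otherwise(col(colName));
    mainDF = mainDF.withColumn(colName, normalized);
    compareDF = compareDF.withColumn(colName, normalized);
}
// Then re-register the temp views and run the same join query as before.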
I have loaded a parquet file into a Dataset<Row> in Java and I want to iterate over it row by row and read the value of every attribute in each row.
I have got this far:
Dataset<Row> df = sparkSession.read().format("parquet").load(location);
df.foreach((ForeachFunction<Row>) row -> {
    System.out.println(row);
});
Is there any function in Java to read the attributes of a given row?
PS: I am using Java 11 and Spark 2.4.0.
Well, it's a little tricky in Java: Row's basic get accessor takes a column index rather than a column name, so you cannot pass the name to it directly.
However, there is a function, fieldIndex, which returns the field index for a given column name. Combining the two, we can iterate over the attributes of a Dataset<Row>.
Sample code:
Dataset<Row> df = sparkSession.read().format("parquet").load(location);
df.foreach((ForeachFunction<Row>) row -> {
    System.out.println((String) row.get(row.fieldIndex("attribute_name")));
});
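As a side note (a sketch, using the same placeholder column name as above), Row also offers getAs(String), which performs the fieldIndex lookup internally, so the loop body can be shortened:
Dataset<Row> df = sparkSession.read().format("parquet").load(location);
df.foreach((ForeachFunction<Row>) row -> {
    // getAs(String) resolves the field index by name internally
    String value = row.<String>getAs("attribute_name");
    System.out.println(value);
});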
I am new to Java and Apache Spark and am trying to figure out how to get the values of a single column from a Dataset in Spark as a flat list.
Dataset<Row> sampleData = sparkSession.read()
.....
.option("query", "SELECT COLUMN1, column2 from table1")
.load();
List<Row> columnsList = sampleData.select("COLUMN1")
.where(sampleData.col("COLUMN1").isNotNull()).collectAsList();
String result = StringUtils.join(columnsList, ", ");
// Result I am getting is
[15230321], [15306791], [15325784], [15323326], [15288338], [15322001], [15307950], [15298286], [15327223]
// What I want is:
15230321, 15306791......
How do I achieve this in Spark using Java?
Spark rows can be converted to Strings using Encoders:
List<String> result = sampleData.select("COLUMN1").as(Encoders.STRING()).collectAsList();
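For completeness, a short sketch of how the flattened list can then be turned into the single comma-separated string asked for above (String.join is plain Java, nothing Spark-specific):
List<String> result = sampleData.select("COLUMN1").as(Encoders.STRING()).collectAsList();
String joined = String.join(", ", result);  // e.g. "15230321, 15306791, ..."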
I am pasting the answer in Scala; you can convert it into Java, as there are online tools available.
Also, I am not creating the String result the way you specified, because that would require creating a table and running the query per your process. Instead, I am replicating the problem variable directly using:
import org.apache.spark.sql.Row
val a = List(Row("123"),Row("222"),Row("333"))
Printing a is giving me
List([123], [222], [333])
So apply a simple map operation along with the mkString method to flatten the List:
a.map(x => x.mkString(","))
gives
List(123, 222, 333), which I assume is your expectation.
Let me know if this sorts out your issue.
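For reference, a rough Java equivalent of the Scala snippet above, applied to the collectAsList() result from the question (a sketch; it assumes java.util.stream.Collectors is imported):
// columnsList is the List<Row> collected in the question
List<String> flattened = columnsList.stream()
        .map(row -> row.mkString(","))   // Row.mkString is callable from Java as well
        .collect(Collectors.toList());
String result = String.join(", ", flattened);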
I'm using Spark SQL (Spark 2.1) to read in a Hive table.
The schema of the Hive table is the following (simplified to the only field that is interesting for my problem; the others are irrelevant):
Body, type: binary
The body is a JSON with multiple fields, and the one I'm interested in is an array. At each index of this array I have another JSON object that contains a date.
My goal is to obtain a dataset filled with all the objects of my array that have a date later than "insert the wanted date".
To do so I use the following code:
SparkConf conf = // set the Kryo serializer and Tungsten to true
SparkSession ss = // set the conf on the Spark session
Dataset<String> dataset = creatMyDatasetWithTheGoodType(ss.sql("select * from mytable"));
Dataset<String> finalds = dataset.flatMap((FlatMapFunction<String, String>) json -> {
    List<String> l = new ArrayList<>();
    List<String> ldate = // I use Jackson to obtain the array of dates; this returns a list
    for (int i = 0; i < ldate.size(); i++) {
        // if the date is ok I add it to l
    }
    return l.iterator();
}, Encoders.STRING());
(My code works on a small dataset; I included it to give an idea of what I am doing.)
The problem is that this Hive table has about 22 million rows.
The job ran for 14 hours and didn't finish (I killed it, but there was no error or GC overhead).
I'm running it in yarn-client mode with 4 executors that have 16 GB of memory each. The driver has 4 GB of memory, and each executor has 1 core.
I ran hdfs dfs -du hiveTableLocationPath and got about 45 GB as a result.
What can I do to tune my job?
I recommend trying this UDTF, which allows working on JSON columns within Hive.
It is then possible to manipulate large JSON and fetch the needed data in a distributed and optimized way.
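To give an idea of the pattern only (the specific UDTF the answer refers to is not shown here), Spark SQL can already call Hive's built-in json_tuple UDTF through LATERAL VIEW. In this sketch the column body and the field eventDate are placeholders, and it only extracts top-level fields, not the nested array from the question:
// Pattern sketch: json_tuple is a built-in Hive UDTF, usable from Spark SQL.
Dataset<Row> parsed = ss.sql(
    "SELECT j.eventDate " +
    "FROM mytable t " +
    "LATERAL VIEW json_tuple(CAST(t.body AS STRING), 'eventDate') j AS eventDate " +
    "WHERE j.eventDate > '2017-01-01'");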
I want to join multiple datasets that have some columns with the same name but different data. It is possible to rename columns while converting to a dataframe, but is it possible to rename or set a prefix on column names while using datasets?
Dataset<Row> uct = spark.read().jdbc(jdbcUrl, "uct", connectionProperties);
Dataset<Row> si = spark.read().jdbc(jdbcUrl, "si", connectionProperties).filter("status = 'ACTIVE'");
Dataset<Row> uc = uct.join(si, uct.col("service_id").equalTo(si.col("id")));
uc will have two columns with the same name, 'code', and then it will be difficult to get the value of code from either uct.code or si.code.
DataFrame is an alias for Dataset<Row>, so practically you are already using a dataframe in your code. If you want to retain both of the columns with the same name, you will have to rename one of them before performing the join, using withColumnRenamed.
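A minimal sketch of that approach against the snippet above (the new name si_code is just an illustrative choice):
// Rename the clashing column on one side before the join so both values stay addressable.
Dataset<Row> siRenamed = si.withColumnRenamed("code", "si_code");
Dataset<Row> uc = uct.join(siRenamed, uct.col("service_id").equalTo(siRenamed.col("id")));
// uc now exposes "code" (from uct) and "si_code" (from si) without ambiguity.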
I am using MapReduce and HFileOutputFormat to produce HFiles and bulk load them directly into the HBase table.
Now, while reading the input files, I want to produce HFiles for two tables and bulk load the outputs in a single MapReduce job.
I searched the web and saw some links about MultiHFileOutputFormat, but couldn't find a real solution to that.
Do you think that it is possible?
My way is:
Use HFileOutputFormat as well; when the job is completed, doBulkLoad and write into table1.
Set a List<Put> puts in the mapper, and a MAX_PUTS value globally.
When puts.size() > MAX_PUTS, do:
String tableName = conf.get("hbase.table.name.dic", table2);
HTable table = new HTable(conf, tableName);
table.setAutoFlushTo(false);
table.setWriteBufferSize(1024*1024*64);
table.put(puts);
table.close();
puts.clear();
Notice: you must have a cleanup function to write the remaining puts.
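A rough sketch of that cleanup step in the mapper (the fields puts, conf and table2 are assumed from the description above, and the HTable client matches the older API used in the snippet):
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    // Flush whatever puts are still buffered when the mapper finishes.
    if (!puts.isEmpty()) {
        String tableName = conf.get("hbase.table.name.dic", table2);
        HTable table = new HTable(conf, tableName);
        table.setAutoFlushTo(false);
        table.setWriteBufferSize(1024 * 1024 * 64);
        table.put(puts);
        table.flushCommits();
        table.close();
        puts.clear();
    }
}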