How to combine two JavaPairRDDs into a custom JavaPairRDD? - java

I have created the following JavaPairRDDs from the data received from different API endpoints.
listHeaderRDD<Integer, TeachersList> -> {list_id, list_details}
e.g {1,{list_id:1,name:"abc",quantity:"2"}},
{2,{list_id:2,name:"xyz",quantity:"5"}}...
ItemsGroupListRDD<Integer, Iterable<Tuple2<Integer, TeachersListItem>>> ->
{list_id, {{item_id1,item_details1},{item_id2,item_details2}..}}
e.g {1, {{11,{item_id:11,item_name:"abc"}},{12,{item_id:12,item_name:"acv"}}}..}
{2, {{14,{item_id:14,item_name:"bnh"}},{18,{item_id:18,item_name:"hjk"}}}..}
Desired output:
teachersListRDD<TeachersList, Iterable<TeachersListItem>> -> {list_details, all_item_details}
e.g {{{list_id:1,name:"abc",quantity:"2"},{{item_id:11,item_name:"abc"},{item_id:12,item_name:"acv"}}},
{{list_id:2,name:"xyz",quantity:"5"},{{item_id:14,item_name:"bnh"},{item_id:18,item_name:"hjk"}}}
}
Basically, I want the value of the first RDD to become the key of the desired RDD, and the group of item_details from the second RDD corresponding to that list_id to become the value, i.e. teachersListRDD.
I have tried different ways to do this but have been unable to get the desired output.
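One approach that should produce that shape is a plain join on list_id followed by a mapToPair that drops the key and the inner item ids. This is only a sketch against the types shown above; the variable and method names outside the question (the loop variable, the helper list) are made up and the real classes may differ:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Join on list_id: each record becomes (list_id, (list_details, iterable of (item_id, item_details)))
JavaPairRDD<TeachersList, Iterable<TeachersListItem>> teachersListRDD =
    listHeaderRDD
        .join(ItemsGroupListRDD)
        .mapToPair(entry -> {
            TeachersList listDetails = entry._2()._1();
            List<TeachersListItem> items = new ArrayList<>();
            for (Tuple2<Integer, TeachersListItem> item : entry._2()._2()) {
                items.add(item._2());   // keep item_details, drop item_id
            }
            return new Tuple2<TeachersList, Iterable<TeachersListItem>>(listDetails, items);
        });

Note that the result only behaves well in later keyed operations if TeachersList implements equals/hashCode (and is Serializable), since it is now being used as a key.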

Related

explode a Spark array column to multiple columns in Spark SQL

I have a column whose type, Value, is defined like below:
val Value: ArrayType = ArrayType(
  new StructType()
    .add("unit", StringType)
    .add("value", StringType)
)
and data like this
[[unit1, 25], [unit2, 77]]
[[unit2, 100], [unit1, 40]]
[[unit2, 88]]
[[unit1, 33]]
I know Spark SQL can use functions.explode to turn the data into multiple rows, but what I want is to explode it into multiple columns (or keep the one column but always with two items, even for rows that originally have only one item).
So the end result looks like below:
unit1   unit2
25      77
40      100
value1  88
33      value2
How could I achieve this?
Addition after the initial post and update:
I want to get a result like this (this is closer to my final goal):
transformed-column
[[unit1, 25], [unit2, 77]]
[[unit2, 104], [unit1, 40]]
[[unit1, value1], [unit2, 88]]
[[unit1, 33],[unit2,value2]]
where value1 is the result of applying some kind of map/conversion function using the [unit2, 88] entry; similarly, value2 is the result of applying the same map/conversion function using the [unit1, 33] entry.
I solved this problem using map_from_entries as suggested by @jxc, and then used a UDF to convert the one-item map into a two-item map, using business logic to convert between the two units.
One thing to note: the map returned by map_from_entries is a Scala map, so if you use Java you need to make sure the UDF method takes a Scala map.
P.S. Maybe I did not need map_from_entries at all; instead I might have been able to make the UDF take the array of structs directly.
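For the column-per-unit variant of the question, a minimal sketch of the map_from_entries step in the Java API might look like this; df and the output column names are assumptions taken from the question, and the missing-unit cells simply come out as null (producing the converted value1/value2 would still need the UDF described above):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Turn array<struct<unit,value>> into map<unit,value>, then look up each unit by name.
Dataset<Row> exploded = df
    .withColumn("m", map_from_entries(col("Value")))
    .select(
        col("m").getItem("unit1").alias("unit1"),
        col("m").getItem("unit2").alias("unit2"));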

Spring - jdbcTemplate failed to get Pair object

Can I get a Pair as the output from jdbcTemplate? I tried the following (which works for separate Integers):
Pair<Integer, Integer> result = jdbcTemplate.queryForObject(GET_PAIR, new Object[]{}, Pair.class);
But it throws an exception:
org.springframework.jdbc.IncorrectResultSetColumnCountException: Incorrect column count: expected 1, actual 2
at org.springframework.jdbc.core.SingleColumnRowMapper.mapRow(SingleColumnRowMapper.java:92)
at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:93)
at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:60)
at org.springframework.jdbc.core.JdbcTemplate$1.doInPreparedStatement(JdbcTemplate.java:703)
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:639)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:690)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:722)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:732)
at org.springframework.jdbc.core.JdbcTemplate.queryForObject(JdbcTemplate.java:800)
I tried with org.apache.commons.lang3.tuple.Pair.
queryForObject requires exactly one result. So when you get an EmptyResultDataAccessException it means the query didn't find anything.
However, I still don't think it will work, even if you get a result. A better way is to use a RowMapper:
jdbcTemplate.query(GET_PAIR, (rs, i) -> Pair.of(rs.getInt(1), rs.getInt(2)))
which will allow you to map the columns to a Pair (this returns a list, with one entry per row).
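Put together, a sketch of the full call might look like the following; it assumes GET_PAIR selects exactly two integer columns, and requiredSingleResult is just one way to enforce a single row:

import java.util.List;
import org.apache.commons.lang3.tuple.Pair;
import org.springframework.dao.support.DataAccessUtils;

// Map each row's two integer columns to a Pair.
List<Pair<Integer, Integer>> pairs = jdbcTemplate.query(
    GET_PAIR,
    (rs, rowNum) -> Pair.of(rs.getInt(1), rs.getInt(2)));

// If exactly one row is expected, fail loudly otherwise.
Pair<Integer, Integer> result = DataAccessUtils.requiredSingleResult(pairs);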

JavaPairRDD to Dataset<Row> in SPARK

I have data in a JavaPairRDD in the format
JavaPairRDD<Tuple2<String, Tuple2<String,String>>>
I tried using the below code:
Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 =
    Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
Dataset<Row> userViolationsDetails =
    spark.createDataset(JavaPairRDD.toRDD(MY_RDD), encoder2).toDF("value1", "value2");
But how do I generate a Dataset with 3 columns? The output of the above code gives me data in 2 columns. Any pointers / suggestions?
Try running printSchema - you will see that value2 is a complex type.
Having such information, you can write:
Dataset<Row> uvd = userViolationsDetails.selectExpr("value1", "value2._1 as value2", "value2._2 as value3")
value2._1 means the first element of the tuple inside the current "value2" field. We overwrite the value2 field so that it holds one value only.
Note that this will only work after https://issues.apache.org/jira/browse/SPARK-24548 is merged into the master branch. Currently there is a bug in Spark where a tuple is converted to a struct with two fields both named value.
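Until then, one workaround (not from the answer above, just a sketch assuming MY_RDD is the pair RDD from the question) is to flatten the nested tuple in the RDD itself, so the Dataset starts out with three top-level columns:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import scala.Tuple3;

// Flatten (k, (a, b)) into (k, a, b) before building the Dataset.
JavaRDD<Tuple3<String, String, String>> flat =
    MY_RDD.map(t -> new Tuple3<>(t._1(), t._2()._1(), t._2()._2()));

Encoder<Tuple3<String, String, String>> encoder3 =
    Encoders.tuple(Encoders.STRING(), Encoders.STRING(), Encoders.STRING());

Dataset<Row> userViolationsDetails =
    spark.createDataset(flat.rdd(), encoder3).toDF("value1", "value2", "value3");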

How can I extract the values that don't match when joining two RDD's in Spark?

I have two RDDs that look like this:
rdd1 = [(12, abcd, lmno), (45, wxyz, rstw), (67, asdf, wert)]
rdd2 = [(12, abcd, lmno), (87, whsh, jnmk), (45, wxyz, rstw)]
I need to create a new RDD that has all the values found in rdd2 that don't have corresponding matches in rdd1. So the created RDD should contain the following data:
rdd3 = [(87, whsh, jnmk)]
Does anyone know how to accomplish this?
You can do a full outer join and then create 2 new RDDs:
Select where both tables had records
Select where the rdd2 key is present and the rdd1 side is null
You'll first need to convert them to key-value RDDs. Sample code below:
rdd3 = rdd1.fullOuterJoin(rdd2).filter(x => x._2._1.isEmpty && x._2._2.isDefined).map(x => (x._1, x._2._2.get))
(Yes, there is a more idiomatic way to handle the Options, but this should work.)
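In the Java API, the same idea might look like the sketch below; it assumes rdd1 and rdd2 are JavaRDDs of scala.Tuple3 values keyed by their first field (the variable names are made up):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;
import scala.Tuple3;

// Key both RDDs by the first field, full-outer-join, and keep the rows
// that only appear on the rdd2 side.
JavaPairRDD<Integer, Tuple3<Integer, String, String>> kv1 =
    rdd1.mapToPair(t -> new Tuple2<>(t._1(), t));
JavaPairRDD<Integer, Tuple3<Integer, String, String>> kv2 =
    rdd2.mapToPair(t -> new Tuple2<>(t._1(), t));

JavaRDD<Tuple3<Integer, String, String>> rdd3 =
    kv2.fullOuterJoin(kv1)
       .filter(e -> e._2()._1().isPresent() && !e._2()._2().isPresent())
       .map(e -> e._2()._1().get());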

Spark: After CollectAsMap() or Collect(), every entry has same value

I need to read a text file and turn it into a Map.
When I build the JavaPairRDD, it works well.
However, when I convert the JavaPairRDD to a Map, every entry has the same value, specifically the last record in the text file.
inputFile:
1 book:3000
2 pencil:1000
3 coffee:2500
When reading the text file, I used a custom Hadoop input format.
With this format, the key is the number and the value is the custom class Expense<content, price>.
JavaPairRDD<Integer,Expense> inputRDD = JavaSparkContext.newAPIHadoopFile(inputFile, ExpenseInputFormat.class, Integer.class, Expense.class, HadoopConf);
inputRDD:
[1, (book,3000)]
[2, (pencil,1000)]
[3, (coffee,2500)]
However, when I do
Map<Integer,Expense> inputMap = new HashMap<Integer,Expense>(inputRDD.collectAsMap());
inputMap:
[1, (coffee,2500)]
[2, (coffee,2500)]
[3, (coffee,2500)]
As we can see, the keys are correctly inserted, but every value is the last value in the input.
I don't know why this happens.
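A likely cause (not confirmed in the question): Hadoop RecordReaders reuse the same key/value objects for every record, and Spark's Hadoop-based RDD methods document that such records must be copied before being cached, collected, or aggregated; otherwise many references point at one reused object. A sketch of that fix follows; the Expense copy via constructor and getters is an assumption about the custom class:

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Copy each (reused) Expense object into a fresh instance before collecting.
JavaPairRDD<Integer, Expense> copied = inputRDD.mapToPair(t ->
    new Tuple2<>(t._1(), new Expense(t._2().getContent(), t._2().getPrice())));

Map<Integer, Expense> inputMap = new HashMap<>(copied.collectAsMap());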
