How to generate sequence of numbers for every task in spark - java

I am using the code below to map some data in Spark. I need a unique sequential number to be generated for every task while mapping it to a pair RDD. I tried using accumulators, but I learned from the exceptions that reading the value from an accumulator is not possible inside a task. Please help me with this, as I am very new to Spark and have no idea how to solve it.
Accumulator<Integer> uniqueIdAccumulator = context.getJavaSparkContext().accumulator(0, "uniqueId");
JavaPairRDD<String, String> rdd1 = javaPairRdd.mapToPair(f -> {
return new Tuple2<String, String>(f._1, this.getMessageString(f._2, null, uniqueIdAccumulator.value()));
});

JavaPairRDD<String, String> rdd1 = javaPairRdd.zipWithIndex().mapToPair(f -> {
return new Tuple2<>(f._1._1, this.getMessageString(f._1._2, null, f._2));
});
There is no need for an accumulator here; zipWithIndex solved the problem. zipWithIndex returns an RDD that pairs each existing tuple with a Long index, and I used that index as the unique sequence number.
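For intuition about why zipWithIndex can hand out unique, consecutive numbers without any shared counter, here is a plain-Java sketch of the idea (no Spark required): count the elements of each partition first, turn the counts into starting offsets, then number each element as its partition's offset plus its position inside the partition. The class name ZipWithIndexSketch and the list-of-lists model of partitions are assumptions made for illustration; this is not Spark's actual implementation.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ZipWithIndexSketch {

    // Pairs every element with a global index, given data already split into partitions.
    static <T> List<Map.Entry<T, Long>> zipWithIndex(List<List<T>> partitions) {
        // Step 1: the starting offset of a partition is the total size of all partitions before it.
        long[] offsets = new long[partitions.size()];
        long running = 0;
        for (int p = 0; p < partitions.size(); p++) {
            offsets[p] = running;
            running += partitions.get(p).size();
        }
        // Step 2: each partition can now number its own elements independently of the others.
        List<Map.Entry<T, Long>> result = new ArrayList<>();
        for (int p = 0; p < partitions.size(); p++) {
            List<T> partition = partitions.get(p);
            for (int i = 0; i < partition.size(); i++) {
                result.add(new SimpleEntry<>(partition.get(i), offsets[p] + i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Three hypothetical partitions holding six records in total.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e", "f"));
        for (Map.Entry<String, Long> entry : zipWithIndex(partitions)) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Because the offsets are computed up front, no task ever needs to read a value that another task is still updating, which is exactly what the accumulator approach ran into.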


Spring - jdbcTemplate failed to get Pair object

Can I get a Pair as the output of jdbcTemplate? I tried the following (which works for single Integers):
Pair<Integer, Integer> result = jdbcTemplate.queryForObject(GET_PAIR, new Object[]{}, Pair.class);
But it throws an exception:
org.springframework.jdbc.IncorrectResultSetColumnCountException: Incorrect column count: expected 1, actual 2
at org.springframework.jdbc.core.SingleColumnRowMapper.mapRow(
at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(
at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(
at org.springframework.jdbc.core.JdbcTemplate$1.doInPreparedStatement(
at org.springframework.jdbc.core.JdbcTemplate.execute(
at org.springframework.jdbc.core.JdbcTemplate.query(
at org.springframework.jdbc.core.JdbcTemplate.query(
at org.springframework.jdbc.core.JdbcTemplate.query(
at org.springframework.jdbc.core.JdbcTemplate.queryForObject(
Tried with org.apache.commons.lang3.tuple.Pair
queryForObject requires exactly one result row. So when you get an EmptyResultDataAccessException, it means that the query didn't find anything.
However, I don't think it will work even if you get a result: as the stack trace shows, passing a class to queryForObject uses SingleColumnRowMapper, which expects exactly one column, while your query returns two. A better way is to use a RowMapper:
List<Pair<Integer, Integer>> pairs = jdbcTemplate.query(GET_PAIR, (rs, i) -> Pair.of(rs.getInt(1), rs.getInt(2)));
This maps the columns of each row to a Pair and returns a list with one element per row.

How to access the entries in every row and apply custom functions?

My input is a Kafka stream with a single value, which is comma-separated. It looks like this.
I have already split the dataset, so that I have the following structured stream:
Dataset<Row> words = df
.selectExpr("CAST (value AS STRING)")
.withColumn("id", split(col("value"), ",").getItem(0))
.withColumn("country", split(col("value"), ",").getItem(1))
.withColumn("timestamp", split(col("value"), ",").getItem(2));
+----+-------+----------+
|id  |country|timestamp |
+----+-------+----------+
|2922|de     |1231231232|
|4195|de     |1231232424|
|6796|fr     |1232412323|
+----+-------+----------+
Now I have a dataset with three columns, and I want to use the entries of each row in a custom function, e.g.:
Dataset<String> -> {
// do something with every entry of each row, e.g.
Person person = new Person(id, country, timestamp);
String name = person.getName();
return name;
}
In the end I want to write the result out again as a comma-separated String.
A DataFrame has a schema, so you can't just call a map function on it without defining the new schema.
You can either convert it to an RDD and use map, or use the DataFrame's map with an encoder.
Another option, I think, is Spark SQL with user-defined functions; you can read about it in the documentation.
If your use case is really as simple as you show, you can do something like:
var nameRdd = words.rdd.map(x => {f(x)})
which seems to be all you need.
If you still want a DataFrame, you can use something like:
val schema = StructType(Seq[StructField](StructField(dataType = StringType, name = s"name")))
val rddToDf = nameRdd.map(name => Row.apply(name))
val df = sparkSession.createDataFrame(rddToDf, schema)
P.S. A DataFrame is just a Dataset[Row].
If you have a custom function that cannot be built by composing functions from the existing Spark API[1], then you can either drop down to the RDD level (as @Ilya suggested) or use a UDF[2].
Typically I try to use the Spark API functions on a DataFrame whenever possible, as they are generally the best optimized.
If that's not possible, I construct a UDF:
import org.apache.spark.sql.functions.{col, udf}
val squared = udf((s: Long) => s * s)
display(spark.range(1, 20).select(squared(col("id")) as "id_squared"))
In your case you need to pass multiple columns to your UDF. You can pass them as comma-separated arguments, e.g. squared(col("col_a"), col("col_b")) for a UDF that is defined with two parameters.
Since you are writing your UDF in Scala, it should be pretty efficient. Keep in mind that if you use Python, there will generally be extra latency due to data movement between the JVM and Python.

Can not modify value in JavaRDD

I have a question about how to update JavaRDD values.
I have a JavaRDD<CostedEventMessage> whose message objects contain information about which partition of a Kafka topic they should be written to.
I'm trying to change the partitionId field of such objects using the following code:
rddToKafka = rddToKafka.map(event -> repartitionEvent(event, numPartitions));
where the repartitionEvent logic is:
return costedEventMessage;
But the modification does not happen.
Could you please advise why, and how to correctly modify values in a JavaRDD?
Spark is lazy, so from the code you pasted above it's not clear whether you actually performed any action on the JavaRDD (like collect or foreach), or how you came to the conclusion that the data was not changed.
For example, if you assumed that by running the following code:
List<CostedEventMessage> messagesLst = ...;
JavaRDD<CostedEventMessage> rddToKafka = javaSparkContext.parallelize(messagesLst);
rddToKafka = rddToKafka.map(event -> repartitionEvent(event, numPartitions));
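each element in messagesLst would have its partition set to 1, that assumption is wrong: the transformation is lazy and returns a new RDD instead of updating the original list.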
That would hold true if you added for example:
messagesLst = rddToKafka.collect();
For more details, refer to the Spark documentation.

JavaPairRDD to Dataset<Row> in SPARK

I have data in a JavaPairRDD of the form
JavaPairRDD<String, Tuple2<String, String>>
I tried using the code below:
Encoder<Tuple2<String, Tuple2<String,String>>> encoder2 =
Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(),Encoders.STRING()));
Dataset<Row> userViolationsDetails = spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2");
But how do I generate a Dataset with 3 columns? The code above gives me the data in 2 columns. Any pointers or suggestions?
Try running printSchema: you will see that value2 is a complex type.
Knowing that, you can write:
Dataset<Row> uvd = userViolationsDetails.selectExpr("value1", "value2._1 as value2", "value2._2 as value3");
value2._1 means the first element of the tuple inside the current "value2" field. We overwrite the value2 field so that it holds a single value.
Note that this will only work once the corresponding fix is merged into the master branch. Currently there is a bug in Spark, and the tuple is converted to a struct with two fields that are both named value.

HashMap: Find next lower key

I am currently storing marshalling libraries for different client versions in a HashMap.
The libs are loaded using the org.reflections API. For simplicity's sake I'll just insert a few values here by hand. They are intentionally unordered, because I have no influence on the order in which the reflections API initializes the map on start-up.
The keys (ClientVersion) are enums.
HashMap<ClientVersion, IMarshalLib> MAP = new HashMap<>();
MAP.put(ClientVersion.V100, new MarshalLib100());
MAP.put(ClientVersion.V110, new MarshalLib110());
MAP.put(ClientVersion.V102, new MarshalLib102());
MAP.put(ClientVersion.V101, new MarshalLib101());
MAP.put(ClientVersion.V150, new MarshalLib150());
All well and good so far. The problem is that there are client versions out there for which the marshalling did not change since the previous version.
Let's say, we have a client version ClientVersion.V140. In this particular case I am looking for MarshalLib110, assigned to ClientVersion.V110.
How would I get the desired result (without iterating through all entries and grabbing "the next lower" value each time)?
Thanks in advance!
How would I get the desired result (without iterating through all entries and grabbing "the next lower" value each time)
There is nothing you can do about "iterating through all entries" part: since the map is unordered, there is no way of finding the next smaller item without iterating the entire set of keys.
However, there is something you can do about the "each time" part: if you make a copy of this map into a TreeMap, you would be able to look up the next smaller item by calling the floorEntry method.
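A minimal sketch of the TreeMap approach, under two assumptions that are not in the question: the enum constants are declared in ascending version order (enums compare by declaration order), and plain strings stand in for the IMarshalLib instances. The class and method names (FloorLookupSketch, unorderedLibs, libFor) are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class FloorLookupSketch {

    // Assumption: constants are declared in ascending version order,
    // because enums are compared by their declaration order (ordinal).
    enum ClientVersion { V100, V101, V102, V110, V140, V150 }

    // Builds the unordered map as in the question; strings stand in for IMarshalLib.
    static Map<ClientVersion, String> unorderedLibs() {
        Map<ClientVersion, String> map = new HashMap<>();
        map.put(ClientVersion.V100, "MarshalLib100");
        map.put(ClientVersion.V110, "MarshalLib110");
        map.put(ClientVersion.V102, "MarshalLib102");
        map.put(ClientVersion.V101, "MarshalLib101");
        map.put(ClientVersion.V150, "MarshalLib150");
        return map;
    }

    // Returns the library registered for the version itself or, failing that,
    // for the next lower version; null if no such entry exists.
    static String libFor(TreeMap<ClientVersion, String> sorted, ClientVersion version) {
        Map.Entry<ClientVersion, String> entry = sorted.floorEntry(version);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        // One-time sorted copy of the hash map; every later lookup is O(log n).
        TreeMap<ClientVersion, String> sorted = new TreeMap<>(unorderedLibs());
        System.out.println(libFor(sorted, ClientVersion.V140)); // MarshalLib110
        System.out.println(libFor(sorted, ClientVersion.V150)); // MarshalLib150
    }
}
```

The copy only has to be made once after the reflections API has finished filling the hash map, so the full iteration happens at start-up rather than on each lookup.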
Another alternative is to copy the keys into an array on the side, sort the array, and run a binary search each time that you need to look up the next smaller key. With the key in hand, you can look up the entry in your hash map.
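The sorted-array alternative could look roughly like this. It is a sketch that uses integer keys in place of the ClientVersion enum, and the helper name floorKey is hypothetical. Arrays.binarySearch returns a negative value encoding the insertion point when the key is absent, which is what makes the "next lower" lookup possible.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class BinarySearchFloorSketch {

    // Returns the largest key <= target from a sorted array, or null if there is none.
    static Integer floorKey(int[] sortedKeys, int target) {
        int pos = Arrays.binarySearch(sortedKeys, target);
        if (pos >= 0) {
            return sortedKeys[pos]; // exact match
        }
        // For a missing key, binarySearch returns -(insertionPoint) - 1.
        int insertionPoint = -pos - 1; // index of the first key greater than target
        return insertionPoint == 0 ? null : sortedKeys[insertionPoint - 1];
    }

    public static void main(String[] args) {
        Map<Integer, String> map = new HashMap<>();
        map.put(100, "MarshalLib100");
        map.put(110, "MarshalLib110");
        map.put(102, "MarshalLib102");
        map.put(101, "MarshalLib101");
        map.put(150, "MarshalLib150");

        // Copy the keys into an array on the side and sort it once.
        int[] keys = map.keySet().stream().mapToInt(Integer::intValue).sorted().toArray();

        // Each lookup is a binary search followed by a regular hash map access.
        Integer key = floorKey(keys, 140);
        System.out.println(key + " -> " + map.get(key)); // 110 -> MarshalLib110
    }
}
```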
I recommend using a NavigableSet. Look at this example:
HashMap<Integer, String> map = new HashMap<>();
map.put(100, "MarshalLib100");
map.put(110, "MarshalLib110");
map.put(102, "MarshalLib102");
map.put(101, "MarshalLib101");
map.put(150, "MarshalLib150");
NavigableSet<Integer> set = new TreeSet<>(map.keySet());
Integer key = set.lower(150); // ^ -> 110
String val = map.get(key); // ^ -> MarshalLib110
// or
key = set.higher(110);// ^ -> 150
val = map.get(key); // ^ -> MarshalLib150
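One detail worth checking against the use case in the question: lower returns the strictly smaller key, so for a version that does have its own entry it would skip that entry. If an exact match should win, floor may be the better fit. The following sketch contrasts the two; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

public class LowerVsFloorSketch {

    // Strictly smaller key, or null if there is none.
    static Integer strictlyLower(NavigableSet<Integer> keys, int version) {
        return keys.lower(version);
    }

    // The key itself if present, otherwise the next smaller one, or null if there is none.
    static Integer sameOrLower(NavigableSet<Integer> keys, int version) {
        return keys.floor(version);
    }

    public static void main(String[] args) {
        NavigableSet<Integer> keys = new TreeSet<>(Arrays.asList(100, 110, 102, 101, 150));
        // Version 140 has no entry of its own: both methods agree.
        System.out.println(strictlyLower(keys, 140)); // 110
        System.out.println(sameOrLower(keys, 140));   // 110
        // Version 110 has an entry: lower skips it, floor returns it.
        System.out.println(strictlyLower(keys, 110)); // 102
        System.out.println(sameOrLower(keys, 110));   // 110
    }
}
```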
Update: Using a TreeMap to find the next lower key is not really correct. Example:
TreeMap<Integer, String> treeMap = new TreeMap<Integer, String>();
treeMap.put(100, "MarshalLib100");
treeMap.put(110, "MarshalLib110");
treeMap.put(102, "MarshalLib102");
treeMap.put(101, "MarshalLib101");
treeMap.put(150, "MarshalLib150");
