How to calculate difference between current and previous row in Spark JavaRDD - java

I parsed a .log file into a JavaRDD and sorted it, so now I have, for example, oldJavaRDD:
2016-03-28 | 11:00 | X | object1 | region1
2016-03-28 | 11:01 | Y | object1 | region1
2016-03-28 | 11:05 | X | object1 | region1
2016-03-28 | 11:09 | X | object1 | region1
2016-03-28 | 11:00 | X | object2 | region1
2016-03-28 | 11:01 | Z | object2 | region1
How can I get a newJavaRDD to save to the DB?
The new JavaRDD structure has to be:
2016-03-28 | 9 | object1 | region1
2016-03-28 | 1 | object2 | region1
So I have to calculate the time between the current and previous row (and in some cases use the X, Y, Z flag to decide whether to add that time to the result), and add a new element to the JavaRDD whenever the date, object name, or region changes.
I can do it with code like this (using map), but I don't think it's a good or fast approach:
JavaRDD<NewObject> newJavaRDD = oldJavaRDD.map { r ->
    String datePrev[] = ...
    if (datePrev != dateCurr ...) {
        return newJavaRdd;
    } else {
        return null;
    }
}

First, your code example references newJavaRDD from within a transformation that creates newJavaRDD - that's impossible on a few different levels:
You can't reference a variable on the right-hand-side of that variable's declaration...
You can't use an RDD within a transformation on an RDD (same one or another one - that doesn't matter) - anything inside a transformation must be serialized by Spark, and Spark can't serialize its own RDDs (that would make no sense)
So, how should you do that?
Assuming:
Your intention here is to get a single record for each combination of date + object + region
There shouldn't be too many records for each such combination, so it's safe to group by these fields as the key
You can group by the key fields and then mapValues to get the "minute distance" between the first and last record (the function passed to mapValues can be changed to contain your exact logic if I didn't get it right). I'll use the Joda-Time library for the time calculations:
// needs Guava's Lists, Joda-Time (LocalDateTime, Period, DateTimeFormat) and Spark's Java API on the classpath
public static void main(String[] args) {
    // some setup code for this test:
    JavaSparkContext sc = new JavaSparkContext("local", "test");

    // input:
    final JavaRDD<String[]> input = sc.parallelize(Lists.newArrayList(
            //           date          time    flag  object      region
            new String[]{"2016-03-28", "11:00", "X", "object1", "region1"},
            new String[]{"2016-03-28", "11:01", "Y", "object1", "region1"},
            new String[]{"2016-03-28", "11:05", "X", "object1", "region1"},
            new String[]{"2016-03-28", "11:09", "X", "object1", "region1"},
            new String[]{"2016-03-28", "11:00", "X", "object2", "region1"},
            new String[]{"2016-03-28", "11:01", "Z", "object2", "region1"}
    ));

    // grouping by key (date + object + region):
    final JavaPairRDD<String, Iterable<String[]>> byObjectAndDate = input.groupBy(new Function<String[], String>() {
        @Override
        public String call(String[] record) throws Exception {
            return record[0] + record[3] + record[4]; // date, object, region
        }
    });

    // mapping each "value" (all records matching a key) to the result record:
    final JavaRDD<String[]> result = byObjectAndDate.mapValues(new Function<Iterable<String[]>, String[]>() {
        @Override
        public String[] call(Iterable<String[]> records) throws Exception {
            final Iterator<String[]> iterator = records.iterator();
            String[] previousRecord = iterator.next();
            int diffMinutes = 0;

            for (String[] record : records) {
                if (record[2].equals("X")) { // if I got your intention right...
                    final LocalDateTime prev = getLocalDateTime(previousRecord);
                    final LocalDateTime curr = getLocalDateTime(record);
                    diffMinutes += Period.fieldDifference(prev, curr).toStandardMinutes().getMinutes();
                }
                previousRecord = record;
            }

            return new String[]{
                    previousRecord[0],             // date
                    Integer.toString(diffMinutes), // accumulated minutes
                    previousRecord[3],             // object
                    previousRecord[4]              // region
            };
        }
    }).values();

    // do whatever with "result"...
}

// extracts a Joda LocalDateTime from a "record"
static LocalDateTime getLocalDateTime(String[] record) {
    return LocalDateTime.parse(record[0] + " " + record[1], formatter);
}

static final DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm");
P.S. In Scala this would take about 8 lines... :/

Related

Scala - how to format a collection into a String?

I'm trying to parse Metrics data into a formatted String so that there is a header and each record below starts on a new line. Initially I wanted to get something close to a table formatting like this:
Id | Name | Rate | Value
1L | Name1 | 1 | value_1
2L | Name2 | 2 | value_2
3L | Name3 | 3 | value_3
But my current implementation results in the following Error:
java.util.MissingFormatArgumentException: Format specifier '%-70s'
What should I change in my code to get it formatted correctly?
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

case class BaseMetric(val id: Long,
                      val name: String,
                      val rate: String,
                      val value: String,
                      val count: Long,
                      val isValid: Boolean
                     ) {
  def makeCustomMetric: String = Seq(id, name, rate, value).mkString("\t")
}

val metric1 = new BaseMetric(1L, "Name1", "1", "value_1", 10L, true)
val metric2 = new BaseMetric(2L, "Name2", "2", "value_2", 20L, false)
val metric3 = new BaseMetric(3L, "Name3", "3", "value_3", 30L, true)

val metrics = Seq(metric1, metric1, metric1)

def formatMetrics(metrics: Seq[BaseMetric]): String = {
  val pattern = "%-50s | %-70s | %-55s | %-65s | %f"
  val formattedMetrics: String = pattern.format(metrics.map(_.makeCustomMetric))
    .mkString("Id | Name | Rate | Value\n", "\n", "\nId | Name | Rate | Value")
  formattedMetrics
}

val metricsString = formatMetrics(metrics)
The specific error is due to the fact that you pass a Seq[String] to format, which expects Any*. You only pass one parameter instead of the five the pattern declares. The error says it can't find an argument for your second format specifier.
You want to apply the pattern to every metric, not all the metrics to the pattern.
The paddings in the format string are too big for what you want to achieve.
val pattern = "%-2s | %-5s | %-4s | %-6s"
metrics.map(m => pattern.format(m.makeCustomMetric: _*))
.mkString("Id | Name | Rate | Value\n", "\n", "\nId | Name | Rate | Value")
The _* tells the compiler that you want to pass a sequence as a variable-length argument list.
makeCustomMetric should then return the sequence itself instead of a string:
def makeCustomMetric: Seq[Any] = Seq(id, name, rate, value)
Scala string interpolation is an efficient, idiomatic way to concatenate/format strings.
Reference: https://docs.scala-lang.org/overviews/core/string-interpolation.html
s"id: $id ,name: $name ,rate: $rate ,value: $value ,count: $count, isValid: $isValid"

How to put a list of data table in a list of objects

I have a data table in my feature file which I want to convert to a list of objects. The problem is that the data table has headers, which are supposed to be set as the values of the objects' fields. As an example:
| ANNOTATION_TYPE_ID | ANNOTATION_SUBTYPE_ID | PAGE_NB | LEFT_NB | TOP_NB | WIDTH_NB | HEIGHT_NB | FONTSIZE_NB | COLOR_X | ANNOTATION_TEXT_X |
| 1 | 1 | 1 | 400 | 200 | 88 | 38 | 15 | FFFFFF | TEST Annotation |
| 2 | 2 | 1 | 150 | 150 | 88 | 38 | 20 | FFFFF0 | TEST Annotation |
I want to convert this into a list of objects, List<Annotation>, where Annotation is a class and the headers of the data table above are essentially the field variables inside the class.
What is an efficient way to do this?
The moment I convert the data table to a list (List<String> annotationList = annotation.asList(String.class)), it becomes one big flat list, and how to group the values back into rows is what I am struggling with.
One approach would be to look at this as a list of annotations, with each annotation having a set of key/value pairs based on each row in your file. It would look like a List of HashMaps, where each HashMap key is a column header and the value is that row's value. This may not be the most efficient approach depending on your usage. Here's sample code that parses the data you provided - it produces a List with two items, each a HashMap holding one key/value per column above. Good luck.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;

public class AnnotationParser {

    public static void main(String[] args) {
        Path filePath = Path.of("C:\\tmp\\anno");
        if (!Files.exists(filePath)) {
            System.out.println("File does not exist at '" + filePath + "'");
            System.exit(1);
        }
        List<HashMap<String, String>> annotations = new ArrayList<>();
        try {
            List<String> annoFile = Files.readAllLines(filePath);
            // first line holds the column headers
            List<String> headers = Arrays.asList(annoFile.remove(0).split("\\|"));
            headers.forEach(System.out::println);
            while (annoFile.size() > 0) {
                List<String> rowValues = Arrays.asList(annoFile.remove(0).split("\\|"));
                HashMap<String, String> annotation = new HashMap<>();
                for (int i = 0; i < headers.size(); i++) {
                    if (rowValues.size() > i) {
                        annotation.put(headers.get(i).strip(), rowValues.get(i).strip());
                    }
                }
                annotations.add(annotation);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
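Since the original question asks for a List of Annotation objects rather than maps, here is a minimal follow-up sketch of that last step. The Annotation class, its field names, and the fromMap/fromMaps helpers are assumptions for illustration; only the column header names come from the question's data table.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical target class; only a few of the columns are mapped here.
class Annotation {
    int annotationTypeId;
    int pageNb;
    String color;
    String annotationText;

    // Builds an Annotation from one parsed row (header -> value),
    // using the header names from the question's data table.
    static Annotation fromMap(Map<String, String> row) {
        Annotation a = new Annotation();
        a.annotationTypeId = Integer.parseInt(row.get("ANNOTATION_TYPE_ID"));
        a.pageNb = Integer.parseInt(row.get("PAGE_NB"));
        a.color = row.get("COLOR_X");
        a.annotationText = row.get("ANNOTATION_TEXT_X");
        return a;
    }

    // Usage with the "annotations" list built in main above:
    static List<Annotation> fromMaps(List<? extends Map<String, String>> rows) {
        return rows.stream().map(Annotation::fromMap).collect(Collectors.toList());
    }
}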

Spark and non-denormalized tables

I know Spark works much better with denormalized tables, where all the needed data is in one row. I am wondering whether, when that is not the case, there is a way to retrieve data from previous or next rows.
Example:
Formula:
value = (value from 2 years ago) + (current year value) / (value from 2 years ahead)
Table
+-------+-----+
| YEAR|VALUE|
+-------+-----+
| 2015| 100 |
| 2016| 34 |
| 2017| 32 |
| 2018| 22 |
| 2019| 14 |
| 2020| 42 |
| 2021| 88 |
+-------+-----+
Dataset<Row> dataset = ...

Dataset<Result> results = dataset.map(row -> {
    int currentValue = Integer.valueOf(row.getAs("VALUE")); // 2019
    // nonsense code just to exemplify
    int twoYearsBackValue = Integer.valueOf(row[???].getAs("VALUE")); // 2017
    int twoYearsAheadValue = Integer.valueOf(row[???].getAs("VALUE")); // 2021
    double resultValue = twoYearsBackValue + currentValue / twoYearsAheadValue;
    return new Result(2019, resultValue);
});

Result[] resultArray = results.collect();
Is it possible to grab these values (that belong to other rows) without changing the table format (no denormalization, no pivots, ...) and without collecting the data, or does that go totally against Spark/big data principles?
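No answer is attached to this question in the thread, but for reference, Spark SQL window functions can usually express this kind of "previous/next row" lookup without denormalizing or collecting. Below is a minimal sketch (not from the question) using lag/lead over a window ordered by YEAR; the column names come from the example table, everything else is an assumption.
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lag;
import static org.apache.spark.sql.functions.lead;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public class NeighbourRowsExample {
    // dataset is the YEAR/VALUE table from the question
    static Dataset<Row> withFormula(Dataset<Row> dataset) {
        // order by YEAR so "2 years back/ahead" becomes 2 rows back/ahead;
        // note: no partitionBy here, so Spark will warn that all rows move to one partition
        WindowSpec byYear = Window.orderBy(col("YEAR"));
        return dataset
                .withColumn("TWO_BACK", lag(col("VALUE"), 2).over(byYear))
                .withColumn("TWO_AHEAD", lead(col("VALUE"), 2).over(byYear))
                // value = (value from 2 years ago) + (current year value) / (value from 2 years ahead)
                .withColumn("RESULT",
                        col("TWO_BACK").plus(col("VALUE").divide(col("TWO_AHEAD"))));
    }
}
Rows at the edges (the first and last two years) get null for the missing neighbour, which mirrors the fact that the formula is undefined there.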

How to iterate grouped data in spark?

I have a dataset like this:
uid  group_a  group_b
1    3        unknown
1    unknown  4
2    unknown  3
2    2        unknown
I want to get the result:
uid  group_a  group_b
1    3        4
2    2        3
I tried to group the data by "uid", iterate over each group, and select the non-unknown value as the final value, but I don't know how to do it.
I would suggest you define a user-defined aggregate function (UDAF).
Built-in functions are great, but they are difficult to customize. If you write your own UDAF it is fully customizable, and you can edit it according to your needs.
Concerning your problem, the following could be a solution.
The first task is to define the UDAF:
class PingJiang extends UserDefinedAggregateFunction {

  def inputSchema = new StructType().add("group_a", StringType).add("group_b", StringType)
  def bufferSchema = new StructType().add("buff0", StringType).add("buff1", StringType)
  def dataType = StringType
  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, "")
    buffer.update(1, "")
  }

  // overwrite the buffer with any non-"unknown" value seen (the latest one per partition wins)
  def update(buffer: MutableAggregationBuffer, input: Row) = {
    if (!input.isNullAt(0)) {
      val groupa = input.getString(0)
      val groupb = input.getString(1)
      if (!groupa.equalsIgnoreCase("unknown")) {
        buffer.update(0, groupa)
      }
      if (!groupb.equalsIgnoreCase("unknown")) {
        buffer.update(1, groupb)
      }
    }
  }

  // keep the two buffer slots separate; only fill in a slot that is still empty
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    if (buffer1.getString(0).isEmpty) buffer1.update(0, buffer2.getString(0))
    if (buffer1.getString(1).isEmpty) buffer1.update(1, buffer2.getString(1))
  }

  // emit both values as one comma-separated string; the caller splits it again
  def evaluate(buffer: Row): String = buffer.getString(0) + "," + buffer.getString(1)
}
Then you call it from your main class and do some manipulation to get the result you need:
val data = Seq(
  (1, "3", "unknown"),
  (1, "unknown", "4"),
  (2, "unknown", "3"),
  (2, "2", "unknown"))
  .toDF("uid", "group_a", "group_b")

val udaf = new PingJiang()

val result = data.groupBy("uid").agg(udaf($"group_a", $"group_b").as("ping"))
  .withColumn("group_a", split($"ping", ",")(0))
  .withColumn("group_b", split($"ping", ",")(1))
  .drop("ping")

result.show(false)
Visit Databricks and AugmentIQ for a better understanding of UDAFs.
Note: the above solution gets you the latest known value for each group, if present (you can always edit it according to your needs).
After you convert the dataset to a PairRDD you can use the reduceByKey operation to find the single known value. The following example assumes that there is only one known value per uid, or otherwise returns the first known value:
val input = List(
  ("1", "3", "unknown"),
  ("1", "unknown", "4"),
  ("2", "unknown", "3"),
  ("2", "2", "unknown")
)

val pairRdd = sc.parallelize(input).map(l => (l._1, (l._2, l._3)))

val result = pairRdd.reduceByKey { (a, b) =>
  val groupA = if (a._1 != "unknown") a._1 else b._1
  val groupB = if (a._2 != "unknown") a._2 else b._2
  (groupA, groupB)
}
The result will be a pairRdd that looks like this
(uid, (group_a, group_b))
(1,(3,4))
(2,(2,3))
You can return to the plain line format with a simple map operation.
You could replace all "unknown" values with null and then use the function first() with ignoreNulls set to true (as shown below) to get the first non-null value in each column per group:
import org.apache.spark.sql.functions.{col, first, when}

// We are only gonna apply our function to the last 2 columns
val cols = df.columns.drop(1)
// Create expressions
val exprs = cols.map(first(_, true))
// Putting it all together
df.select(df.columns
    .map(c => when(col(c) === "unknown", null)
      .otherwise(col(c)).as(c)): _*)
  .groupBy("uid")
  .agg(exprs.head, exprs.tail: _*).show()
+---+--------------------+--------------------+
|uid|first(group_1, true)|first(group_b, true)|
+---+--------------------+--------------------+
| 1| 3| 4|
| 2| 2| 3|
+---+--------------------+--------------------+
Data:
val df = sc.parallelize(Array(("1","3","unknown"),("1","unknown","4"),
("2","unknown","3"),("2","2","unknown"))).toDF("uid","group_1","group_b")

Mongodb select all fields group by one field and sort by another field

We have a collection 'message' with the following fields
_id | messageId | chainId | createOn
1 | 1 | A | 155
2 | 2 | A | 185
3 | 3 | A | 225
4 | 4 | B | 226
5 | 5 | C | 228
6 | 6 | B | 300
We want to select all fields of the documents with the following criteria:
distinct by field 'chainId'
order (sort) by 'createOn' in descending order
so, the expected result is
_id | messageId | chainId | createOn
3 | 3 | A | 225
5 | 5 | C | 228
6 | 6 | B | 300
We are using Spring Data in our Java application. I have tried different approaches, but nothing has helped so far.
Is it possible to achieve the above with a single query?
What you want can be achieved with the aggregation framework. The basic form (which is useful to others) is:
db.collection.aggregate([
// Group by the grouping key, but keep the valid values
{ "$group": {
"_id": "$chainId",
"docId": { "$first": "$_id" },
"messageId": { "$first": "$messageId" },
"createOn": { "$first": "$createdOn" }
}},
// Then sort
{ "$sort": { "createOn": -1 } }
])
So that "groups" on the distinct values of "messageId" while taking the $first boundary values for each of the other fields. Alternately if you want the largest then use $last instead, but for either smallest or largest by row it probably makes sense to $sort first, otherwise just use $min and $max if the whole row is not important.
See the MongoDB aggregate() documentation for more information on usage, as well as the driver JavaDocs and the Spring Data MongoDB documentation for more on the aggregate method and its helpers.
Here is a solution using the MongoDB Java driver:
final MongoClient mongoClient = new MongoClient();
final DB db = mongoClient.getDB("mstreettest");
final DBCollection collection = db.getCollection("message");
final BasicDBObject groupFields = new BasicDBObject("_id", "$chainId");
groupFields.put("docId", new BasicDBObject("$first", "$_id"));
groupFields.put("messageId", new BasicDBObject("$first", "$messageId"));
groupFields.put("createOn", new BasicDBObject("$first", "$createdOn"));
final DBObject group = new BasicDBObject("$group", groupFields);
final DBObject sortFields = new BasicDBObject("createOn", -1);
final DBObject sort = new BasicDBObject("$sort", sortFields);
final DBObject projectFields = new BasicDBObject("_id", 0);
projectFields.put("_id", "$docId");
projectFields.put("messageId", "$messageId");
projectFields.put("chainId", "$_id");
projectFields.put("createOn", "$createOn");
final DBObject project = new BasicDBObject("$project", projectFields);
final AggregationOutput aggregate = collection.aggregate(group, sort, project);
and the result will be:
{ "_id" : 5 , "messageId" : 5 , "createOn" : { "$date" : "2014-04-23T04:45:45.173Z"} , "chainId" : "C"}
{ "_id" : 4 , "messageId" : 4 , "createOn" : { "$date" : "2014-04-23T04:12:25.173Z"} , "chainId" : "B"}
{ "_id" : 1 , "messageId" : 1 , "createOn" : { "$date" : "2014-04-22T08:29:05.173Z"} , "chainId" : "A"}
I tried it with Spring Data Mongo and it didn't work when I grouped by chainId; the exception was java.lang.NumberFormatException: For input string: "C".
Replace this line:
final DBObject group = new BasicDBObject("$group", groupFields);
with this one:
final DBObject group = new BasicDBObject("_id", groupFields);
Here is a solution using springframework.data.mongodb:
Aggregation aggregation = Aggregation.newAggregation(
Aggregation.group("chainId"),
Aggregation.sort(new Sort(Sort.Direction.ASC, "createdOn"))
);
AggregationResults<XxxBean> results = mongoTemplate.aggregate(aggregation, "collection_name", XxxBean.class);
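The pipeline above only groups by chainId, so the other fields and the descending order the question asks for are lost. As a sketch of a more complete Spring Data variant (field names taken from the question; XxxBean and the rest are placeholders, not tested):
// imports assumed: org.springframework.data.domain.Sort,
//                  org.springframework.data.mongodb.core.aggregation.*
Aggregation aggregation = Aggregation.newAggregation(
        // sort first so that $first picks the newest document per chain
        Aggregation.sort(Sort.Direction.DESC, "createOn"),
        Aggregation.group("chainId")
                .first("_id").as("docId")
                .first("messageId").as("messageId")
                .first("createOn").as("createOn"),
        // final ordering of the distinct chains
        Aggregation.sort(Sort.Direction.DESC, "createOn")
);
AggregationResults<XxxBean> results =
        mongoTemplate.aggregate(aggregation, "message", XxxBean.class);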
