I'm trying out Spark with Java and MongoDB, and I want to aggregate several documents into a single one based on their timestamps. For example, I want to aggregate X documents into a single one:
{
"_id" : ObjectId("598c32f455f0353f9e69ebf1"),
"_class" : "...",
"timestamp" : ISODate("2017-08-10T10:17:00.000Z"),
"value" : 10.1
}
...
{
"_id" : ObjectId("598c32f455f0353f9e69ebz2"),
"_class" : "...",
"timestamp" : ISODate("2017-08-10T10:18:00.000Z"),
"value" : 2.1
}
Let's say I have 60 documents like this whose timestamps fall within a 1-minute window (from 10:17:00 to 10:18:00), and I want to obtain one document:
{
"_id" : ObjectId("598c32f455f0353f9e69e231"),
"_class" : "...",
"start_timestamp" : ISODate("2017-08-10T10:17:00.000Z"),
"end_timestamp" : ISODate("2017-08-10T10:18:00.000Z"),
"average_value" : **average value of those documents**
}
Is it possible to perform this kind of transformation? Can I retrieve one 1-minute window of data at a time?
An approach that takes all the documents and compares their timestamps looks slow and inefficient.
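For illustration, something along these lines is what I have in mind. This is just a rough, untested sketch that uses the plain MongoDB Java driver aggregation API rather than Spark, and the connection string, database and collection names are made up:

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;

import java.util.Arrays;

public class MinuteWindowAverage {
    public static void main(String[] args) {
        MongoCollection<Document> collection = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("test")                // assumed database name
                .getCollection("measurements");     // assumed collection name

        // Group every document that falls into the same calendar minute.
        Document minuteKey = new Document()
                .append("year",   new Document("$year", "$timestamp"))
                .append("month",  new Document("$month", "$timestamp"))
                .append("day",    new Document("$dayOfMonth", "$timestamp"))
                .append("hour",   new Document("$hour", "$timestamp"))
                .append("minute", new Document("$minute", "$timestamp"));

        // min/max of the timestamps give the window bounds, avg gives the average value.
        collection.aggregate(Arrays.asList(
                Aggregates.group(minuteKey,
                        Accumulators.min("start_timestamp", "$timestamp"),
                        Accumulators.max("end_timestamp", "$timestamp"),
                        Accumulators.avg("average_value", "$value"))
        )).forEach(doc -> System.out.println(doc.toJson()));
    }
}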
Thanks in advance.
Related
I have a CSV structure like this
and I also have a JSON response
[
{
"ID" : "1",
"Name" : "abc",
"Mobile" : "123456"
},
{
"ID" : "2",
"Name" : "cde",
"Mobile" : "123345"
}
]
I need the output like this
If your intention is to convert the JSON directly, then the Baeldung solution you were given is good.
Otherwise, the way I see it, and based on the info you're giving, you will need a representation of that JSON as a Java object, which will either represent some kind of request coming from somewhere or data you're getting from your database, so that it can be written to a CSV.
Check these out; they might be useful:
https://www.codejava.net/frameworks/spring-boot/csv-export-example
https://zetcode.com/springboot/csv/
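To make the Java-object route concrete, here is a minimal sketch assuming Jackson is on the classpath; the Person class, the output file name and the field names simply mirror the JSON shown above:

import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;

// Simple POJO mirroring one entry of the JSON array above.
class Person {
    public String ID;
    public String Name;
    public String Mobile;
}

public class JsonToCsv {
    public static void main(String[] args) throws Exception {
        String json = "[{\"ID\":\"1\",\"Name\":\"abc\",\"Mobile\":\"123456\"},"
                    + "{\"ID\":\"2\",\"Name\":\"cde\",\"Mobile\":\"123345\"}]";

        // Map the JSON array onto a list of Person objects.
        ObjectMapper mapper = new ObjectMapper();
        List<Person> people = Arrays.asList(mapper.readValue(json, Person[].class));

        // Write a minimal CSV by hand; a library such as OpenCSV would work just as well.
        try (PrintWriter out = new PrintWriter(new FileWriter("people.csv"))) {
            out.println("ID,Name,Mobile");
            for (Person p : people) {
                out.println(p.ID + "," + p.Name + "," + p.Mobile);
            }
        }
    }
}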
I'm building a Java Jersey API which uses MongoDB and the MongoDB driver.
The resources should output the stored MongoDB documents as JSON to be used in the frontend project, which uses Svelte.
Due to the standard org.bson.Document.toJson() implementation, the output of my documents looks something like:
[{ "_id" : { "$oid" : "5e97f08f2175aa9174dbec0e" }, "hour" : 8, "minute" : 15, "enabled" : true, "duration" : 120 }
I would rather like it to be:
[{ "_id" : "5e97f08f2175aa9174dbec0e", "hour" : 8, "minute" : 15, "enabled" : true, "duration" : 120 }
That way it's easier to handle the id in the frontend. So how do I get rid of the $oid object?
I already managed to get the format as I wish by using:
JsonWriterSettings settings = JsonWriterSettings.builder()
.outputMode(JsonMode.RELAXED)
.objectIdConverter((value, writer) -> writer.writeString(value.toHexString()))
.build();
System.out.println(doc.toJson(settings));
But how can I register this settings object globally so that every doc.toJson() call will use it?
And what will happen if I send modified or new documents from the frontend to the API and do:
Document document = Document.parse(doc);
Is my modified _id field automatically converted back to an ObjectId? Or do I need an org.bson.codecs.Decoder or a CodecRegistry? How would this be done?
$oid refers to the ObjectId field type in the BSON spec. As far as I know, you need to manipulate your document to replace the ObjectId of your _id with a String:
String oidAsString = document.getObjectId("_id").toString();
document.put("_id", oidAsString);
I have document schema such as
{
"_id" : 18,
"name" : "Verdell Sowinski",
"scores" : [
{
"type" : "exam",
"score" : 62.12870233109035
},
{
"type" : "quiz",
"score" : 84.74586220889356
},
{
"type" : "homework",
"score" : 81.58947824932574
},
{
"type" : "homework",
"score" : 69.09840625499065
}
]
}
I have a solution using pull that copes with removing a single element at a time.
I want a general solution that copes with an irregular schema, where the array may contain anywhere from one to many elements, and I would like to remove all elements that match a condition.
I'm using MongoDB driver 3.2.2 and saw this pullByFilter, which sounded good:
Creates an update that removes from an array all elements that match the given filter.
I tried this
Bson filter = and(eq("type", "homework"), lt("score", highest));
Bson u = Updates.pullByFilter(filter);
UpdateResult ur = collection.updateOne(studentDoc, u);
Unsurprisingly, this did not have any effect, since I wasn't specifying the scores array.
I get an error
The positional operator did not find the match needed from the query. Unexpanded update: scores.$.type
when I change the filter to be
Bson filter = and(eq("scores.$.type", "homework"), lt("scores.$.score", highest));
Is there a one-step solution to this problem?
There seems to be very little info I can find on this particular method. This question may relate to How to Update Multiple Array Elements in mongodb.
After some more "thinking" (and a little trial and error), I found the correct Filters method to wrap my basic filter. I think I was focusing on array operators too much.
I'll not post it here in case of flaming.
Clue: think "matches..." (as in regex pattern matching) when dealing with Filters helper methods ;)
I'd like to confirm that a parser I wrote is working correctly. It takes a JavaScript mongodb command that could be run from the terminal and converts it to a Java object for the MongoDB/Java drivers.
Is the following .toString() result valid?
{ "NumShares " : 1 , "attr4 " : 1 , "symbol" : { "$regex" : "ESLR%"}}
This was converted from the following JavaScript
db.STOCK.find({ "symbol": "ESLR%" }, { "NumShares" : 1, "attr4" : 1 })
And of course, the data as it rests in the collections
{ "_id" : { "$oid" : "538c99e41f12e5a479269ed1"} , "symbol" : "ESLR" , "NumShares" : 3471.0}
Thanks for all your help
You've combined the query document and the projection document in that find() call into one document. That's probably not what you want. But those documents are just JSON, so you could use any parser to convert them. There are a few gotchas you'd have to deal with around ObjectIds, dates, DBRefs, and particularly regular expressions, but those can be managed without too much trouble by escaping/quoting them before parsing.
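For what that separation looks like with the Java driver, here is a small sketch (the connection string and database name are assumptions) that keeps the query and the projection as two distinct pieces, mirroring the original shell call:

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Projections.include;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class FindWithProjection {
    public static void main(String[] args) {
        MongoCollection<Document> stock = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("test")        // assumed database name
                .getCollection("STOCK");

        // db.STOCK.find({ symbol: "ESLR%" }, { NumShares: 1, attr4: 1 }) keeps the query
        // and the projection as two separate documents; the driver mirrors that split.
        stock.find(eq("symbol", "ESLR%"))
             .projection(include("NumShares", "attr4"))
             .forEach(doc -> System.out.println(doc.toJson()));
    }
}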
Situation: I have a collection with a huge amount of documents after a map-reduce (aggregation). Documents in the collection look like this:
/* 0 */
{
"_id" : {
"appId" : ObjectId("1"),
"timestamp" : ISODate("2014-04-12T00:00:00.000Z"),
"name" : "GameApp",
"user" : "test#mail.com",
"type" : "game"
},
"value" : {
"count" : 2
}
}
/* 1 */
{
"_id" : {
"appId" : ObjectId("2"),
"timestamp" : ISODate("2014-04-29T00:00:00.000Z"),
"name" : "ScannerApp",
"user" : "newUser#company.com",
"type" : "game"
},
"value" : {
"count" : 5
}
}
...
And I'm searching inside this collection with the aggregation framework:
db.myCollection.aggregate([match, project, group, sort, skip, limit]); // the aggregation can return results on a daily or monthly time base depending on the user's search criteria, with pagination etc.
Possible search criteria:
1. {appId, timestamp, name, user, type}
2. {appId, timestamp}
3. {name, user}
I'm getting the correct result, exactly what I need. But from an optimisation point of view I have doubts about indexing.
Questions:
Is it possible to create indexes for such a collection?
How can I create indexes for such an object with a complex _id field?
How can I do the analog of db.collection.find().explain() to verify which index is used?
And is it a good idea to index such a collection, or is this just performance paranoia?
Answer summarisation:
MongoDB creates an index on the _id field automatically, but that is useless in the case of a complex _id field like in the example. For a field like _id: {name: "", timestamp: ""} you must use an index like *.ensureIndex({"_id.name": 1, "_id.timestamp": 1}); only after that will your collection be indexed properly by the _id field.
For tracking how your indexes work with MongoDB aggregation you cannot use db.myCollection.aggregate().explain(); the proper way of doing it is:
db.runCommand({
aggregate: "collection_name",
pipeline: [match, proj, group, sort, skip, limit],
explain: true
})
My testing on a local computer shows that such indexing seems to be a good idea, but this requires more testing with big collections.
First, indexes 1 and 3 are probably worth investigating. As for explain, you can pass explain as an option to your pipeline. You can find docs here and an example here.
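As a rough sketch of both suggestions with the Java driver (the connection string, database name and the $match value are placeholders, and AggregateIterable.explain() assumes a reasonably recent 4.x driver):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

import java.util.Arrays;

public class IndexComplexId {
    public static void main(String[] args) {
        MongoCollection<Document> coll = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("test")            // assumed database name
                .getCollection("myCollection");

        // Compound indexes over the sub-fields of the complex _id,
        // covering search criteria 1 ({appId, timestamp, name, user, type}) and 3 ({name, user}).
        coll.createIndex(Indexes.ascending("_id.appId", "_id.timestamp", "_id.name", "_id.user", "_id.type"));
        coll.createIndex(Indexes.ascending("_id.name", "_id.user"));

        // With a recent driver the explain output of an aggregation can be fetched directly.
        Document explain = coll.aggregate(Arrays.asList(
                Aggregates.match(Filters.eq("_id.name", "GameApp"))
        )).explain();
        System.out.println(explain.toJson());
    }
}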