Execute MongoTemplate.aggregate without row retrieval - java

I'm using the Spring Mongo driver to execute a large mongo aggregation statement that will run for a period of time. The output stage of this aggregation writes the output of the aggregation into a new collection. At no point do I need to retrieve the results of this aggregation in-memory.
When I run this in Spring Boot, the JVM runs out of memory during row retrieval, even though I'm not using or storing any of the results.
Is there a way to skip row retrieval using MongoTemplate.aggregate?
Ex:
mongoTemplate.aggregate(Aggregation.newAggregation(
Aggregation.sort(new Sort(new Sort.Order(Sort.Direction.DESC, "createdOn"))),
Aggregation.group("accountId")
.first("bal").as("bal")
.first("timestamp").as("effectiveTimestamp"),
Aggregation.project("_id", "effectiveTimestamp")
.andExpression("trunc(bal * 10000 + 0.5) / 100").as("bal"),
aggregationOperationContext -> new Document("$addFields", new Document("history",Arrays.asList(historyObj))),
// Write results out to a new collection - Do not store in memory
Aggregation.out("newBalance")
).withOptions(Aggregation.newAggregationOptions().allowDiskUse(true).build()),
"account", Object.class
);

Use the AggregationOptions skipOutput() option. With it, no result is returned when the aggregation pipeline contains an $out or $merge stage.
mongoTemplate.aggregate(aggregation.withOptions(newAggregationOptions().skipOutput().allowDiskUse(true).build()), "collectionName", EntityClass.class);
If you are using the MongoDB Java driver without a framework:
MongoClient client = MongoClients.create("mongodb://localhost:27017");
MongoDatabase database = client.getDatabase("my-collection");
MongoCollection<Document> model = database.getCollection(collectionName);
AggregateIterable<Document> aggregateResult = model.aggregate(bsonListOfAggregationPipeline);
// Instead of iterating over the results, call toCollection() to run the pipeline without returning them
aggregateResult.toCollection();
References:
https://jira.mongodb.org/browse/JAVA-3700
https://developer.mongodb.com/community/forums/t/mongo-java-driver-out-fails-in-lastpipelinestage-when-aggregate/9784

I was able to resolve this by using MongoTemplate.aggregateStream(...) with the options set on the aggregation:
aggregation.withOptions(Aggregation.newAggregationOptions().cursorBatchSize(0).build())

Related

Get Collections in Java from MongoDB

I have a local MongoDB instance running via shell on windows 10.
University provided a Java Project for us to learn about queries.
Now, I have a database ("imdb") and want to get two collections from it ("movies","tweets").
The problem is, on the one hand
List<String> test = mongo.getDatabaseNames();
System.out.println(test); //prints [admin,config,imdb,local]
...
db = mongo.getDB("imdb");
System.out.println(db.getCollectionNames()); //prints []
There seem to be no collections on imdb but
db.createCollection("movies", new BasicDBObject());
Returns a com.mongodb.CommandFailureException, stating that a collection 'imdb.movies' already exists.
So how do I ensure that Java actually "loads" the Collections?
For Clarification: My goal is to have
System.out.println(db.getCollectionNames());
to print [movies,tweets] instead of []
You could try
Set<String> colls = db.getCollectionNames();
for (String s : colls) {
System.out.println(s);
}
ref: code examples
Which version of the MongoDB shell are you using?
If it's earlier than 3.0, the db.getCollectionNames() command will return no data.
ref: mongodb docs
Since version 3.0 you should use MongoDatabase.listCollectionNames() method.
http://mongodb.github.io/mongo-java-driver/3.8/driver/tutorials/databases-collections/#get-a-list-of-collections
MongoDatabase db = client.getDatabase(dbName);
MongoIterable<String> names = db.listCollectionNames();
You can do it like this:
// If you don't need any connection configuration, i.e. your MongoDB is on localhost at 127.0.0.1:27017
MongoClient client = new MongoClient(); // connect to MongoDB
MongoDatabase mongoDb = client.getDatabase("imdb"); // get the database
MongoCollection<Document> robosCollection = mongoDb.getCollection("movies"); // get the collection you want
MongoCursor<Document> cursor = robosCollection.find().cursor(); // MongoCursor implements the Iterator protocol
cursor.forEachRemaining(System.out::println); // print all documents in the collection using a method reference

How to deal with a JSON stored in a row of a very large unpartitioned Hive table

I'm using spark SQL (spark 2.1) to read in a hive table.
The schema of the Hive table is the following (simplified to the only field that is relevant to my problem; the others don't matter here):
Body type: Binary
The body is a JSON document with multiple fields, and the one I'm interested in is an array. Each element of this array is another JSON object that contains a date.
My goal is to obtain a dataset filled with all the objects of my array whose date is later than a given date.
To do so I use the following code :
SparkConf conf = // set the Kryo serializer and Tungsten to true
SparkSession ss = // set the conf on the Spark session
Dataset<String> dataset = creatMyDatasetWithTheGoodType(ss.sql("select * from mytable"));
Dataset<String> finalds = dataset.flatMap((FlatMapFunction<String, String>) json -> {
List<String> l = new ArrayList<>();
List<String> ldate = // I use Jackson to obtain the array of dates; this returns a list
for (int i = 0; i < ldate.size(); i++) {
// if the date is OK, I add it to l
}
return l.iterator();
}, Encoders.STRING());
(My code works on a small dataset; I included it to give an idea of what I am doing.)
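For readers following along, the filtering step described above could be sketched roughly like this. It is only an illustration under assumptions: the dates are taken to be ISO-8601 strings such as "2018-06-15" (the question does not say), and the class and method names are hypothetical. In the real job the list would come from the Jackson parsing step.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class DateFilterSketch {

    // Keeps only the dates strictly after the threshold.
    // Assumption: every entry is an ISO-8601 date (yyyy-MM-dd).
    static Iterator<String> keepAfter(List<String> dates, LocalDate threshold) {
        List<String> kept = new ArrayList<>();
        for (String d : dates) {
            if (LocalDate.parse(d).isAfter(threshold)) {
                kept.add(d);
            }
        }
        return kept.iterator();
    }

    public static void main(String[] args) {
        Iterator<String> it = keepAfter(
                List.of("2017-01-01", "2018-06-15"),
                LocalDate.of(2017, 12, 31));
        while (it.hasNext()) {
            System.out.println(it.next()); // prints 2018-06-15
        }
    }
}
```

Returning an Iterator matches what the flatMap lambda has to return for each input row.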
The problem is that this Hive table has about 22 million rows.
The job ran for 14 hours without finishing (I killed it, but there was no error or GC overhead).
I'm running it in yarn-client mode with 4 executors, each with 16 GB of memory and 1 core. The driver has 4 GB of memory.
I ran hdfs dfs -du hiveTableLocationPath and got about 45 GB as a result.
What can I do to tune my job?
I recommend trying this UDTF, which allows working on JSON columns within Hive.
It is then possible to manipulate large JSON documents and fetch the needed data in a distributed and optimized way.

MongoDB Java driver 3.2: parallel scan

The Mongo Java driver has an old method (essentially deprecated, through the deprecation of MongoClient.getDB) to explicitly perform a parallel scan on a collection.
As far as I can see, this is something along the lines of:
DB db = mongoClient.getDB("mydb");
DBCollection coll = db.getCollection("testCollection");
ParallelScanOptions parallelScanOptions = ParallelScanOptions
.builder()
.numCursors(4)
.batchSize(1000)
.build();
List<Cursor> cursors = coll.parallelScan(parallelScanOptions);
...
The question is: is there a new alternative in driver 3.2 (without using the deprecated DB API)?
You can use runCommand() to execute the parallelCollectionScan command directly.
For example:
MongoClient client = new MongoClient(new ServerAddress());
MongoDatabase database = client.getDatabase("databaseName");
Document commandResult = database.runCommand(new Document("parallelCollectionScan", "collectionName").append("numCursors", 3));
See also cursor batches

MongoInternalException while inserting into mongoDB

I was inserting data into MongoDB when I suddenly encountered this error, and I don't know how to fix it. Is it because the maximum size was exceeded? If not, why am I getting this error? Does anyone know how to fix it? Below is the error I encountered:
Exception in thread "main" com.mongodb.MongoInternalException: DBObject of size 163745644 is over Max BSON size 16777216
I know my dataset is large, but is there any other solution?
The document you are trying to insert exceeds the maximum BSON document size, i.e. 16 MB.
Here is the reference documentation : http://docs.mongodb.org/manual/reference/limits/
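To put the two numbers from the exception side by side, here is a small arithmetic sketch. It uses plain Java with no driver; the class and method names are made up for illustration, and only the two byte counts come from the error message.

```java
public class BsonSizeCheck {

    // 16 MiB: the "Max BSON size 16777216" from the exception message
    static final long MAX_BSON_SIZE = 16L * 1024 * 1024;

    // True if a document of this size stays within the limit
    static boolean fitsInOneDocument(long sizeInBytes) {
        return sizeInBytes <= MAX_BSON_SIZE;
    }

    // How many times larger than the limit the document is
    static double timesOverLimit(long sizeInBytes) {
        return (double) sizeInBytes / MAX_BSON_SIZE;
    }

    public static void main(String[] args) {
        long reported = 163_745_644L; // "DBObject of size 163745644"
        System.out.println(MAX_BSON_SIZE);               // prints 16777216
        System.out.println(fitsInOneDocument(reported)); // prints false
        System.out.println(timesOverLimit(reported));    // a bit under 10
    }
}
```

So the object is nearly ten times over the limit, which is why it has to be split up or stored another way.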
To store documents larger than the maximum size, MongoDB provides the GridFS API.
The mongofiles utility makes it possible to manipulate files stored in
your MongoDB instance in GridFS objects from the command line. It is
particularly useful as it provides an interface between objects stored
in your file system and GridFS.
Ref : MongoFiles
To insert a document larger than 16 MB you need to use GridFS. GridFS is an abstraction layer on MongoDB that divides data into chunks (255 KB by default). Since you are using Java, it is simple to use with the Java driver too. Here I am inserting an Elasticsearch jar (about 20 MB) into MongoDB. Sample code:
MongoClient mongo = new MongoClient("localhost", 27017);
DB db = mongo.getDB("testDB");
String newFileName = "elasticsearch-Jar";
File imageFile = new File("/home/impadmin/elasticsearch-1.4.2.tar.gz");
GridFS gfs = new GridFS(db);
//Insertion
GridFSInputFile inputFile = gfs.createFile(imageFile);
inputFile.setFilename(newFileName);
inputFile.put("name", "devender");
inputFile.put("age", 23);
inputFile.save();
//Fetch back
GridFSDBFile outputFile = gfs.findOne(newFileName);
Find out more here.
If you want to insert directly using the mongo client, you can use mongofiles, as mentioned in the other answer.
Hope that helps.....:)
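As a rough idea of what the chunking means for the size in the question, the sketch below works out how many 255 KB chunks a payload of 163745644 bytes would occupy. It is plain arithmetic, not GridFS itself; the class and method names are hypothetical, and it assumes the data is stored as a single file with the default chunk size.

```java
public class GridFsChunkMath {

    // Default chunk size mentioned above: 255 KB = 261120 bytes
    static final int DEFAULT_CHUNK_SIZE = 255 * 1024;

    // Number of chunks needed to hold fileSize bytes (ceiling division)
    static long chunkCount(long fileSize, int chunkSize) {
        if (fileSize < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        return (fileSize + chunkSize - 1) / chunkSize;
    }

    // Size of the final, possibly partial, chunk (0 for an empty file)
    static long lastChunkSize(long fileSize, int chunkSize) {
        if (fileSize == 0) {
            return 0;
        }
        long remainder = fileSize % chunkSize;
        return remainder == 0 ? chunkSize : remainder;
    }

    public static void main(String[] args) {
        long size = 163_745_644L; // size reported in the exception
        System.out.println(chunkCount(size, DEFAULT_CHUNK_SIZE));    // prints 628
        System.out.println(lastChunkSize(size, DEFAULT_CHUNK_SIZE)); // prints 23404
    }
}
```

Each chunk is far below the 16 MB document limit, which is the reason the approach works for large payloads.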

Mongo insert $currentDate in Java Driver

I've got a question about $currentDate
What is the best way to insert a document into MongoDB so that it contains the "server time" (like now() in some RDBMSs) using the Java driver?
For example, let's say I have a document like:
{
name : "John",
birthday : <$currentDate_goes_here>
}
What I want is to insert the document so that the evaluation of the date would be done by mongo server at the time of insertion on the server side.
This is critical because our servers might not be totally synchronized and there is a need to have the time we can rely on (for example the time on mongo server).
I'm using a standard java driver for mongo, so any code snippet in Java will be more than welcome.
This is what I've tried so far
MongoClient mongoClient = new MongoClient();
DB sampleDB = mongoClient.getDB("sampleDB");
BasicDBObject update =
new BasicDBObject("$set", new BasicDBObject("name","john")
.append("$currentDate", new BasicDBObject("birthday",true)));
sampleDB.getCollection("col1").insert(update);
This thing fails on the following exception:
java.lang.IllegalArgumentException: Document field names can't start with '$' (Bad Key: '$set')
at com.mongodb.DBCollection.validateKey(DBCollection.java:1845)
at com.mongodb.DBCollection._checkKeys(DBCollection.java:1803)
at com.mongodb.DBCollection._checkObject(DBCollection.java:1790)
at com.mongodb.DBCollectionImpl.applyRulesForInsert(DBCollectionImpl.java:392)
at com.mongodb.DBCollectionImpl.insertWithCommandProtocol(DBCollectionImpl.java:381)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:186)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:165)
at com.mongodb.DBCollection.insert(DBCollection.java:93)
at com.mongodb.DBCollection.insert(DBCollection.java:78)
at com.mongodb.DBCollection.insert(DBCollection.java:120)
In which case the answer is fairly simple. It is really about serializing from java BasicDBObject classes to the basic MongoDB interpretation. Without respect to your actual "query" document the "update" document part of your statement should be:
BasicDBObject update = new BasicDBObject("$set", new BasicDBObject("name","john"))
.append("$currentDate", new BasicDBObject("birthday",true));
Which will indeed use the "server time" at the point of "update insertion" or "modification" with respect to the $currentDate modifier as used.
Just to be clear here, you don't use the .insert() method but an "upsert" operation with .update(). The "query" and "update" syntax applies. Also see the $setOnInsert operator for specifically not modifying existing documents.
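To make the shape of that update document concrete, here is a minimal sketch that needs no driver. java.util.Map stands in for BasicDBObject purely for illustration, and the class and method names are hypothetical. The point it shows is that $currentDate has to sit next to $set at the top level; if append(...) is chained onto the inner object instead, it ends up nested inside $set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UpdateShapeSketch {

    // Intended shape: $set and $currentDate are sibling top-level operators
    static Map<String, Object> siblingOperators() {
        Map<String, Object> set = new LinkedHashMap<>();
        set.put("name", "john");
        Map<String, Object> currentDate = new LinkedHashMap<>();
        currentDate.put("birthday", true);

        Map<String, Object> update = new LinkedHashMap<>();
        update.put("$set", set);
        update.put("$currentDate", currentDate);
        return update;
    }

    // Shape you get when the append is applied to the inner object:
    // $currentDate becomes a field inside $set
    static Map<String, Object> nestedOperator() {
        Map<String, Object> currentDate = new LinkedHashMap<>();
        currentDate.put("birthday", true);
        Map<String, Object> set = new LinkedHashMap<>();
        set.put("name", "john");
        set.put("$currentDate", currentDate);

        Map<String, Object> update = new LinkedHashMap<>();
        update.put("$set", set);
        return update;
    }

    public static void main(String[] args) {
        System.out.println(siblingOperators()); // {$set={name=john}, $currentDate={birthday=true}}
        System.out.println(nestedOperator());   // {$set={name=john, $currentDate={birthday=true}}}
    }
}
```

With BasicDBObject the difference comes down to where the parenthesis of the inner "name" object closes before .append(...) is called.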
You can also use the aggregation variable "$$NOW" if you are using an aggregation pipeline with an update method.
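MongoClient mongoClient = new MongoClient();
MongoDatabase sampleDB = mongoClient.getDatabase("sampleDB");
// query is the filter that selects the document to update
Document update =
new Document("$set", new Document("name","john")
.append("birthday", new BsonString("$$NOW")));
// passing a list makes this a pipeline update, so "$$NOW" is evaluated on the server
sampleDB.getCollection("col1").updateOne(query, List.of(update));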
You can also use "$$NOW" with aggregation operators such as $add, $subtract, and many more, to compute more specific values (including dates) on the database side.
If you want to pass the application server's time instead of the database time, use the following code to send the current time. Keep in mind that the application server's time may differ from the database server's time when deciding whether to use this.
new BsonDateTime(Instant.now().toEpochMilli())
Sample Code:
MongoClient mongoClient = new MongoClient();
MongoDatabase sampleDB = mongoClient.getDatabase("sampleDB");
// query is the filter that selects the document to update
Document update =
new Document("$set", new Document("name","john")
.append("birthday", new BsonDateTime(Instant.now().toEpochMilli())));
sampleDB.getCollection("col1").updateOne(query, update);
