Iterate over large collection in MongoDB via spring-data - java

Friends!
I am using MongoDB in java project via spring-data. I use Repository interfaces to access data in collections. For some processing I need to iterate over all elements of collection. I can use fetchAll method of repository, but it always return ArrayList.
However, it is supposed that one of collections would be large - up to 1 million records several kilobytes each at least. I suppose I should not use fetchAll in such cases, but I could not find neither convenient methods returning some iterator (which may allow collection to be fetched partially), nor convenient methods with callbacks.
I've seen only support for retrieving such collections in pages. I wonder whether it is the only way for working with such collections?

Late response, but maybe will help someone in the future. Spring data doesn't provide any API to wrap Mongo DB Cursor capabilities. It uses it within find methods, but always returns completed list of objects. Options are to use Mongo API directly or to use Spring Data Paging API, something like that:
final int pageLimit = 300;
int pageNumber = 0;
Page<T> page = repository.findAll(new PageRequest(pageNumber, pageLimit));
while (page.hasNextPage()) {
processPageContent(page.getContent());
page = repository.findAll(new PageRequest(++pageNumber, pageLimit));
}
// process last page
processPageContent(page.getContent());
UPD (!) This method is not sufficient for large sets of data (see #Shawn Bush comments) Please use Mongo API directly for such cases.

Since this question got bumped recently, this answer needs some more love!
If you use Spring Data Repository interfaces, you can declare a custom method that returns a Stream, and it will be implemented by Spring Data using cursors:
import java.util.Stream;
public interface AlarmRepository extends CrudRepository<Alarm, String> {
Stream<Alarm> findAllBy();
}
So for the large amount of data you can stream them and process the line by line without memory limitation.
See https://docs.spring.io/spring-data/mongodb/docs/current/reference/html/#mongodb.repositories.queries

you can still use mongoTemplate to access the Collection and simply use DBCursor:
DBCollection collection = mongoTemplate.getCollection("boundary");
DBCursor cursor = collection.find();
while(cursor.hasNext()){
DBObject obj = cursor.next();
Object object = obj.get("polygons");
..
...
}

Use MongoTemplate::stream() as probably the most appropriate Java wrapper to DBCursor

Another way:
do{
page = repository.findAll(new PageRequest(pageNumber, pageLimit));
pageNumber++;
}while (!page.isLastPage());

Check new method to handle results per document basis.
http://docs.spring.io/spring-data/mongodb/docs/current/api/org/springframework/data/mongodb/core/MongoTemplate.html#executeQuery-org.springframework.data.mongodb.core.query.Query-java.lang.String-org.springframework.data.mongodb.core.DocumentCallbackHandler-

You may want to try the DBCursor way like this:
DBObject query = new BasicDBObject(); //setup the query criteria
query.put("method", method);
query.put("ctime", (new BasicDBObject("$gte", bTime)).append("$lt", eTime));
logger.debug("query: {}", query);
DBObject fields = new BasicDBObject(); //only get the needed fields.
fields.put("_id", 0);
fields.put("uId", 1);
fields.put("ctime", 1);
DBCursor dbCursor = mongoTemplate.getCollection("collectionName").find(query, fields);
while (dbCursor.hasNext()){
DBObject object = dbCursor.next();
logger.debug("object: {}", object);
//do something.
}

The best way to iterator over a large collection is to use the Mongo API directly. I used the below code and it worked like a charm for my use-case.
I had to iterate over more than 15M records and the document size was huge for some of those.
The following code is in Kotlin Spring Boot App (Spring Boot Version: 2.4.5)
fun getAbcCursor(batchSize: Int, from: Long?, to: Long?): MongoCursor<Document> {
val collection = xyzMongoTemplate.getCollection("abc")
val query = Document("field1", "value1")
if (from != null) {
val fromDate = Date(from)
val toDate = if (to != null) { Date(to) } else { Date() }
query.append(
"createTime",
Document(
"\$gte", fromDate
).append(
"\$lte", toDate
)
)
}
return collection.find(query).batchSize(batchSize).iterator()
}
Then, from a service layer method, you can just keep calling MongoCursor.next() on returned cursor till MongoCursor.hasNext() returns true.
An Important Observation: Please do not miss adding batchSize on 'FindIterable' (the return type of MongoCollection.find()). If you won't provide the batch size, the cursor will fetch initial 101 records and will hang after that (it tries to fetch all the remaining records at once).
For my scenario, I used the batch size as 2000, as it gave the best results during testing. This optimized batch size will be impacted by the average size of your records.
Here is the equivalent code in Java (removing createTime from query as it is specific to my data model).
MongoCursor<Document> getAbcCursor(Int batchSize) {
MongoCollection<Document> collection = xyzMongoTemplate.getCollection("your_collection_name");
Document query = new Document("field1", "value1");// query --> {"field1": "value1"}
return collection.find(query).batchSize(batchSize).iterator();
}

This answer is based on: https://stackoverflow.com/a/22711715/5622596
That answer needs a bit of an update as PageRequest has changed how it is being constructed.
With that said here is my modified response:
int pageNumber = 1;
//Change value to whatever size you want the page to have
int pageLimit = 100;
Page<SomeClass> page;
List<SomeClass> compondList= new LinkedList<>();
do{
PageRequest pageRequest = PageRequest.of(pageNumber, pageLimit);
page = repository.findAll(pageRequest);
List<SomeClass> listFromPage = page.getContent();
//Do something with this list example below
compondList.addAll(listFromPage);
pageNumber++;
}while (!page.isLast());
//Do something with the compondList: example below
return compondList;

Related

How do I query in MongoDB with Apache Spark JavaRDDs?

I'd like to imagine there's existing API functionality for this. Suppose there was Java code that looks something like this:
JavaRDD<Integer> queryKeys = ...; //values not particularly important
List<Document> allMatches = db.getCollection("someDB").find(queryKeys); //doesn't work, I'm aware
JavaPairRDD<Integer, Iterator<ObjectContainingKey>> dbQueryResults = ...;
Goal of this: After a bunch of data transformations, I end up with an RDD of integer keys that I'd like to make a single db query with (rather than a bunch of queries) based on this collection of keys.
From there, I'd like to turn the query results into a pair RDD of the key and all of its results in an iterator (making it easy to hit the ground going again for the next steps I'm intending to take). And to clarify, I mean a pair of the key and its results as an iterator.
I know there's functionality in MongoDB capable of coordinating with Spark, but I haven't found anything that'll work with this yet (it seems to lean towards writing to a database rather than querying it).
I managed to figure this out in an efficient enough manner.
JavaRDD<Integer> queryKeys = ...;
JavaRDD<BasicDBObject> queries = queryKeys.map(value -> new BasicDBObject("keyName", value));
BasicDBObject orQuery = SomeHelperClass.buildOrQuery(queries.collect());
List<Document> queryResults = db.getCollection("docs").find(orQuery).into(new ArrayList<>());
JavaRDD<Document> parallelResults = sparkContext.parallelize(queryResults);
JavaRDD<ObjectContainingKey> results = parallelResults.map(doc -> SomeHelperClass.fromJSONtoObj(doc));
JavaPairRDD<Integer, Iterable<ObjectContainingKey>> keyResults = results.groupBy(obj -> obj.getKey());
And the method buildOrQuery here:
public static BasicDBObject buildOrQuery(List<BasicDBObject> queries) {
BasicDBList or = new BasicDBList();
for(BasicDBObject query : queries) {
or.add(query);
}
return new BasicDBObject("$or", or);
}
Note that there's a fromJSONtoObj method that will convert your object back from JSON into all of the required field variables. Also note that obj.getKey() is simply a getter method associated to whatever "key" it is.

Spring mongo repository slice

I am using spring-sata-mongodb 1.8.2 with MongoRepository and I am trying to use the mongo $slice option to limit a list size when query, but I can't find this option in the mongorepository.
my classes look like this:
public class InnerField{
public String a;
public String b;
public int n;
}
#Document(collection="Record")
punlic class Record{
public ObjectId id;
public List<InnerField> fields;
public int numer;
}
As you can see I have one collection name "Record" and the document contains the InnerField. the InnerField list is growing all the time so i want to limit the number of the selected fields when I am querying.
I saw that: https://docs.mongodb.org/v3.0/tutorial/project-fields-from-query-results/
which is exactly what I need but I couldn't find the relevant reference in mongorepository.
Any ideas?
Providing an abstraction for the $slice operator in Query is still an open issue. Please vote for DATAMONGO-1230 and help us prioritize.
For now you still can fall back to using BasicQuery.
String qry = "{ \"_id\" : \"record-id\"}";
String fields = "{\"fields\": { \"$slice\": 2} }";
BasicQuery query = new BasicQuery(qry, fields);
Use slice functionality as provided in Java Mongo driver using projection as in below code.
For Example:
List<Entity> list = new ArrayList<Entity>();
// Return the last 10 weeks data only
FindIterable<Document> list = db.getDBCollection("COLLECTION").find()
.projection(Projections.fields(Projections.slice("count", -10)));
MongoCursor<Document> doc = list.iterator();
while(doc.hasNext()){
list.add(new Gson().fromJson(doc.next().toJson(), Entity.class));
}
The above query will fetch all documents of type Entity class and the "field" list of each Entity class document will have only last 10 records.
I found in unit test file (DATAMONGO-1457) way to use slice. Some thing like this.
newAggregation(
UserWithLikes.class,
match(new Criteria()),
project().and("likes").slice(2)
);

Getting count and list of ids using ElasticsearchTemplate in spring-data-elasticsearch

I am using spring-data-elasticsearch for a project to provide it with full text search functionality. We keep the real data in a relational database and relevant metadata along with respective id in elasticsearch. So for search results, only id field is required as the actual data will be retrieved from the relational database.
I am building the search query based on search criteria and then performing a queryForIds():
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withIndices(indexName)
.withTypes(typeName)
.withQuery(getQueryBuilder(searchParams))
.withPageable(pageable)
.build();
return elasticsearchTemplate.queryForIds(searchQuery);
If I also need the total count for that specific searchQuery, I can do another elasticsearchTemplate.count(searchQuery) call, but that will be redundant as I understand. I think there is a way to get both the list of id and total count by using something like elasticsearchTemplate.queryForPage() in a single call.
Also, can I use a custom class in queryForPage(SearchQuery query, Class<T> clazz, SearchResultMapper mapper) which is not annotated with #Document? The actual document class is really big, and if I am not sure if passing large classes will cause any extra load on the engine since there are over 100 fields to be json mapped, but all I need is the id field. I will have a .withFields("id") in the query builder anyway.
If you want to prevent two calls to elasticsearch, i would suggest to write an custom ResultsExtractor:
SearchQuery searchQuery = new NativeSearchQueryBuilder().withIndices(indexName)
.withTypes(typeName)
.withQuery(queryBuilder)
.withPageable(pageable)
.build();
SearchResult result = template.query(searchQuery, new ResultsExtractor<SearchResult>() {
#Override
public SearchResult extract(SearchResponse response) {
long totalHits = response.getHits()
.totalHits();
List<String> ids = new ArrayList<String>();
for (SearchHit hit : response.getHits()) {
if (hit != null) {
ids.add(hit.getId());
}
}
return new SearchResult(ids, totalHits);
}
});
System.out.println(result.getIds());
System.out.println(result.getCount());
where SearchResult is a custom class:
public class SearchResult {
List<String> ids;
long count;
//getter and setter
}
This way you can get the information you need from the elasticsearch SearchResponse.
Regarding your second question: As far as I can see, when calling queryForPage(SearchQuery query, Class<T> clazz, SearchResultMapper mapper)
the passed class is not checked for the #Document annotation. Just try it out!
One may also consider using AggregatedPage<T>. You can get the total number of records, total pages, current page records, etc. just like in Pageable<T>.
SearchQuery searchQuery = new NativeSearchQueryBuilder().withIndices(indexName)
.withTypes(typeName)
.withQuery(queryBuilder)
.withPageable(pageable)
.build();
AggregatedPage<ElasticDTO> queryResult = elasticsearchTemplate.queryForPage(searchQuery , ElasticDTO.class);

How to get just the desired field from an array of sub documents in Mongodb using Java

I have just started using Mongo Db . Below is my data structure .
It has an array of skillID's , each of which have an array of activeCampaigns and each activeCampaign has an array of callsByTimeZone.
What I am looking for in SQL terms is :
Select activeCampaigns.callsByTimeZone.label,
activeCampaigns.callsByTimeZone.loaded
from X
where skillID=50296 and activeCampaigns.campaign_id= 11371940
and activeCampaigns.callsByTimeZone='PT'
The output what I am expecting is to get
{"label":"PT", "loaded":1 }
The Command I used is
db.cd.find({ "skillID" : 50296 , "activeCampaigns.campaignId" : 11371940,
"activeCampaigns.callsByTimeZone.label" :"PT" },
{ "activeCampaigns.callsByTimeZone.label" : 1 ,
"activeCampaigns.callsByTimeZone.loaded" : 1 ,"_id" : 0})
The output what I am getting is everything under activeCampaigns.callsByTimeZone while I am expecting just for PT
DataStructure :
{
"skillID":50296,
"clientID":7419,
"voiceID":1,
"otherResults":7,
"activeCampaigns":
[{
"campaignId":11371940,
"campaignFileName":"Aaron.name.121.csv",
"loaded":259,
"callsByTimeZone":
[{
"label":"CT",
"loaded":6
},
{
"label":"ET",
"loaded":241
},
{
"label":"PT",
"loaded":1
}]
}]
}
I tried the same in Java.
QueryBuilder query = QueryBuilder.start().and("skillID").is(50296)
.and("activeCampaigns.campaignId").is(11371940)
.and("activeCampaigns.callsByTimeZone.label").is("PT");
BasicDBObject fields = new BasicDBObject("activeCampaigns.callsByTimeZone.label",1)
.append("activeCampaigns.callsByTimeZone.loaded",1).append("_id", 0);
DBCursor cursor = coll.find(query.get(), fields);
String campaignJson = null;
while(cursor.hasNext()) {
DBObject campaignDBO = cursor.next();
campaignJson = campaignDBO.toString();
System.out.println(campaignJson);
}
the value obtained is everything under callsByTimeZone array. I am currently parsing the JSON obtained and getting only PT values . Is there a way to just query the PT fields inside activeCampaigns.callsByTimeZone .
Thanks in advance .Sorry if this question has already been raised in the forum, I have searched a lot and failed to find a proper solution.
Thanks in advance.
There are several ways of doing it, but you should not be using String manipulation (i.e. indexOf), the performance could be horrible.
The results in the cursor are nested Maps, representing the document in the database - a Map is a good Java-representation of key-value pairs. So you can navigate to the place you need in the document, instead of having to parse it as a String. I've tested the following and it works on your test data, but you might need to tweak it if your data is not all exactly like the example:
while (cursor.hasNext()) {
DBObject campaignDBO = cursor.next();
List callsByTimezone = (List) ((DBObject) ((List) campaignDBO.get("activeCampaigns")).get(0)).get("callsByTimeZone");
DBObject valuesThatIWant;
for (Object o : callsByTimezone) {
DBObject call = (DBObject) o;
if (call.get("label").equals("PT")) {
valuesThatIWant = call;
}
}
}
Depending upon your data, you might want to add protection against null values as well.
The thing you were looking for ({"label":"PT", "loaded":1 }) is in the variable valueThatIWant. Note that this, too, is a DBObject, i.e. a Map, so if you want to see what's inside it you need to use get:
valuesThatIWant.get("label"); // will return "PT"
valuesThatIWant.get("loaded"); // will return 1
Because DBObject is effectively a Map of String to Object (i.e. Map<String, Object>) you need to cast the values that come out of it (hence the ugliness in the first bit of code in my answer) - with numbers, it will depend on how the data was loaded into the database, it might come out as an int or as a double:
String theValueOfLabel = (String) valuesThatIWant.get("label"); // will return "PT"
double theValueOfLoaded = (Double) valuesThatIWant.get("loaded"); // will return 1.0
I'd also like to point out the following from my answer:
((List) campaignDBO.get("activeCampaigns")).get(0)
This assumes that "activeCampaigns" is a) a list and in this case b) only has one entry (I'm doing get(0)).
You will also have noticed that the fields values you've set are almost entirely being ignored, and the result is most of the document, not just the fields you asked for. I'm pretty sure you can only define the top-level fields you want the query to return, so your code:
BasicDBObject fields = new BasicDBObject("activeCampaigns.callsByTimeZone.label",1)
.append("activeCampaigns.callsByTimeZone.loaded",1)
.append("_id", 0);
is actually exactly the same as:
BasicDBObject fields = new BasicDBObject("activeCampaigns", 1).append("_id", 0);
I think some of the points that will help you to work with Java & MongoDB are:
When you query the database, it will return you the whole document of
the thing that matches your query, i.e. everything from "skillID"
downwards. If you want to select the fields to return, I think those will only be top-level fields. See the documentation for more detail.
To navigate the results, you need to know that a DBObjects are returned, and that these are effectively a Map<String,
Object> in Java - you can use get to navigate to the correct node,
but you will need to cast the values into the correct shape.
Replacing while loop from your Java code with below seems to give "PT" as output.
`while(cursor.hasNext()) {
DBObject campaignDBO = cursor.next();
campaignJson = campaignDBO.get("activeCampaigns").toString();
int labelInt = campaignJson.indexOf("PT", -1);
String label = campaignJson.substring(labelInt, labelInt+2);
System.out.println(label);
}`

Retrieve _ids from mongo database

I'm trying to get a list of mongo "_ids" from a database using Java. I don't need any other part of the objects in the database, just the "_id".
This is what I'm doing right now:
// Another method queries for all objects of a certain type within the database.
Collection<MyObject> thingies = this.getMyObjects();
Collection<String> ids = new LinkedList<String>();
for (MyObject thingy : thingies) {
ids.add(thingy.getGuid());
}
This seems horribly inefficient though... is there a way just to query mongo for objects of a certain type and return only their "_ids" without having to reassemble the entire object and extract it?
Thanks!
The find() method has an overload where you can pass the keys that you want to retrieve back from the query or those that you don't want.
So you could try this:
BasicDBObject qyery = new BasicDBObject("someKey","someValue");
BasicDBObject keys = new BasicDBObject("_id", 1);
DBCursor cursor = collection.find(query, keys);

Categories