I'd like to imagine there's existing API functionality for this. Suppose there was Java code that looks something like this:
JavaRDD<Integer> queryKeys = ...; //values not particularly important
List<Document> allMatches = db.getCollection("someDB").find(queryKeys); //doesn't work, I'm aware
JavaPairRDD<Integer, Iterator<ObjectContainingKey>> dbQueryResults = ...;
Goal of this: after a series of data transformations, I end up with an RDD of integer keys that I'd like to turn into a single database query (rather than one query per key) based on this collection of keys.
From there, I'd like to turn the query results into a pair RDD of each key and all of its results in an iterator, making it easy to hit the ground running for the next steps I intend to take. To clarify: I mean a pair of the key and its results as an iterator.
I know there's functionality in MongoDB capable of coordinating with Spark, but I haven't found anything that'll work with this yet (it seems to lean towards writing to a database rather than querying it).
I managed to figure this out in an efficient enough manner.
JavaRDD<Integer> queryKeys = ...;
// Build one equality sub-query per key, e.g. {"keyName": 42}
JavaRDD<BasicDBObject> queries = queryKeys.map(value -> new BasicDBObject("keyName", value));
// Collect the sub-queries to the driver and combine them into a single $or query
BasicDBObject orQuery = SomeHelperClass.buildOrQuery(queries.collect());
// One round trip to the database for all keys
List<Document> queryResults = db.getCollection("docs").find(orQuery).into(new ArrayList<>());
// Redistribute the results across the cluster and group them by key
JavaRDD<Document> parallelResults = sparkContext.parallelize(queryResults);
JavaRDD<ObjectContainingKey> results = parallelResults.map(doc -> SomeHelperClass.fromJSONtoObj(doc));
JavaPairRDD<Integer, Iterable<ObjectContainingKey>> keyResults = results.groupBy(obj -> obj.getKey());
And the method buildOrQuery here:
public static BasicDBObject buildOrQuery(List<BasicDBObject> queries) {
    BasicDBList or = new BasicDBList();
    for (BasicDBObject query : queries) {
        or.add(query);
    }
    return new BasicDBObject("$or", or);
}
Note that fromJSONtoObj is a method that converts the document back from JSON into the required field variables, and that obj.getKey() is simply the getter associated with whatever the "key" is.
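As an aside, the same single round trip can be expressed with an $in query over the collected keys instead of building an $or of per-key equality clauses; a minimal sketch, reusing the hypothetical "keyName" field and "docs" collection from above:
List<Integer> keys = queryKeys.collect();
// {"keyName": {"$in": [k1, k2, ...]}} matches any document whose keyName is in the list
BasicDBObject inQuery = new BasicDBObject("keyName", new BasicDBObject("$in", keys));
List<Document> matches = db.getCollection("docs").find(inQuery).into(new ArrayList<>());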
I have a table Book with bookId and lastBorrowed as hash and range keys, respectively.
Let's say each time a book is borrowed, a new row is created.
(Yes, this is NOT sufficient, and I could just add a column to keep track of the count and update the lastBorrowed date. But let's just say I'm stuck with this design and there's nothing I can do about it.)
Given a set of bookIds (or hash keys), I would like to be able to query the last time each book was borrowed.
I attempted to use QueryRequest, but kept getting com.amazonaws.AmazonServiceException: Attempted conditional constraint is not an indexable operation
final Map<String, Condition> keyConditions =
        Collections.singletonMap(hashKeyFieldName, new Condition()
                .withComparisonOperator(ComparisonOperator.IN)
                .withAttributeValueList(hashKeys.stream()
                        .map(hashKey -> new AttributeValue(hashKey))
                        .collect(Collectors.toList())));
I also tried using BatchGetItemRequest, but it didn't work, either:
final KeysAndAttributes keysAndAttributes = new KeysAndAttributes()
        .withConsistentRead(areReadsConsistent);
hashKeys.forEach(hashKey -> keysAndAttributes.addExpressionAttributeNamesEntry(hashKeyFieldName, hashKey));
final Map<String, KeysAndAttributes> requestedItemsByTableName = newHashMap();
requestedItemsByTableName.put(tableName, keysAndAttributes);
final BatchGetItemRequest request = new BatchGetItemRequest().withRequestItems(requestedItemsByTableName);
Any suggestion would be much appreciated!
Or if someone can tell me this is currently not supported at all, then I guess I'll just move on!
You can do this; in fact it's very easy. All you have to do is execute a Query for your bookId and then take the first result.
By the way, your table design sounds absolutely fine; the only quibble is that the attribute should probably be called borrowed rather than lastBorrowed.
You can have multiple results for a single bookId, but because lastBorrowed is your range key, the results come back ordered by that attribute.
You seem to be using legacy functions; are you editing old code?
If not, execute your Query something like this:
//Setting up your DynamoDB connection
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
.withRegion(Regions.US_WEST_2).build();
DynamoDB dynamoDB = new DynamoDB(client);
Table table = dynamoDB.getTable("YOURTABLE");
//Define the Query
QuerySpec spec = new QuerySpec()
    .withKeyConditionExpression("bookId = :book_id")
    .withValueMap(new ValueMap()
        .withString(":book_id", "12345"))
    .withScanIndexForward(false); // false = descending by the lastBorrowed range key, so the most recent borrow comes first
//Execute the query
ItemCollection<QueryOutcome> items = table.query(spec);
//Print out your results - the first item is the most recent borrow
Iterator<Item> iterator = items.iterator();
while (iterator.hasNext()) {
System.out.println(iterator.next().toJSONPretty());
}
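If you only want the single most recent borrow rather than printing everything, here is a hedged sketch of taking just the first item from the same ItemCollection (the attribute name is taken from the question):
Iterator<Item> it = items.iterator();
if (it.hasNext()) {
    Item mostRecent = it.next(); // first item in descending order = latest borrow
    System.out.println(mostRecent.getString("lastBorrowed"));
}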
The title of the question may give you the impression that it is a duplicate, but in my view it is not.
I am just a few months into Java and a month into MongoDB, Spring Boot and REST.
I have a Mongo collection with 3 fields per document: _id (the default field), appName and appKey. I am currently iterating through a List of all the documents to find the one whose appName and appKey match the ones passed in. The collection has only 4 entries right now, so it runs smoothly. But I have been reading about collections and found that once a collection holds a larger number of documents, looking a result up through a List will be much slower than through a HashMap.
But as I have already said, I am quite new to Java and am having a bit of trouble converting my code to a HashMap, so I was hoping someone could guide me through this.
I am also attaching my code for reference.
public List<Document> fetchData() {
    // Collection that stores appName and appKey
    MongoCollection<Document> collection = db.getCollection("info");
    List<Document> nameAndKeyList = new ArrayList<Document>();
    // Getting the list of appName and appKey from the info DB
    AggregateIterable<Document> output = collection
            .aggregate(Arrays.asList(new BasicDBObject("$group", new BasicDBObject("_id",
                    new BasicDBObject("_id", "$id").append("appName", "$appName").append("appKey", "$appKey"))
            )));
    for (Document doc : output) {
        nameAndKeyList.add((Document) doc.get("_id"));
    }
    return nameAndKeyList;
}// End of Method
And then I am calling it in another method of the same class:
List<Document> nameAndKeyList = new ArrayList<>();
//InfoController is the name of the class
InfoController obj1 = new InfoController();
nameAndKeyList = obj1.fetchData();
// Fetching and checking whether the appName & appKey pair
// is present in the DB, one by one.
// If appName & appKey mismatch, it increments the value
// of 'i' and checks against the other values in the DB
for (int i = 0; i < nameAndKeyList.size(); i++) {
"followed by my code"
And if I am not wrong, there will then be no need for the above loop either.
Thanks in advance.
You just need a simple find query to get the record you need directly from MongoDB:
Document document = collection
.find(new Document("appName", someappname).append("appKey", someappkey)).first();
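Note that first() returns null when no document matches, which conveniently covers the mismatch case; a minimal sketch, reusing the variable from the snippet above (how you handle the mismatch is an assumption, not part of the answer):
if (document == null) {
    // no document with this appName/appKey pair exists: reject the request
} else {
    // proceed with the matched document
}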
First of all, a List is not inherently much slower or faster than a HashMap; it depends on how you access it. A HashMap is commonly used to store key-value pairs, such as "ID" → "Name". In your case I see you are using an ArrayList without a specified initial size; consider a LinkedList when you do not know the size, because an ArrayList is backed by an array that grows by copying. If you want to build a HashMap out of the list, you need to map an identifier to the value for each record:
HashMap<String /* type of the identifier */, String /* type of the value */> map = new HashMap<String, String>();
for (Document doc : output) {
    // Document.get returns Object, so cast to the declared key/value types
    map.put((String) doc.get("_id"), (String) doc.get("_value"));
}
First, avoid premature optimization (look up the expression if you don't know it). Put a realistic number of thousands of items containing near-realistic data in your list. Try to retrieve an item that isn't there; this forces your for loop to traverse the entire list. See how long it takes. Try a number of times to get an impression of whether you get impatient. If you don't, you're done.
If you find out that you need a speed-up, I agree that a HashMap is one of the obvious solutions to try. One of the first things to consider is the key type for your HashMap. As I understand it, you need to search for an item where appName and appKey are both right. A good solution is to write a simple class with these two fields and equals and hashCode methods (I'll call it DocumentHashMapKey for now; think of a better name). For hashCode(), try Objects.hash(appName, appKey); if it doesn't give satisfactory performance with the data you have, consider alternatives. Now you are ready to build your HashMap<DocumentHashMapKey, Document>.
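A minimal sketch of such a key class, under the assumptions above (the class name DocumentHashMapKey and the String field types are illustrative):
import java.util.Objects;

public final class DocumentHashMapKey {
    private final String appName;
    private final String appKey;

    public DocumentHashMapKey(String appName, String appKey) {
        this.appName = appName;
        this.appKey = appKey;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof DocumentHashMapKey)) return false;
        DocumentHashMapKey other = (DocumentHashMapKey) o;
        return Objects.equals(appName, other.appName) && Objects.equals(appKey, other.appKey);
    }

    @Override
    public int hashCode() {
        // as suggested above; swap in an alternative if profiling shows poor distribution
        return Objects.hash(appName, appKey);
    }
}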
If you’re lazy or just want a first impression of how a HashMap performs, you may also build your keys by concatenating appName + "$##" + appKey (where the string in the middle is something that is unlikely to be part of a name or key) and use HashMap<String, Document>.
Everything I said can be refined depending on your needs. This was just to get you started.
Thanks everyone for your help, without which I would not have got to a solution.
public HashMap<String, String> fetchData() {
    // Collection that stores appName and appKey
    MongoCollection<Document> collection = db.getCollection("info");
    HashMap<String, String> appKeys = new HashMap<String, String>();
    // Getting the list of appName and appKey from the info DB
    AggregateIterable<Document> output = collection
            .aggregate(Arrays.asList(new BasicDBObject("$group", new BasicDBObject("_id",
                    new BasicDBObject("_id", "$id").append("appName", "$appName").append("appKey", "$appKey"))
            )));
    String appName = null;
    String appKey = null;
    for (Document doc : output) {
        Document temp = (Document) doc.get("_id");
        appName = (String) temp.get("appName");
        appKey = (String) temp.get("appKey");
        appKeys.put(appName, appKey);
    }
    return appKeys;
}// End of Method
Calling the above method from another method of the same class:
InfoController obj = new InfoController();
// Fetching the values of 'appName' & 'appKey' from the 'info' DB
HashMap<String, String> appKeys = obj.fetchData();
storedAppkey = appKeys.get(appName);
// Handling the case of mismatch
if (storedAppkey == null || storedAppkey.compareTo(appKey) != 0) {
    // Then the response and further processing that I need to do.
What the HashMap has done is make my code more readable, and the 'for' loop I was using for iterating is gone, although it might not make much difference in performance as of now.
Thanks once again to everyone for your help and support.
I'm trying to get a list of mongo "_ids" from a database using Java. I don't need any other part of the objects in the database, just the "_id".
This is what I'm doing right now:
// Another method queries for all objects of a certain type within the database.
Collection<MyObject> thingies = this.getMyObjects();
Collection<String> ids = new LinkedList<String>();
for (MyObject thingy : thingies) {
ids.add(thingy.getGuid());
}
This seems horribly inefficient though... is there a way just to query mongo for objects of a certain type and return only their "_ids" without having to reassemble the entire object and extract it?
Thanks!
The find() method has an overload where you can pass the keys that you want to retrieve back from the query or those that you don't want.
So you could try this:
BasicDBObject query = new BasicDBObject("someKey", "someValue");
// Project only the _id field in the results
BasicDBObject keys = new BasicDBObject("_id", 1);
DBCursor cursor = collection.find(query, keys);
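From there, collecting just the ids is straightforward; a sketch using the cursor from the snippet above (the toString() call assumes an ObjectId _id; adjust if your _id is already a String):
List<String> ids = new ArrayList<>();
while (cursor.hasNext()) {
    // only _id was projected, so no full object is assembled
    ids.add(cursor.next().get("_id").toString());
}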
I have a List of complex objects, where each complex object contains more than two fields. Something like {car1, car2, car3}, where each car has a name and a type field.
Is there a simpler way of inserting a list of cars? Something like
DBObject updateObject = new BasicDBObject().append("$push", new BasicDBObject().append("cars", cars));
I tried with $pushAll and it does not seem to work. I did a bit more research and found that it needs mapping information, which is one of the reasons why this insertion is failing.
What's the best way to do this insertion into MongoDB? Some sample code or direction would be helpful. Please note this has to be done through Java.
Well, if you don't want to translate your car objects to DBObjects manually, there are mapping frameworks out there, like Morphia.
Personally, I would just wire up a mapping method manually, though. The code could look like this (untested, typos to be expected):
BasicDBObject updateObject = new BasicDBObject();
BasicDBList dbCarList = mapCars(cars);
updateObject.append("$push", new BasicDBObject("cars", dbCarList));
...

private BasicDBList mapCars(List<Car> cars) {
    BasicDBList result = new BasicDBList();
    for (Car car : cars) {
        BasicDBObject dbCar = new BasicDBObject();
        dbCar.append("name", car.getName());
        dbCar.append("type", car.getType()); // the question mentions a type field as well
        result.add(dbCar);
    }
    return result;
}
Update: as Sammaye pointed out in the comments, use $set instead of $push if you want to replace the list; $push appends elements to the array without removing what was there before.
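For completeness, a sketch of that $set variant, reusing the mapCars helper from above:
// Replaces the whole cars array instead of appending to it
BasicDBObject setObject = new BasicDBObject("$set", new BasicDBObject("cars", mapCars(cars)));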
Friends!
I am using MongoDB in a Java project via Spring Data, and I use Repository interfaces to access data in collections. For some processing I need to iterate over all elements of a collection. I can use the repository's fetchAll method, but it always returns an ArrayList.
However, one of the collections is expected to be large: up to a million records of at least several kilobytes each. I suppose I should not use fetchAll in such cases, but I could find neither a convenient method returning an iterator (which would allow the collection to be fetched partially) nor a convenient method accepting a callback.
I've only seen support for retrieving such collections in pages. I wonder whether that is the only way to work with such collections?
Late response, but maybe it will help someone in the future. Spring Data doesn't provide any API to wrap MongoDB cursor capabilities: it uses cursors within its find methods but always returns a completed list of objects. Your options are to use the Mongo API directly or to use the Spring Data paging API, something like this:
final int pageLimit = 300;
int pageNumber = 0;
Page<T> page = repository.findAll(new PageRequest(pageNumber, pageLimit));
while (page.hasNextPage()) {
    processPageContent(page.getContent());
    page = repository.findAll(new PageRequest(++pageNumber, pageLimit));
}
// process the last page
processPageContent(page.getContent());
UPD(!): This method is not sufficient for large data sets (see Shawn Bush's comments). Please use the Mongo API directly for such cases.
Since this question got bumped recently, this answer needs some more love!
If you use Spring Data Repository interfaces, you can declare a custom method that returns a Stream, and it will be implemented by Spring Data using cursors:
import java.util.stream.Stream;

public interface AlarmRepository extends CrudRepository<Alarm, String> {
    Stream<Alarm> findAllBy();
}
So for large amounts of data you can stream them and process the documents one by one without memory limitations.
See https://docs.spring.io/spring-data/mongodb/docs/current/reference/html/#mongodb.repositories.queries
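A hedged usage sketch for such a repository method (the alarmRepository bean wiring is assumed; cursor-backed streams should be closed, so try-with-resources is advisable):
try (Stream<Alarm> alarms = alarmRepository.findAllBy()) {
    alarms.forEach(alarm -> System.out.println(alarm));
}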
You can still use mongoTemplate to access the collection and simply use a DBCursor:
DBCollection collection = mongoTemplate.getCollection("boundary");
DBCursor cursor = collection.find();
while (cursor.hasNext()) {
    DBObject obj = cursor.next();
    Object object = obj.get("polygons");
    ...
}
Use MongoTemplate::stream() as probably the most appropriate Java wrapper to DBCursor
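A minimal sketch of that approach, assuming a configured MongoTemplate and a mapped Alarm class (note that older Spring Data MongoDB versions return a CloseableIterator rather than a java.util.stream.Stream; either way, close it when done):
try (Stream<Alarm> alarms = mongoTemplate.stream(new Query(), Alarm.class)) {
    alarms.forEach(alarm -> System.out.println(alarm));
}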
Another way (declarations mirror the earlier answer):
final int pageLimit = 300;
int pageNumber = 0;
Page<T> page;
do {
    page = repository.findAll(new PageRequest(pageNumber, pageLimit));
    // process page.getContent() here
    pageNumber++;
} while (!page.isLastPage());
Check out the new method for handling results on a per-document basis:
http://docs.spring.io/spring-data/mongodb/docs/current/api/org/springframework/data/mongodb/core/MongoTemplate.html#executeQuery-org.springframework.data.mongodb.core.query.Query-java.lang.String-org.springframework.data.mongodb.core.DocumentCallbackHandler-
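A minimal sketch of that per-document callback, assuming a configured mongoTemplate (the "boundary" collection name is borrowed from the earlier answer; in older Spring Data versions the callback receives a DBObject rather than a Document):
mongoTemplate.executeQuery(new Query(), "boundary", new DocumentCallbackHandler() {
    @Override
    public void processDocument(Document document) {
        // handle one document at a time; nothing is accumulated in memory
        System.out.println(document.toJson());
    }
});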
You may want to try the DBCursor way like this:
DBObject query = new BasicDBObject(); // set up the query criteria
query.put("method", method);
query.put("ctime", (new BasicDBObject("$gte", bTime)).append("$lt", eTime));
logger.debug("query: {}", query);

DBObject fields = new BasicDBObject(); // only get the needed fields
fields.put("_id", 0);
fields.put("uId", 1);
fields.put("ctime", 1);

DBCursor dbCursor = mongoTemplate.getCollection("collectionName").find(query, fields);
while (dbCursor.hasNext()) {
    DBObject object = dbCursor.next();
    logger.debug("object: {}", object);
    // do something.
}
The best way to iterate over a large collection is to use the Mongo API directly. I used the code below and it worked like a charm for my use case.
I had to iterate over more than 15M records, and the document size was huge for some of them.
The following code is in Kotlin Spring Boot App (Spring Boot Version: 2.4.5)
fun getAbcCursor(batchSize: Int, from: Long?, to: Long?): MongoCursor<Document> {
    val collection = xyzMongoTemplate.getCollection("abc")
    val query = Document("field1", "value1")
    if (from != null) {
        val fromDate = Date(from)
        val toDate = if (to != null) { Date(to) } else { Date() }
        query.append(
            "createTime",
            Document(
                "\$gte", fromDate
            ).append(
                "\$lte", toDate
            )
        )
    }
    return collection.find(query).batchSize(batchSize).iterator()
}
Then, from a service layer method, you can just keep calling MongoCursor.next() on the returned cursor for as long as MongoCursor.hasNext() returns true.
An important observation: do not miss setting batchSize on the 'FindIterable' (the return type of MongoCollection.find()). If you don't provide a batch size, the cursor will fetch the initial 101 records and hang after that (it tries to fetch all the remaining records at once).
For my scenario I used a batch size of 2000, as it gave the best results during testing. The optimal batch size depends on the average size of your records.
Here is the equivalent code in Java (removing createTime from query as it is specific to my data model).
MongoCursor<Document> getAbcCursor(int batchSize) {
    MongoCollection<Document> collection = xyzMongoTemplate.getCollection("your_collection_name");
    Document query = new Document("field1", "value1"); // query --> {"field1": "value1"}
    return collection.find(query).batchSize(batchSize).iterator();
}
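A hedged usage sketch for the cursor returned above (MongoCursor implements Closeable, so try-with-resources closes it even if processing fails; the batch size of 2000 is just the value that worked for my data):
try (MongoCursor<Document> cursor = getAbcCursor(2000)) {
    while (cursor.hasNext()) {
        Document doc = cursor.next();
        // process one document at a time
    }
}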
This answer is based on: https://stackoverflow.com/a/22711715/5622596
That answer needs a bit of an update, as the way a PageRequest is constructed has changed.
With that said here is my modified response:
// Page numbers are zero-based, so start at 0
int pageNumber = 0;
// Change this value to whatever size you want each page to have
int pageLimit = 100;
Page<SomeClass> page;
List<SomeClass> compoundList = new LinkedList<>();
do {
    PageRequest pageRequest = PageRequest.of(pageNumber, pageLimit);
    page = repository.findAll(pageRequest);
    List<SomeClass> listFromPage = page.getContent();
    // Do something with this list, for example:
    compoundList.addAll(listFromPage);
    pageNumber++;
} while (!page.isLast());
// Do something with the compoundList, for example:
return compoundList;