How can I find Documents with Duplicate Array Elements? - java

Here is my Document:
{
"_id":"5b1ff7c53e3ac841302cfbc2",
"idProf":"5b1ff7c53e3ac841302cfbbf",
"pacientes":["5b20d2c83e3ac841302cfbdb","5b20d25f3e3ac841302cfbd0"]
}
I want to know how to find a duplicate entry in the array using MongoCollection in Java.
This is what I'm trying:
BasicDBObject query = new BasicDBObject("idProf", idProf);
query.append("$in", new BasicDBObject().append("pacientes", idJugador.toString()));
collection.find(query)

We can try to solve this in your Java-application code.
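private final MongoCollection<Document> collection;
public boolean hasDuplicatePacientes(String idProf) {
    // eq is statically imported from com.mongodb.client.model.Filters
    Document d = collection.find(eq("idProf", idProf)).first();
    if (d == null) {
        return false; // no document for this idProf
    }
    @SuppressWarnings("unchecked")
    List<String> pacientes = (List<String>) d.get("pacientes");
    if (pacientes == null || pacientes.isEmpty()) {
        return false;
    }
    // A set drops repeated values, so a size difference means duplicates
    Set<String> unique = new HashSet<>(pacientes);
    return pacientes.size() != unique.size();
}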
Or if you're searching for a way to do this fully on db-side, I believe it's also possible with something like Neil Lunn provided.

The best approach really is to compare the length of the array to the length of an array which would have all duplicates removed. A "Set" does not have duplicate entries, so what you need to do is convert an array into a "Set" and test against the original.
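As a client-side illustration of the same idea, here is a minimal sketch in plain Java. The class and method names (DuplicateCheck, hasDuplicatesBySize, hasDuplicatesEarlyExit) are hypothetical and not part of the MongoDB driver. The second method is a variation, not the size comparison itself: it stops at the first repeated element instead of building the whole set.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of the MongoDB driver.
final class DuplicateCheck {
    private DuplicateCheck() {}

    // Size comparison: a set drops repeats, so a smaller set means duplicates.
    static boolean hasDuplicatesBySize(Collection<?> values) {
        return new HashSet<>(values).size() != values.size();
    }

    // Early-exit variant: Set.add returns false for an element already present.
    static boolean hasDuplicatesEarlyExit(Collection<?> values) {
        Set<Object> seen = new HashSet<>();
        for (Object value : values) {
            if (!seen.add(value)) {
                return true;
            }
        }
        return false;
    }
}
```

Both methods give the same answer; the early-exit form may simply do less work on long arrays whose duplicates appear early.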
Modern MongoDB $expr
Modern MongoDB releases have $expr which can be used with aggregation expressions in a regular query. Here the expressions we would use are $setDifference and $size along with $ne for the boolean comparison:
Document query = new Document(
"$expr", new Document(
"$ne", Arrays.asList(
new Document("$size", "$pacientes"),
new Document("$size",
new Document("$setDifference", Arrays.asList("$pacientes", Collections.emptyList()))
)
)
)
);
MongoCursor<Document> cursor = collection.find(query).iterator();
Which serializes as:
{
"$expr": {
"$ne": [
{ "$size": "$pacientes" },
{ "$size": { "$setDifference": [ "$pacientes", [] ] } }
]
}
}
Here it is actually the $setDifference which is doing the comparison and returning only unique elements. The $size is returning the length, both of the original document array content and the newly reduced "set". And of course where these are "not equal" ( the $ne ) the condition would be true meaning that a duplicate was found in the document.
The $expr operator evaluates the expression to a boolean true/false value, and that value decides whether the document is considered a match for the condition.
Earlier Version $where clause
Basically $where is a JavaScript expression that is evaluated on the server:
String whereClause = "this.pacientes.length != Object.keys(this.pacientes.reduce((o,e) => Object.assign(o, { [e.valueOf()]: null}), {})).length";
Document query = new Document("$where", whereClause);
MongoCursor<Document> cursor = collection.find(query).iterator();
For this to work, JavaScript evaluation must not have been explicitly disabled on the server (it is enabled by default), and the approach is not as efficient as using $expr with the native aggregation operators. But JavaScript expressions can be evaluated in the same way using $where, and the argument in Java code is basically sent as a string.
In the expression the .length is a property of all JavaScript arrays, so you have the original document content and the comparison to the "set". The Array.reduce() uses each array element as a "key" in a resulting object, from which the Object.keys() will then return those "keys" as a new array.
Since JavaScript objects work like a Map, only unique keys are allowed and this is a way to get that "set" result. And of course the same != comparison will return true when the removal of duplicate entries resulted in a change of length.
In either case of $expr or $where these are computed conditions which cannot use an index where present on the collection. As such it is generally recommended that additional criteria which use regular equality or range based query expressions which can indeed utilize an index be used alongside these expressions. Such additional criteria in the predicate would improve query performance greatly where an index is in place.

Related

Optimal way to maintain and quickly look up which objects contain a specific token (string) without maintaining two hash maps?

My system takes in a documentID and list of strings that represent tokens associated with the document. The primary metric I am trying to optimize for is returning a list of all the document ids that are associated with a given token. I am pretty confident I should start with something like HashMap<String, HashSet<Integer>> tokenLookupMap where the string is the token and the hash set is the set of documents IDs that contain that token. The tricky part is how to easily deal with documents being overwritten with new token lists (inserts completely overwrite the existing token lists with the new input). For example if my input looks like:
insertDocument(docId: 1, tokens: {token1, token2, token3} )
// query on token1 returns docIDs:[1]
insertDocument(docId: 2, tokens: {token1, token2, token3} )
// query on token1 returns docIDs:[1, 2]
insertDocument(docId: 1, tokens: {token4, token5, token6} )
// query on token1 returns docIDs:[2]
// query on token4 returns docIDs:[1]
I need to be able to efficiently update all the values in tokenLookupMap to reflect any tokens that are no longer present in the overridden document. Currently I'm maintaining a second hash map HashMap<Integer, HashSet<String>> documentLookupMap; to provide the "opposite" lookup perspective such that I can quickly look up what tokens are associated with a given document id and remove the old ones before an overwrite. This definitely allows me to optimize for lookups by token (insert time doesn't matter as much as queries) but it feels silly or even dangerous to have two structs that sort of represent the same thing and share a lot of overlapping space. Aside from the space increase and slight time increase on insert I technically run the risk of the structures getting out of sync.
Are there more optimal ways I could go about this? I could always put the two hash maps in a separate class and lock it down with limited public methods but are there ways to change the structure and perhaps avoid maintaining two structures altogether? Here's the most relevant code:
private HashMap<Integer, HashSet<String>> documentLookupMap;
private HashMap<String, HashSet<Integer>> tokenLookupMap;
private void insertDocument(int docId, HashSet<String> tokens ) {
if( documentLookupMap.containsKey(docId)) {
// if we've already indexed a doc with the same id we need to clean up first
var oldTokens = documentLookupMap.get(docId);
for (String token : oldTokens) {
tokenLookupMap.get(token).remove(docId);
// not sure if this is beneficial big picture on large data sets / space constraints
if(tokenLookupMap.get(token).isEmpty()) {
tokenLookupMap.remove(token);
}
}
}
documentLookupMap.put(docId, tokens);
for (String token : tokens) {
tokenLookupMap.computeIfAbsent(token,t->new HashSet<Integer>()).add(docId);
}
}
private Set<Integer> getDocsForToken(String token) {
return tokenLookupMap.containsKey(token) ? tokenLookupMap.get(token) : new HashSet<Integer>();
}
This needs to scale efficiently to tens of thousands of documents / tokens
Thanks in advance for any insights!
One thing that comes to my mind would be to maintain the Document-Token relation in separate classes and maintain 2 maps only for lookup:
class Document {
Integer docId;
//using arrays saves some space and tokens don't seem to change that often
Token[] tokens;
}
class Token {
String token;
Set<Document> documents;
}
Map<Integer, Document> docs = new HashMap<>();
Map<String, Token> tokens = new WeakHashMap<>();
When inserting a new document you basically clear the set of tokens and rebuild it:
private void insertDocument(int docId, Set<String> newTokenStrings) {
    Document doc = docs.computeIfAbsent(docId, ...);
    //clear the old tokens
    for( Token old : doc.tokens ) {
        old.documents.remove(doc);
    }
    //add the new tokens (the parameter is renamed so it no longer shadows the "tokens" map)
    Set<Token> newTokens = new HashSet<>();
    for( String t: newTokenStrings ) {
        Token newToken = tokens.computeIfAbsent(t, ...);
        newToken.documents.add(doc);
        newTokens.add(newToken);
    }
    doc.tokens = newTokens.toArray(new Token[0]);
}
Of course this could be optimized to ignore tokens that aren't changed.
Note the use of WeakHashMap for tokens: since tokens could be abandoned at some point, they should not use up any more memory. WeakHashMap would allow the garbage collector to remove those that aren't reachable by anyone else, e.g. those that aren't listed in any document.
Of course it could take some time until gc kicks in and in the meantime token lookup could return tokens that aren't used anymore. You'd either need to filter those or remove the tokens from the token map manually if they don't have document references anymore.
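The manual-removal alternative, combined with the question's idea of locking both maps away behind a small class, could be sketched as follows. This is only an illustration under assumptions: the class name TokenIndex and the field names are hypothetical, it uses plain HashMaps rather than the Document/Token classes above, and it is not thread-safe.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical wrapper; both maps are private so they can only change together.
final class TokenIndex {
    private final Map<Integer, Set<String>> tokensByDoc = new HashMap<>();
    private final Map<String, Set<Integer>> docsByToken = new HashMap<>();

    // Replaces the token list of docId and removes stale reverse entries.
    void insertDocument(int docId, Set<String> tokens) {
        // Defensive copy so later changes to the caller's set cannot desync the maps.
        Set<String> oldTokens = tokensByDoc.put(docId, new HashSet<>(tokens));
        if (oldTokens != null) {
            for (String token : oldTokens) {
                Set<Integer> docs = docsByToken.get(token);
                if (docs != null) {
                    docs.remove(docId);
                    if (docs.isEmpty()) {
                        docsByToken.remove(token); // drop unused tokens right away
                    }
                }
            }
        }
        for (String token : tokens) {
            docsByToken.computeIfAbsent(token, t -> new HashSet<>()).add(docId);
        }
    }

    // Read-only view, so callers cannot modify the index from outside.
    Set<Integer> getDocsForToken(String token) {
        Set<Integer> docs = docsByToken.get(token);
        return docs == null ? Collections.<Integer>emptySet() : Collections.unmodifiableSet(docs);
    }
}
```

Because insertDocument is the only method that writes to either map, the two structures cannot drift apart as long as all updates go through it.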

Check if all object entities are equal using Java Streams [duplicate]

I am new to Java 8. I have a list of custom objects of type A, where A is like below:
class A {
int id;
String name;
}
I would like to determine if all the objects in that list have same name. I can do it by iterating over the list and capturing previous and current value of names. In that context, I found How to count number of custom objects in list which have same value for one of its attribute. But is there any better way to do the same in java 8 using stream?
You can map from A --> String , apply the distinct intermediate operation, utilise limit(2) to enable optimisation where possible and then check if count is less than or equal to 1 in which case all objects have the same name and if not then they do not all have the same name.
boolean result = myList.stream()
.map(A::getName)
.distinct()
.limit(2)
.count() <= 1;
With the example shown above, we leverage the limit(2) operation so that we stop as soon as we find two distinct object names.
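A self-contained sketch of that pipeline is shown below. The wrapper class SameNameCheck and the getter on A are assumptions added for illustration; the question's class A has no getter.

```java
import java.util.List;

final class SameNameCheck {
    // Stand-in for the question's class A, with the getter the stream needs.
    static final class A {
        final int id;
        final String name;

        A(int id, String name) {
            this.id = id;
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    // True when the list holds at most one distinct name.
    static boolean allSameName(List<A> list) {
        return list.stream()
                .map(A::getName)
                .distinct()
                .limit(2)      // a second distinct name is enough to decide
                .count() <= 1; // 0 distinct names (empty list) also counts as "same"
    }
}
```

Note that with the <= 1 comparison an empty list is treated as "all the same"; whether that is the desired behaviour depends on the caller.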
One way is to get the name of the first element of the list and call allMatch to check every element against it.
String firstName = yourListOfAs.get(0).name;
boolean allSameName = yourListOfAs.stream().allMatch(x -> x.name.equals(firstName));
Another way is to count the distinct names:
boolean result = myList.stream().map(A::getName).distinct().count() == 1;
Of course, you need to add a getter for the name field.
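One caveat with the allMatch variant is that get(0) throws on an empty list. A minimal sketch that guards against this is shown below; the class name AllMatchCheck is hypothetical, and treating an empty list as "all the same" is an assumption you may want to change.

```java
import java.util.List;
import java.util.Objects;

final class AllMatchCheck {
    // Stand-in for the question's class A.
    static final class A {
        final int id;
        final String name;

        A(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    static boolean allSameName(List<A> list) {
        if (list.isEmpty()) {
            return true; // assumption: no elements means nothing differs
        }
        String firstName = list.get(0).name;
        // Objects.equals also tolerates null names; allMatch stops at the first mismatch.
        return list.stream().allMatch(a -> Objects.equals(firstName, a.name));
    }
}
```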
One more option is partitioning. Partitioning is a special kind of grouping in which the resulting map contains two groups: one for true and one for false.
This way you get both the matching and the non-matching elements:
String firstName = yourListOfAs.get(0).name;
Map<Boolean, List<A>> partitioned = yourListOfAs.stream().collect(Collectors.partitioningBy(a -> firstName.equals(a.name)));
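If only the number of matching and non-matching elements is needed, partitioningBy can be combined with a counting collector. The sketch below is illustrative; the class name PartitionByName and the method countByMatch are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class PartitionByName {
    // Stand-in for the question's class A.
    static final class A {
        final int id;
        final String name;

        A(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Key true -> number of elements whose name equals expected; key false -> the rest.
    static Map<Boolean, Long> countByMatch(List<A> list, String expected) {
        return list.stream().collect(
                Collectors.partitioningBy(a -> expected.equals(a.name), Collectors.counting()));
    }
}
```

All elements share the expected name exactly when the count under the false key is 0.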
With Java 9 you can use takeWhile. takeWhile takes elements until the predicate returns false, which is similar to a break statement in a while loop:
String firstName = yourListOfAs.get(0).name;
List<A> filterList = yourListOfAs.stream()
    .takeWhile(a -> firstName.equals(a.name))
    .collect(Collectors.toList());
if (filterList.size() == yourListOfAs.size()) {
    // all objects have the same name
}
Or use groupingBy then check entrySet size.
boolean b = list.stream()
.collect(Collectors.groupingBy(A::getName,
Collectors.toList())).entrySet().size() == 1;

Hibernate-search search by list of numbers

I am working in a Hibernate-search, Java application with an entity which has a numeric field indexed:
@Field
@NumericField
private Long orgId;
I want to get the list of entities whose value for this property matches any of a list of Long values. I used "simpleQueryString" because it supports "OR" logic through the | character when there are several target values. I have something like this:
queryBuilder.simpleQueryString().onField("orgId").matching("1|3|8").createQuery()
After running my application I get:
The specified query '+(orgId:1 orgId:3 orgId:8)' contains a string based sub query which targets the numeric encoded field(s) 'orgId'. Check your query or try limiting the targeted entities.
So, can somebody tell me what is wrong with this code? Is there another way to do what I need?
=================================
UPDATE 1:
yrodiere's answer solves the issue, but I have another question. I also want to check whether entities match other fields. I know I can use BooleanJunction, but then I need to mix "must" and "should" usages, right? For example:
BooleanJunction<?> bool = queryBuilder.bool();
for (Integer orgId: orgIds) {
bool.should( queryBuilder.keyword().onField("orgId").matching(orgId).createQuery() );
}
bool.must(queryBuilder.keyword().onField("name").matching("anyName").createQuery() );
With this, I am requiring that the entities match a "name" and also match one of the given orgIds. Am I right?
As the error message says:
The specified query [...] contains a string based sub query which targets the numeric encoded field(s) 'orgId'.
simpleQueryString can only be used to target text fields. Numeric fields are not supported.
If your string was generated programmatically, and you have a list of integers, this is what you'll need to do:
List<Integer> orgIds = Arrays.asList(1, 3, 8);
BooleanJunction<?> bool = queryBuilder.bool();
for (Integer orgId: orgIds) {
bool.should( queryBuilder.keyword().onField("orgId").matching(orgId).createQuery() );
}
Query query = bool.createQuery(); // org.apache.lucene.search.Query
query will match documents whose orgId field contains 1, 3 OR 8.
See https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#_combining_queries
EDIT: If you need additional clauses, I'd recommend not mixing must and should in the same boolean junction, but nesting boolean junctions instead.
For example:
BooleanJunction<?> boolForOrgIds = queryBuilder.bool();
for (Integer orgId: orgIds) {
boolForOrgIds.should(queryBuilder.keyword().onField("orgId").matching(orgId).createQuery());
}
BooleanJunction<?> boolForWholeQuery = queryBuilder.bool();
boolForWholeQuery.must(boolForOrgIds.createQuery());
boolForWholeQuery.must(queryBuilder.keyword().onField("name").matching("anyName").createQuery());
// and add as many "must" as you need
Query query = boolForWholeQuery.createQuery(); // org.apache.lucene.search.Query
Technically you can mix 'must' and 'should', but the effect won't be what you expect: 'should' clauses will become optional and will only raise the score of documents when they match. So, not what you need here.

How to search a specific item in a string collection in java1.8 using Lambda?

I have a collection of items as under
List<String> lstRollNumber = new ArrayList<String>();
lstRollNumber.add("1");
lstRollNumber.add("2");
lstRollNumber.add("3");
lstRollNumber.add("4");
Now I want to search a particular RollNumber in that collection. Say
String rollNumberToSearch = "3";
I can easily do it by looping through the collection and checking every item; if there is a match, I can break out of the loop and return true from the function.
But I want to use the Lambda expression for doing this.
In C# we use(among other options),
var flag = lstRollNumber.Exists(x => x == rollNumberToSearch);
How to do the same in Java 1.8 ?
I tried with
String rollNumberToSearch = "3";
Stream<String> filterRecs = lstRollNumber.stream().filter(rn -> rn.equals(rollNumberToSearch));
But I know it is wrong?
Please guide.
Your mistake is that you are using the intermediate stream operation filter without calling a terminal stream operation. Read about the types of stream operations in the official documentation. If you still want to use filter (for learning purposes), you can solve your task with findAny() or anyMatch():
boolean flag = lstRollNumber.stream().filter(rn -> rn.equals(rollNumberToSearch))
    .findAny().isPresent();
Or
boolean flag = lstRollNumber.stream().filter(rn -> rn.equals(rollNumberToSearch))
    .anyMatch(rn -> true);
Or don't use filter at all (as @marstran suggests):
boolean flag = lstRollNumber.stream().anyMatch(rn -> rn.equals(rollNumberToSearch));
Also note that a method reference can be used here:
boolean flag = lstRollNumber.stream().anyMatch(rollNumberToSearch::equals);
However if you want to use this not for learning, but in production code, it's much easier and faster to use good old Collection.contains:
boolean flag = lstRollNumber.contains("3");
The contains method can be optimized according to the collection type. For example, in HashSet it would be just hash lookup which is way faster than .stream().anyMatch(...) solution. Even for ArrayList calling contains would be faster.
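To make the comparison concrete, here is a small sketch that performs the same lookup both ways. The class and method names (RollNumberLookup, containsViaStream, containsViaSet) are hypothetical. It only shows that the results agree; it does not measure speed.

```java
import java.util.List;
import java.util.Set;

final class RollNumberLookup {
    private RollNumberLookup() {}

    // Stream version: checks elements one by one until a match is found.
    static boolean containsViaStream(List<String> rollNumbers, String wanted) {
        return rollNumbers.stream().anyMatch(wanted::equals);
    }

    // Collection version: for a HashSet this is a single hash lookup.
    static boolean containsViaSet(Set<String> rollNumbers, String wanted) {
        return rollNumbers.contains(wanted);
    }
}
```

If the same collection is searched many times, copying the list into a HashSet once (new HashSet<>(lstRollNumber)) and calling contains on it is usually the simpler choice.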
Use anyMatch. It returns true if any element in the stream matches the predicate:
String rollNumberToSearch = "3";
boolean flag = lstRollNumber.stream().anyMatch(rn -> rn.equals(rollNumberToSearch));

How to get just the desired field from an array of sub documents in Mongodb using Java

I have just started using MongoDB. Below is my data structure.
It has an array of skillIDs, each of which has an array of activeCampaigns, and each activeCampaign has an array of callsByTimeZone.
What I am looking for in SQL terms is :
Select activeCampaigns.callsByTimeZone.label,
activeCampaigns.callsByTimeZone.loaded
from X
where skillID=50296 and activeCampaigns.campaign_id= 11371940
and activeCampaigns.callsByTimeZone='PT'
The output what I am expecting is to get
{"label":"PT", "loaded":1 }
The Command I used is
db.cd.find({ "skillID" : 50296 , "activeCampaigns.campaignId" : 11371940,
"activeCampaigns.callsByTimeZone.label" :"PT" },
{ "activeCampaigns.callsByTimeZone.label" : 1 ,
"activeCampaigns.callsByTimeZone.loaded" : 1 ,"_id" : 0})
The output what I am getting is everything under activeCampaigns.callsByTimeZone while I am expecting just for PT
DataStructure :
{
"skillID":50296,
"clientID":7419,
"voiceID":1,
"otherResults":7,
"activeCampaigns":
[{
"campaignId":11371940,
"campaignFileName":"Aaron.name.121.csv",
"loaded":259,
"callsByTimeZone":
[{
"label":"CT",
"loaded":6
},
{
"label":"ET",
"loaded":241
},
{
"label":"PT",
"loaded":1
}]
}]
}
I tried the same in Java.
QueryBuilder query = QueryBuilder.start().and("skillID").is(50296)
.and("activeCampaigns.campaignId").is(11371940)
.and("activeCampaigns.callsByTimeZone.label").is("PT");
BasicDBObject fields = new BasicDBObject("activeCampaigns.callsByTimeZone.label",1)
.append("activeCampaigns.callsByTimeZone.loaded",1).append("_id", 0);
DBCursor cursor = coll.find(query.get(), fields);
String campaignJson = null;
while(cursor.hasNext()) {
DBObject campaignDBO = cursor.next();
campaignJson = campaignDBO.toString();
System.out.println(campaignJson);
}
The value obtained is everything under the callsByTimeZone array. I am currently parsing the JSON obtained and extracting only the PT values. Is there a way to query just the PT fields inside activeCampaigns.callsByTimeZone?
Thanks in advance. Sorry if this question has already been raised in the forum; I have searched a lot and failed to find a proper solution.
There are several ways of doing it, but you should not be using String manipulation (i.e. indexOf), the performance could be horrible.
The results in the cursor are nested Maps, representing the document in the database - a Map is a good Java-representation of key-value pairs. So you can navigate to the place you need in the document, instead of having to parse it as a String. I've tested the following and it works on your test data, but you might need to tweak it if your data is not all exactly like the example:
while (cursor.hasNext()) {
DBObject campaignDBO = cursor.next();
List callsByTimezone = (List) ((DBObject) ((List) campaignDBO.get("activeCampaigns")).get(0)).get("callsByTimeZone");
DBObject valuesThatIWant;
for (Object o : callsByTimezone) {
DBObject call = (DBObject) o;
if (call.get("label").equals("PT")) {
valuesThatIWant = call;
}
}
}
Depending upon your data, you might want to add protection against null values as well.
The thing you were looking for ({"label":"PT", "loaded":1 }) is in the variable valuesThatIWant. Note that this, too, is a DBObject, i.e. a Map, so if you want to see what's inside it you need to use get:
valuesThatIWant.get("label"); // will return "PT"
valuesThatIWant.get("loaded"); // will return 1
Because DBObject is effectively a Map of String to Object (i.e. Map<String, Object>) you need to cast the values that come out of it (hence the ugliness in the first bit of code in my answer) - with numbers, it will depend on how the data was loaded into the database, it might come out as an int or as a double:
String theValueOfLabel = (String) valuesThatIWant.get("label"); // will return "PT"
double theValueOfLoaded = ((Number) valuesThatIWant.get("loaded")).doubleValue(); // will return 1.0, whether it was stored as an int or a double
I'd also like to point out the following from my answer:
((List) campaignDBO.get("activeCampaigns")).get(0)
This assumes that "activeCampaigns" is a) a list and in this case b) only has one entry (I'm doing get(0)).
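If a document can hold more than one campaign, or some fields may be missing, the navigation can be written without the get(0) assumption. The sketch below uses plain Map and List values as stand-ins for the driver's objects (an assumption made so the example runs without the MongoDB driver); the class name TimeZoneCallFinder and the method findByLabel are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TimeZoneCallFinder {
    private TimeZoneCallFinder() {}

    // Collects every callsByTimeZone entry with the given label, across all campaigns.
    static List<Map<?, ?>> findByLabel(Map<String, Object> document, String label) {
        List<Map<?, ?>> matches = new ArrayList<>();
        Object campaigns = document.get("activeCampaigns");
        if (!(campaigns instanceof List)) {
            return matches; // field missing or not an array
        }
        for (Object campaign : (List<?>) campaigns) {
            if (!(campaign instanceof Map)) {
                continue;
            }
            Object calls = ((Map<?, ?>) campaign).get("callsByTimeZone");
            if (!(calls instanceof List)) {
                continue;
            }
            for (Object call : (List<?>) calls) {
                // label.equals(...) is false for a missing (null) label
                if (call instanceof Map && label.equals(((Map<?, ?>) call).get("label"))) {
                    matches.add((Map<?, ?>) call);
                }
            }
        }
        return matches;
    }
}
```

The instanceof checks double as null checks, so a document without activeCampaigns simply yields an empty result instead of an exception.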
You will also have noticed that the fields value you've set does not narrow the result down to the "PT" entry. A projection selects which fields are returned, but it does not filter the elements of an array, so label and loaded come back for every entry in callsByTimeZone. Your code:
BasicDBObject fields = new BasicDBObject("activeCampaigns.callsByTimeZone.label",1)
    .append("activeCampaigns.callsByTimeZone.loaded",1)
    .append("_id", 0);
therefore still returns all three time-zone entries. Since label and loaded are the only fields those entries have, the callsByTimeZone array looks the same as it would with:
BasicDBObject fields = new BasicDBObject("activeCampaigns.callsByTimeZone", 1).append("_id", 0);
I think some of the points that will help you to work with Java & MongoDB are:
When you query the database without a projection, it returns the whole document that matches your query, i.e. everything from "skillID" downwards. A projection limits the fields that come back, but not the array elements. See the documentation for more detail.
To navigate the results, you need to know that DBObjects are returned, and that these are effectively a Map<String, Object> in Java. You can use get to navigate to the correct node, but you will need to cast the values into the correct shape.
Replacing the while loop in your Java code with the one below seems to give "PT" as output.
while(cursor.hasNext()) {
    DBObject campaignDBO = cursor.next();
    campaignJson = campaignDBO.get("activeCampaigns").toString();
    int labelInt = campaignJson.indexOf("PT", -1);
    String label = campaignJson.substring(labelInt, labelInt+2);
    System.out.println(label);
}
