Find and delete duplicates in a Lotus Notes database - java

I am very new to Lotus Notes. Recently my teammates ran into a problem with duplicates in Lotus Notes, as shown below in Case A and Case B.
So we bought an app named ScanEZ. Using this tool we can remove either the first occurrence or the second occurrence. In Case A and Case B the second items are considered redundant because they do not have children, so we can remove all the second items, as shown below, and thus remove the duplicates.
But in Case C the order is reversed: the child item comes first and the parent item comes second, so I am unable to use the ScanEZ app.
Is there a better way, software or script to accomplish my task? As I am new to this field I have no idea. Kindly help me.
Thanks in advance.

Probably the easiest way to approach this would be to force the view to always display documents with children first. That way the tool you have purchased will behave consistently for you. You would do this by adding a hidden sorted column to the right of the column that you have circled. The formula in this column would be @DocChildren, and the sort options for the column would be set to 'Descending'. (Note that if you are uncomfortable making changes in this view, you can make a copy of it, make your changes in the copy, and run ScanEZ against the copy as well. You can also do all of this in a local replica of the database, and only replicate it back to the server when you are satisfied that you have the right results.)
The other way would be to write your own code in LotusScript or Java, using the Notes classes. There are many different ways that you could write that code,

I agree with Richard's answer. If you want more details on how to go through the document collection, you could isolate the documents into a view that shows only the duplicates. Then write an agent that looks at the UNID of the document, the date modified and other such data elements to ensure that you are getting the last updated document. I would add a field to the document, as in FLAG='keep'. Then delete the documents that don't have your flag with a second agent. If you take this approach you can often use the same agents in other databases.
Since you are new to Notes keep in mind that Notes is a document database. There are several different conflicts like save conflicts or replication conflicts. Also you need to look at database settings on how duplicates can be handled. I would read up on these topics just so you can explain it to your co-workers/project manager.
Eventually in your heavily travelled databases you might be able to automate this process after you work down the source of the duplicates.
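The flagging step described above can be sketched in plain Java, without the Notes classes. This is only an illustration under assumptions: each document is assumed to have already been reduced to a UNID, a key that identifies its set of duplicates, a child count and a last-modified time. `KeepFlagger`, `DocInfo`, `duplicateKey` and `chooseKeepers` are hypothetical names, and the rule "prefer a document with children, then the most recently modified one" is an assumed combination of the question's requirement and the date check suggested above.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class KeepFlagger {

    /** Hypothetical summary of one document; this is not a Notes API class. */
    static final class DocInfo {
        final String unid;         // universal ID of the document
        final String duplicateKey; // e.g. subject + date; equal keys mean "same set"
        final int childCount;      // number of response documents
        final long lastModified;   // epoch milliseconds

        DocInfo(String unid, String duplicateKey, int childCount, long lastModified) {
            this.unid = unid;
            this.duplicateKey = duplicateKey;
            this.childCount = childCount;
            this.lastModified = lastModified;
        }
    }

    // Assumed preference: having children wins, ties go to the latest modification.
    private static final Comparator<DocInfo> PREFERENCE =
            Comparator.comparing((DocInfo d) -> d.childCount > 0)
                      .thenComparingLong(d -> d.lastModified);

    /** Returns the UNIDs that would receive FLAG = "keep", one per duplicate key. */
    static Set<String> chooseKeepers(List<DocInfo> docs) {
        Map<String, DocInfo> best = new HashMap<>();
        for (DocInfo doc : docs) {
            best.merge(doc.duplicateKey, doc, (current, candidate) ->
                    PREFERENCE.compare(candidate, current) > 0 ? candidate : current);
        }
        Set<String> keep = new HashSet<>();
        for (DocInfo doc : best.values()) {
            keep.add(doc.unid);
        }
        return keep;
    }
}
```

A second pass would then delete every document whose UNID is not in the returned set, which mirrors the two-agent split described above.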

These are clearly not duplicates.
The definition of duplicate is that they are identical and so it does not matter which one is kept and which one is removed. To you, the fact that one has children makes it more important, which means that they are not pure duplicates.
What you have not stated is what you want to do if multiple documents with similar dates/subjects have children (a case D if you will).
To me this appears as three separate problems.
The first problem is to sort out the cases where more than one document in a set has children.
Then sort out the cases where only one document in a set has children.
Then sort out the cases where none of the documents in a set has children.
The approach in each case will be different. The article from Ytira only really covers the last of these cases.
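As a rough illustration of that three-way split, here is a small Java sketch that classifies one set of look-alike documents by how many of them have children. It assumes the child counts have already been read from the database; `DuplicateSetClassifier`, `Kind` and `classify` are hypothetical names, not part of any Notes API.

```java
import java.util.List;

final class DuplicateSetClassifier {

    /** The three situations described above. */
    enum Kind { SEVERAL_WITH_CHILDREN, ONE_WITH_CHILDREN, NONE_WITH_CHILDREN }

    /** childCounts holds the number of children of each document in one set. */
    static Kind classify(List<Integer> childCounts) {
        long parents = childCounts.stream().filter(count -> count > 0).count();
        if (parents > 1) {
            return Kind.SEVERAL_WITH_CHILDREN; // needs a decision about which children to keep
        }
        if (parents == 1) {
            return Kind.ONE_WITH_CHILDREN;     // keep the single parent
        }
        return Kind.NONE_WITH_CHILDREN;        // true duplicates, any one can be kept
    }
}
```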

Related

Choosing databasetype for a decentralized calendar project

I am developing a decentralised calendar system. It should save the data on each device and synchronise when both devices have an internet connection. My first idea was to just use a relational database and try to synchronise the data after connecting. But the theory says something else. Brewer's CAP theorem describes the theory behind it, but I am not sure whether this theorem is outdated. If I apply this theorem I need an "AP [Availability/Partition Tolerance]" system: "A" because I need the data for my calendar at any given time, and "P" because it can happen that there is no connection between the devices and the data can't be synchronised. The example databases are CouchDB, Riak or Cassandra. I have only worked with relational databases and don't know how to go on now. Is it that bad to use a relational database for my project?
This is for my bachelor thesis. I just wanted to start using Postgres, but then I found this theorem...
The whole project is based on Java.
I think the CAP theorem isn't really helpful in your scenario. Distributed systems that deal with partitions need to decide what to do when one part wants to make a modification to the data but can't reach the other part. One solution is to make the write wait - this gives up "availability" because of the "partition", which is one of the options presented by the CAP theorem. But there are more useful options. The most useful (highly-available) option is to allow both parts to be written independently, and reconcile the conflicts when they can connect again. The question is how to do that, and different distributed systems choose different approaches.
Some systems, like Cassandra or Amazon's DynamoDB, use "last writer wins" - when we see a conflict between two conflicting writes, the last one (according to some synchronized clock) wins. For this approach to make sense you need to be very careful about how you model your data (e.g., watch out for cases where the conflict resolution results in an invalid mixture of two states).
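To make "last writer wins" concrete, here is a minimal Java sketch of the idea, not of how Cassandra or DynamoDB implement it. `LastWriterWins` and `Versioned` are hypothetical names, and breaking timestamp ties by comparing node ids is an assumption added so that both sides always pick the same winner.

```java
final class LastWriterWins {

    /** A value together with the time and the node of the write that produced it. */
    static final class Versioned<T> {
        final T value;
        final long timestamp; // assumed to come from reasonably synchronized clocks
        final String nodeId;  // assumed tie-breaker for equal timestamps

        Versioned(T value, long timestamp, String nodeId) {
            this.value = value;
            this.timestamp = timestamp;
            this.nodeId = nodeId;
        }
    }

    /** Returns the version that survives; the other write is silently dropped. */
    static <T> Versioned<T> merge(Versioned<T> a, Versioned<T> b) {
        if (a.timestamp != b.timestamp) {
            return a.timestamp > b.timestamp ? a : b;
        }
        return a.nodeId.compareTo(b.nodeId) >= 0 ? a : b;
    }
}
```

Because the losing write disappears completely, this only works well when each stored value can safely be replaced as a whole, for example one meeting per value rather than one whole calendar per value.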
In other systems (and also in Cassandra and DynamoDB - in their "collection" types) writes can still happen independently on different nodes, but there is more sophisticated conflict resolution. A good example is Cassandra's "list": one can send an update saying "add item X to the list", and another update saying "add item Y to the list". If these updates happen on different partitions, the conflict is later resolved by adding both X and Y to the list. A data structure such as this list, which allows the content to be modified independently in certain ways on two nodes and then automatically reconciled in a sensible way, is known as a Conflict-free Replicated Data Type (CRDT).
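One of the simplest CRDTs is a grow-only set, where the only operation is "add" and reconciliation is a set union. The Java sketch below shows that structure rather than Cassandra's list type; `GrowOnlySet` is a hypothetical name.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class GrowOnlySet<T> {
    private final Set<T> items = new HashSet<>();

    /** Local update; can be applied on any replica without coordination. */
    void add(T item) {
        items.add(item);
    }

    /** Reconciliation is a union, so the merge order does not matter. */
    void merge(GrowOnlySet<T> other) {
        items.addAll(other.items);
    }

    /** Read-only view of the current content. */
    Set<T> view() {
        return Collections.unmodifiableSet(items);
    }
}
```

If one replica adds X and another adds Y while disconnected, merging in either direction leaves both replicas with X and Y. Supporting removals needs a richer structure than this sketch.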
Finally, another approach was used in Amazon's Dynamo paper (not to be confused with their current DynamoDB service!), known as "vector clocks": When you want to write to an object - e.g., a shopping cart - you first read the current state of the object and get with it a "vector clock", which you can think of as the "version" of the data you got. You then make the modification (e.g., add an item to the shopping cart), and write back the new version while saying what the old version you started with was. If two of these modifications happen in parallel on different partitions, we later need to reconcile the two updates. The vector clocks allow the system to determine whether one modification is "newer" than the other (in which case there is no conflict), or whether they really do conflict. And when they do, application-specific logic is used to reconcile the conflict. In the shopping cart example, if we see that the conflict is that in one partition item A was added to the shopping cart and in the other partition item B was added to the shopping cart, the straightforward resolution is to just add both items A and B to the shopping cart.
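The "newer or really conflicting" check can be sketched in a few lines of Java. This is a simplified illustration, not the Dynamo implementation: a clock is assumed to be a map from node id to a counter, a missing entry counts as zero, and `VectorClocks`, `Order` and `compare` are hypothetical names.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class VectorClocks {

    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    /** Compares clock a with clock b; CONCURRENT means a real conflict. */
    static Order compare(Map<String, Long> a, Map<String, Long> b) {
        Set<String> nodes = new HashSet<>(a.keySet());
        nodes.addAll(b.keySet());

        boolean aAhead = false;
        boolean bAhead = false;
        for (String node : nodes) {
            long av = a.getOrDefault(node, 0L);
            long bv = b.getOrDefault(node, 0L);
            if (av > bv) aAhead = true;
            if (bv > av) bAhead = true;
        }

        if (aAhead && bAhead) return Order.CONCURRENT; // each side saw a write the other missed
        if (aAhead) return Order.AFTER;                // a already includes everything in b
        if (bAhead) return Order.BEFORE;               // b already includes everything in a
        return Order.EQUAL;
    }
}
```

Only the CONCURRENT result needs the application-specific reconciliation described above; in the other cases the newer version can simply replace the older one.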
You should probably pick one of these approaches. Just saying "the CAP theorem doesn't let me do this" is usually not an option ;-) In fact, in some ways, the problem you're facing is different from that of the systems I mentioned. In those systems, the common case is that every node is always connected (no partition), with very low latency, and they want this common case to be fast. In your case, you can probably assume the opposite: the two parts are usually not connected, or if they are connected there is high latency, so conflict resolution becomes the norm rather than the exception. So you need to decide how to do this conflict resolution - what happens if one adds a meeting on one device and a different meeting on the other device (most likely, just keep both as two meetings...), how do you know that one device modified a pre-existing meeting and didn't add a second meeting (vector clocks? unique meeting ids? etc.) so that the conflict resolution ends up fixing the existing meeting instead of adding a second one? And so on. Once you do that, where you store the data on both partitions (probably completely different database implementations on the client and the server) and which protocol you send the updates over become implementation details.
There's another issue you'll need to consider. When do we do these reconciliations? In many systems like I listed above, the reconciliation happens on read: If the client wants to read data and we suddenly see two conflicting versions on two reachable nodes, we reconcile. In your calendar application, you need a slightly different approach: It is possible that the client will only ever try to read (use) the calendar when not connected. You need to use the rare opportunities when he is connected to reconcile all the differences. Moreover, you may need to "push" changes - e.g., if the data on the server changed, the client may need to be told, "hey, I have some changed data, come and reconcile", so the end-user will immediately see an announcement on a new meeting, for example, that was added remotely (e.g., perhaps by a different user sharing the same calendar). You'll need to figure out how you want to do this. Again, there is no magic solution like "use Cassandra".

Realm for Java/Android - preserving the state of query/result

In the project currently under development, we are integrating the Realm Database into the customer's app to improve responsiveness while working on a huge data set of ~20,000 records. For on-screen presentation, we are incorporating Realm's Android RecyclerView. The majority of the use cases are read operations, followed by the possibility of advanced search and/or filtering of the records.
Where the shoe pinches is that on some of our views only a subset of all the records of a given type is supposed to be displayed, and that subset is selected by the back-end. Using the information passed by the API, we perform the initial filtering and set up the view.
Now, using the aforementioned technologies, is there a readable and maintainable way to store either this pre-filtered subset or the query fetching it for further reference, so that the initial state of the view can always be restored once the SearchView and/or filters are cleared? Or is storing the API response and re-applying the conditions given through it the only way to do it? Applying any new conditions to the query seems to alter it for good, and the same goes for applying new queries to the results. Shouldn't there be a way to create a fresh result set based on an old one without disturbing the latter?
Edit: Our app being 'bilingual', both Java- and Kotlin-based solutions are welcomed, should they differ.
As we came to realise earlier this week, and just as @EpicPandaForce has mentioned in the comments, while the RealmQuery object cannot be "snapshotted" by assigning it to a spare variable before extending it, the same is not true for RealmResults objects. And so:
RealmResults<Obj> widerResults = realmInstance.where(Obj.class).in("id", idArray).findAll();
RealmResults<Obj> narrowerResults = widerResults.where().equalTo("flag", true).findAll();
Will provide two independent result sets. The wider one can be used as per the use case I highlighted - to treat it as a starting point for further subqueries. The changes to objects found in the sets themselves will still be reflected in both sets.
Providing an answer for all the lost souls out there, should they get stuck like we did.

AEM: After deleting user groups, rep:policy nodes remain intact

I'm quite stunned at what I have found while tinkering with AEM (I don't think it matters, but for accuracy I'm using 6.1) trying to automate my group permission creation. I have a group called aem-tools-readonly that has a specific set of permissions on it. No problem there. What surprises me is the following: if I delete said group, the rep:policy nodes that correspond to that group are not deleted. So if I re-create aem-tools-readonly, it picks up the same config for my group. I am wondering a couple of things.
Should I be concerned security wise of creating holes in my permission scheme if groups get deleted as I move along with my projects ?
Why aren't these rep:policy nodes getting deleted? Is there a valid reason ?
How can I easily delete all rep:policy nodes of for example my aem-tools-readonly group ?
Any information/thoughts are welcomed ...
Thanks
As far as I know it has always been this way.
This is how the ACL implementation works in CRX.
To fix that, prior to deleting a group you could clear all of its access entries - probably by deleting the matching entries lying under any rep:policy node.
There is no easy (automatic) way to do that, just code. It should be quite easy though to find any descendant of any rep:policy node that has your group name in it.

Automated vs Custom Lucene Scoring

I have started working on Lucene (v 4.10.2) Search Based Ranking/Scoring.
Consider the following Scenario: I am searching 'Mark' in my search box. Auto-complete result shows Top 5 people named 'Mark' (although there might be hundreds of Mark in the Lucene index files).
I go on Mark Zuckerberg's profile which is placed on 4th place in the beginning of the search. Say I have clicked his profile a lot of times. Now according to me, next time I search 'Mark', 'Mark Zuckerberg' should come at the top of the list.
Several questions come to mind (I don't even know whether I'm on the right track or not):
1) How to achieve this using the Lucene library ? (Automated or custom based scoring)
2) Can we change the scoring after any search?
3) Does the Lucene library store the scoring in the index files?
4) Can we store the scoring in the indexed files?
Please let me know if I'm on the right track or not.
This is what I would try, regardless of any performance and index maintainability issues for now.
I would add a multivalued string field for users that have hit the profile document at least once.
Every time a user (say "vipul") hits an auto-completed profile (say "Mark Zuckerberg") I would add the username to the special multivalued string field in the profile document.
When searching I would add a term in the special field with the current username as the value, boosting it, so it comes first in the searches.
Now, some performance. Since updating the full document only to update a single field could be quite expensive, I would try something with the SortedSetDocValuesField. I honestly haven't tried anything yet with this relatively new field. But if I understand well, it was designed for situations like this.

Lucene search results sort by custom order list (unique to each user)

I have authenticated users in my application who have access to a shared database of up to 500,000 items. Each of the users has their own public facing web site and needs the ability to prioritize the items on display (think upvote) on their own site.
Out of the 500,000 items they may only have up to 200 prioritized items; the order of the rest of the items is of less importance.
Each of the users will prioritize the items differently.
I initially asked a similar MySQL question here, "Mysql results sorted by list which is unique for each user", and got a good answer, but I believe a better option may be to opt for a non-SQL indexed solution.
Can this be done in Lucene? Is there another search technology which would be better for this?
ps. Google implements a similar type setup with their search results where you can prioritize and exclude your own search results if you are logged in.
Update: re-tagged with sphinx as I have been reading the documentation and I believe it may be able to do what I am looking for with "per-document attribute values" stored in memory - interested to hear any feedback on this from Sphinx gurus.
You'll definitely want to store the id of item in each document object when building your index. There's a few ways to do the next step, but an easy one would be take the prioritized items and add them to your search query, something like this for each special item:
"OR item_id:%d^X"
where X is the amount of boost you'd like to use. You'll probably need to empirically tweak this number to make sure that just being "upvoted" doesn't put it to the top of a list searching for something totally unrelated.
Doing it this way will at least prevent you from a lot of annoying postprocessing steps that would require you to iterate over the whole result set -- hopefully the proper sorting will be there right from querying the index.
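A small Java sketch of that query-building step, using the classic Lucene query parser syntax `field:value^boost`. It only produces the query string and does not call Lucene itself; `BoostedQueryBuilder` and `build` are hypothetical names, and `item_id` is assumed to be the indexed field that holds the item id.

```java
import java.util.List;

final class BoostedQueryBuilder {

    /**
     * Appends one boosted OR clause per prioritized item to the user's query.
     * baseQuery is assumed to already be valid query parser syntax.
     */
    static String build(String baseQuery, List<Integer> prioritizedIds, float boost) {
        StringBuilder query = new StringBuilder("(").append(baseQuery).append(")");
        for (int id : prioritizedIds) {
            query.append(" OR item_id:").append(id).append('^').append(boost);
        }
        return query.toString();
    }
}
```

Note that with a plain OR a prioritized item can match on its id alone, even if it does not match the base query. If that is not wanted, one option is to mark the base clause as required with a leading `+` and leave the id clauses optional, so they only influence the score.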
