Optimal diff between object lists in Java

I have a List of Java objects on my server which is sent to the client through some serialization mechanism. Once in a while the List gets updated on the server: some objects are added, some are deleted, and others just change their position in the List. I want to update the List on the client side as well, but send as little data as possible. In particular, I don't want to resend objects that are already available on the client.
Is there a library that will produce some sort of diff from the two lists, so that I can send only the difference and the new objects across the wire?
I have found several Java implementations of the Unix diff command, but that algorithm is impractical for order changes. For example, [A,B,C] -> [C,B,A] could be sent as just two position changes, [1->3] and [3->1], while diff would want to resend the whole A and C objects (as far as I understand).

I would do this by having the objects' public interface silently keep a log of changes wherever they are modified: each modification adds an object describing it to a list of modifications.
That way you have a minimal list of the exact changes to send to the other machine, rather than needing to infer them using fallible guesswork by comparing old versus new.
To create the object model so that it automatically records changes to itself, you will likely benefit from some code generation or AOP to avoid a lot of repetitive patterns. Methods that set the value of a property, or add/remove from lists, all need to call into a central log shared by the object hierarchy.
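As a rough illustration of the change-log idea, here is a minimal hand-written sketch for a single list. The names ListChange, RecordingList and drainChanges are hypothetical and not taken from any library; a real object model would probably generate this kind of code or use AOP, as suggested above.

```java
import java.util.ArrayList;
import java.util.List;

/** One recorded modification. All names in this sketch are hypothetical. */
final class ListChange<T> {
    enum Kind { ADD, REMOVE, MOVE }

    final Kind kind;
    final int from;   // source index for REMOVE and MOVE, -1 otherwise
    final int to;     // target index for ADD and MOVE, -1 otherwise
    final T payload;  // the object itself, set only for ADD

    ListChange(Kind kind, int from, int to, T payload) {
        this.kind = kind;
        this.from = from;
        this.to = to;
        this.payload = payload;
    }
}

/** A list wrapper whose mutators append to a change log as a side effect. */
final class RecordingList<T> {
    private final List<T> items = new ArrayList<>();
    private final List<ListChange<T>> log = new ArrayList<>();

    void add(int index, T item) {
        items.add(index, item);
        log.add(new ListChange<T>(ListChange.Kind.ADD, -1, index, item));
    }

    void remove(int index) {
        items.remove(index);
        log.add(new ListChange<T>(ListChange.Kind.REMOVE, index, -1, null));
    }

    /** Removes the element at 'from' and re-inserts it at 'to' (an index in the list after removal). */
    void move(int from, int to) {
        items.add(to, items.remove(from));
        log.add(new ListChange<T>(ListChange.Kind.MOVE, from, to, null));
    }

    /** Returns the changes recorded since the last call and clears the log. */
    List<ListChange<T>> drainChanges() {
        List<ListChange<T>> pending = new ArrayList<>(log);
        log.clear();
        return pending;
    }

    /** Returns a copy of the current contents. */
    List<T> snapshot() {
        return new ArrayList<>(items);
    }
}
```

Only ADD entries carry an object, so a move costs two integers on the wire. The client would replay the drained changes in order against its own copy of the list.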

You can "pretend" that your list is a string, and use Damerau–Levenshtein distance to find the minimum operations necessary to transform one to another, allowing insertion, deletion, substitution, and transposition (which is what your example suggests).
I'm not aware of a mature or stable implementation, and even if one exists, it is likely targeted at strings, so adapting it to a list of arbitrary value types would be a challenge. Implementing your own is also likely to be challenging, but it is certainly doable.
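To give an idea of the effort involved, here is a minimal sketch of the restricted variant of Damerau-Levenshtein (often called optimal string alignment) over generic lists, comparing elements with equals. The class name ListEditDistance is hypothetical. It computes only the distance; recovering the actual operations would require backtracking through the table, which is omitted here.

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical helper: optimal-string-alignment distance between two lists. */
final class ListEditDistance {

    static <T> int distance(List<T> a, List<T> b) {
        int n = a.size();
        int m = b.size();
        // d[i][j] = cost of turning the first i elements of a into the first j elements of b
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i; // delete everything
        for (int j = 0; j <= m; j++) d[0][j] = j; // insert everything

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = Objects.equals(a.get(i - 1), b.get(j - 1)) ? 0 : 1;
                int best = Math.min(
                        Math.min(d[i - 1][j] + 1,      // deletion
                                 d[i][j - 1] + 1),     // insertion
                        d[i - 1][j - 1] + cost);       // substitution or match
                // transposition of two adjacent elements
                if (i > 1 && j > 1
                        && Objects.equals(a.get(i - 1), b.get(j - 2))
                        && Objects.equals(a.get(i - 2), b.get(j - 1))) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[n][m];
    }
}
```

Note that transposition here means swapping two adjacent elements. For the question's example, [A,B,C] -> [C,B,A], A and C are not adjacent, so this sketch counts two substitutions rather than one swap.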

The JaVers library (http://javers.org) does the job.
Diff diff = javers.compare(list1, list2);
The Diff contains a list of changes such as object-added, object-removed, and index-changed.

For now I'll just send the complete List over the wire but instead of the objects, I use only a unique ID. If the client does not have the object locally, it requests it using the ID.
This is certainly less beautiful than an optimal algorithm but has the expected result: expensive objects are only sent once over the wire.
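The client side of this ID-based scheme might look roughly like the sketch below. The class IdSync and its methods are hypothetical, and the actual fetching of missing objects from the server is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical client-side helper for the "send only IDs" approach. */
final class IdSync {

    /** IDs listed by the server that the client cache lacks, in first-seen order. */
    static <K> Set<K> missingIds(List<K> serverOrder, Map<K, ?> clientCache) {
        Set<K> missing = new LinkedHashSet<>();
        for (K id : serverOrder) {
            if (!clientCache.containsKey(id)) {
                missing.add(id);
            }
        }
        return missing;
    }

    /** Rebuilds the list in server order once the cache holds every listed ID. */
    static <K, V> List<V> rebuild(List<K> serverOrder, Map<K, V> clientCache) {
        List<V> result = new ArrayList<>(serverOrder.size());
        for (K id : serverOrder) {
            V value = clientCache.get(id);
            if (value == null) {
                throw new IllegalStateException("Object not fetched yet: " + id);
            }
            result.add(value);
        }
        return result;
    }
}
```

The client would call missingIds, request those objects, put them into the cache, and then call rebuild. Deleted objects simply stop appearing in the ID list.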

Related

Why does LogbackMDCAdapter keep track of the last operation?

I am looking at the implementation of LogbackMDCAdapter, and it keeps track of lastOperation. I don't understand the reason for doing this. Does anyone have an idea why this is done?
And why is duplicateAndInsertNewMap required?

Based on the comment here, the map copying is required for serialization purposes:
Each time a value is added, a new instance of the map is created. This
is to be certain that the serialization process will operate on the
updated map and not send a reference to the old map, thus not allowing
the remote logback component to see the latest changes.
This refers to the behaviour of ObjectOutputStream sending references to previously written objects instead of the full object, unless using the writeUnshared method.
It is not directly obvious why it's possible to skip copying unless there's a get/put combination, but apparently even if you have multiple put operations in a row, the serialization will work properly as long as the map is copied only when a put/remove is performed right after a get. So this is a performance optimization to avoid copying the map unnecessarily when putting several items in it.
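The underlying ObjectOutputStream behaviour can be reproduced with a small experiment. This is not Logback's code, only a sketch with a hypothetical class name: the same map is written twice to one stream, with a mutation in between, and the method reports how many entries the reader sees for the second write.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo of back-references in a single ObjectOutputStream. */
final class StaleReferenceDemo {

    static int sizeSeenBySecondRead(boolean copyBeforeSecondPut) {
        try {
            Map<String, String> mdc = new HashMap<>();
            mdc.put("user", "alice");

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(mdc);            // first write: full object
                if (copyBeforeSecondPut) {
                    mdc = new HashMap<>(mdc);    // new instance, unknown to the stream
                }
                mdc.put("request", "42");
                out.writeObject(mdc);            // same instance: only a back-reference is written
            }

            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                in.readObject();                                // first map
                Map<?, ?> second = (Map<?, ?>) in.readObject(); // second map
                return second.size();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Without the copy, the reader gets the first map back again and never sees the "request" entry. With the copy, the second write is a full object containing both entries.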

combined vs. separate backend calls

I am trying to figure out the best solution for a use case I'm working on, and I'd appreciate some architectural advice.
I have a use case where the frontend should display a list of users assigned to a task and a list of users who are not assigned but able to be assigned to the same task.
I don't know which solution is better:
1. Have one backend call which collects both lists of users and sends them back to the frontend within a new data class containing both lists.
2. Have two backend calls which each collect one of the two lists and send them back separately.
The first solution's pro is the single backend call whereas the second solution's pro is the reusability of the separate methods in the backend.
Any advice on which solution to prefer and why?
Is there any pattern or standard I should get familiar with?

When I need to get data from a server, I start with a single call per feature, more or less, depending on the problem domain (here, the feature would be your task-user-list).
This approach reduces implementation complexity on the client side and saves per-request protocol overhead (TCP headers, etc.).
If performance analysis shows that the call is too slow because it requests too much data (so the user experience suffers), then I would go with your second solution.
In short, I would start with the first approach and optimize (move to the more complex solution) only when it is necessary.

I'd prefer the two calls because of the reusability. Maybe one day you will need to add a third list of users for one case, and if you had only one method you would need to change it. But other use cases may require only the two lists and not the third, so you would need to change code there as well. You would also need to change all your test methods. As your project gets bigger, this makes it hard to update or fix, and every modification increases the chance of introducing new bugs.
It helps to think of the backend methods that the frontend can call as an interface.
In general an interface should be open for extension but closed with respect to what its methods return and require. Otherwise a slight modification leads to many further modifications.
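The two positions are not necessarily exclusive. One possible arrangement, sketched below with hypothetical names (TaskUsers, TaskUserService) and an in-memory stand-in for the real data source, keeps the two lookups as separate reusable methods and offers a combined call that merely composes them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical response type for the combined call. */
final class TaskUsers {
    final List<String> assigned;
    final List<String> assignable;

    TaskUsers(List<String> assigned, List<String> assignable) {
        this.assigned = assigned;
        this.assignable = assignable;
    }
}

/** In-memory stand-in for the backend; a real service would query a database. */
final class TaskUserService {
    private final List<String> allUsers;
    private final Map<Long, List<String>> assignedByTask = new LinkedHashMap<>();

    TaskUserService(List<String> allUsers) {
        this.allUsers = new ArrayList<>(allUsers);
    }

    void assign(long taskId, String user) {
        assignedByTask.computeIfAbsent(taskId, id -> new ArrayList<String>()).add(user);
    }

    /** Reusable on its own (solution 2). */
    List<String> findAssigned(long taskId) {
        return new ArrayList<>(assignedByTask.getOrDefault(taskId, new ArrayList<String>()));
    }

    /** Reusable on its own (solution 2). */
    List<String> findAssignable(long taskId) {
        List<String> result = new ArrayList<>(allUsers);
        result.removeAll(findAssigned(taskId));
        return result;
    }

    /** Single round trip for the frontend (solution 1), built from the two methods above. */
    TaskUsers findTaskUsers(long taskId) {
        return new TaskUsers(findAssigned(taskId), findAssignable(taskId));
    }
}
```

Whether the combined method is exposed as an endpoint can then be decided by measurement, without duplicating backend logic.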

How can I compare 2 large objects running on separate jvm's?

I am looking at changing the way some large objects, which hold the data for a large website, are reloaded. They contain data relating to catalogue structure, products, etc., and get reloaded daily.
After changing how they are reloaded, I need to be able to see whether there is any difference in the resulting data, so the intention is to reload both and compare the content.
There may be some issues (e.g. lists used where ordering is not important) that make the comparison harder, so I would need to be able to alter the structure before comparison. I have tried to serialise to JSON using Gson, but I run out of memory. I'm thinking of trying other serialisation methods or writing my own simple one.
I imagine this is something other people will have wanted to do when changing critical things like this, but I haven't managed to find anything about it.

In this special case (separate VMs) I suggest adding something like a dump method to each class which writes the relevant content into a file (human readable text). This method calls dump on each aggregated object as well.
In the end you have two files, one from each VM, and you can compare them using an MD5 checksum, for example.
This is probably a lot of work, but if you encounter any differences, you can use diff on both files, and this will be a great help.
You can start with a simple version, and refine it step-by-step by adding more output.
Adding (complete) serialization later to a class is cumbersome. There might be tools which simplify this (using reflection etc.), but in my experience you have to tweak your classes: Exclude fields which are not relevant, define a sort order for lists, cyclic relations etc.
Actually I use a similar approach for the same reasons (to check whether a new version still returns the same result): The application contains multiple services (for each version), the results are always data transfer objects, serialization is added immediately to the DTOs, and DTOs must provide a comparison method dedicated for this purpose.
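A minimal sketch of the dump-and-checksum idea follows. The classes CatalogueProduct, Catalogue and DumpChecksum are hypothetical stand-ins for the real object model. Lists whose order does not matter are sorted before being written, so that equal content always produces the same text.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical aggregated object with its own dump method. */
final class CatalogueProduct {
    final String sku;
    final List<String> tags;

    CatalogueProduct(String sku, List<String> tags) {
        this.sku = sku;
        this.tags = tags;
    }

    void dump(StringBuilder out) {
        out.append("product ").append(sku).append('\n');
        List<String> sorted = new ArrayList<>(tags);
        Collections.sort(sorted); // tag order is irrelevant, so normalise it
        for (String tag : sorted) {
            out.append("  tag ").append(tag).append('\n');
        }
    }
}

/** Hypothetical root object; dump() calls dump on each aggregated object. */
final class Catalogue {
    final List<CatalogueProduct> products;

    Catalogue(List<CatalogueProduct> products) {
        this.products = products;
    }

    String dump() {
        List<CatalogueProduct> sorted = new ArrayList<>(products);
        sorted.sort(Comparator.comparing((CatalogueProduct p) -> p.sku));
        StringBuilder out = new StringBuilder();
        for (CatalogueProduct product : sorted) {
            product.dump(out);
        }
        return out.toString();
    }
}

final class DumpChecksum {
    /** MD5 of the UTF-8 bytes of the text, as 32 lowercase hex digits. */
    static String md5Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For data that is too large to hold as one string, the same dump methods could write to a file or stream instead, with the digest updated incrementally. Each JVM only needs to report its checksum, and the text files are there for diff when the checksums disagree.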

Given the complications and memory issues, and since you have mentioned that you don't want to maintain versions, I would look at using a database for the comparison.
It will take some effort to map your in-JVM data to database tables, but once you have done that it will be straightforward. You can dump the data from one large object into the tables and then simply run a check against them from the second object.
Creating a stored procedure can simplify things. This solution can support data checks from any number of JVMs.

implementing sorted "watchlist" class in java: what instruments to use?

I need to implement a WatchList class as part of a Java client-server app. A WatchList (WL) is essentially an array of Items, each of which has a timestamp. I am responsible for the client side of the app. The WL might be updated manually on the client side, i.e. new elements can be added to it. It can also be modified by a regularly scheduled update from the server. Similarly, regular uploads are performed in which the terms that were added manually are sent to the server.
Since I am fairly new to Java, I need advice on which built-in classes I should use to implement this WL class. It will obviously be some type of sorted structure with a custom comparator that compares dates. I will probably also want to maintain it in newest-items-first order so that I can quickly retrieve the newest items to send to the server. In that case, an item received during a download from the server would be added to the beginning of the list rather than the end. Or is keeping it in newest-items-last order just as efficient?
Thanks much!

Newest first or newest last, it's a purely functional choice. The comparator of one is just the inverse of the comparator of the other.
You'll need to understand the classes of the collections framework.
I wouldn't keep unsaved schedules at client side. If the client crashes, you lose all the unsaved items. Why don't you simply call the server each time an item is created at client-side?
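As one possible starting point in the collections framework, here is a sketch built on TreeSet with a newest-first comparator. WatchItem, WatchList and their fields are hypothetical names. The comparator breaks timestamp ties by term, because a TreeSet treats elements that compare as equal as duplicates and would otherwise drop one of them.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical item type: a term plus the time it was added. */
final class WatchItem {
    final String term;
    final Instant timestamp;

    WatchItem(String term, Instant timestamp) {
        this.term = term;
        this.timestamp = timestamp;
    }
}

/** Hypothetical watch list kept sorted newest first. */
final class WatchList {
    private static final Comparator<WatchItem> NEWEST_FIRST = (a, b) -> {
        int byTime = b.timestamp.compareTo(a.timestamp); // reversed: newer sorts first
        return byTime != 0 ? byTime : a.term.compareTo(b.term);
    };

    private final TreeSet<WatchItem> items = new TreeSet<>(NEWEST_FIRST);

    /** Inserts in O(log n) regardless of where the item lands in the order. */
    void add(WatchItem item) {
        items.add(item);
    }

    /** Returns up to n of the newest items, newest first. */
    List<WatchItem> newest(int n) {
        List<WatchItem> result = new ArrayList<>();
        for (WatchItem item : items) {
            if (result.size() == n) {
                break;
            }
            result.add(item);
        }
        return result;
    }
}
```

With a sorted set, inserting costs the same whichever order is chosen, and the oldest-first view is available through descendingSet() if it is ever needed.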

What are the key points to make sure while implementing serialization

What are the key points on a checklist for implementing good serialization in Java? (I'm not really looking for implements Serializable, write, readObject, etc.)
Instead, how to reduce the size of the object, probably how to make the object in zip format and send over the network, etc.
How to ensure the secure mode of transfer.
Any others like this.

How to reduce the size of the object: new ObjectOutputStream(new GZIPOutputStream(new BufferedOutputStream(out))). But this is a space-time tradeoff. You may find that it makes performance worse by adding latency.
How to ensure the secure mode of transfer: SSLSocket or an HTTPS URL.
Any others like this
Any others like what? You will need to be specific.
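For reference, a self-contained round trip with the compressed stream could look like the sketch below. The class name GzipSerialization is hypothetical, and byte arrays stand in for the network stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: Java serialization wrapped in GZIP compression. */
final class GzipSerialization {

    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the outer stream also finishes the GZIP stream and writes its trailer.
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

How much this saves depends entirely on the data. Highly repetitive payloads shrink a lot, while small or already-compressed ones may even grow.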

Do not use serialization to "persist" objects - this makes schema management (i.e. changing what constitutes a class's state) almost unworkable
Always declare a serialVersionUID field; otherwise the UID is computed from the class structure, and adding a method or changing the class in other ways (even ones that do not affect state) leaves old versions of your code unable to deserialize the objects (an InvalidClassException is thrown)
Implement readResolve if you are deserializing a logical "enum" instance of a class (typesafe enumeration pattern)
Make sure you are 100% happy with the name of the variables which make up the state of your class. Once you have serialized instances around, you cannot change the variable names
Do not implement Serializable unless you really have to
Do not make your interfaces Serializable - the implementation classes may be, but the interfaces should not be
Do not make serialization part of the way your library passes objects around, unless you are the only producer and consumer of the objects (e.g. server-GUI communication). Use a binary/wire protocol instead (e.g. protobuf)
To minimize what is sent over the wire, you could use swizzling. That is, perhaps you have a Product class; the serialized form might just be a unique int id field. All other methods could then be made to construct relevant state as required (perhaps as a database call, or call to some central service)
Make sure, if you are serializing out an object which contains some collection of elements as part of its state, that you synchronize on the collection. Otherwise you may find that someone modifies the collection as it is being serialized out, resulting in a ConcurrentModificationException
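Two of the points above, the explicit serialVersionUID and readResolve for the typesafe enumeration pattern, can be illustrated together. The class CardSuit and the helper SerialRoundTrip are hypothetical; for new code a real Java enum would normally be the simpler choice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;

/** Hypothetical typesafe enumeration with a fixed set of instances. */
final class CardSuit implements Serializable {
    // Declared explicitly so that adding methods later does not change the UID.
    private static final long serialVersionUID = 1L;

    static final CardSuit HEARTS = new CardSuit("HEARTS");
    static final CardSuit SPADES = new CardSuit("SPADES");

    private final String name;

    private CardSuit(String name) {
        this.name = name;
    }

    /** Replaces the freshly deserialized copy with the canonical instance. */
    private Object readResolve() throws ObjectStreamException {
        if (HEARTS.name.equals(name)) return HEARTS;
        if (SPADES.name.equals(name)) return SPADES;
        throw new InvalidObjectException("Unknown suit: " + name);
    }
}

/** Hypothetical helper that serializes a value to memory and reads it back. */
final class SerialRoundTrip {
    static Object roundTrip(Serializable value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Without readResolve, the round trip would return a second HEARTS object and identity comparisons with == against the constants would fail.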

Probably how to make the object in zip format and send over the network etc..
Check out how developers of networked games implement networking. They know how to transmit data quickly. Have a look at e.g. http://code.google.com/p/kryonet/
How to ensure the secure mode of transfer. any others like this..
There are many interpretations of "secure mode". If you need reliability, use TCP; otherwise UDP. If you need encryption, use TLS; otherwise rot13 may fit. If you need to ensure integrity, append a hash of the values to the message.
How to reduce the size of the object ,
Analyse your data and strip down the objects so that they contain only the necessary data. This is very context-specific, as the best optimisation may lie in the domain. For example, you can check whether it is possible to send only deltas of the changes.
It is an interesting question, but you have to be more specific about your goal or domain to get an answer that fits best.
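The "append a hash" suggestion for integrity could be sketched as follows, with SHA-256 chosen here as an assumption and HashedMessage as a hypothetical class name. A plain hash like this detects accidental corruption only; anyone who can alter the message can also recompute the hash.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical framing: payload bytes followed by their SHA-256 digest. */
final class HashedMessage {
    private static final int DIGEST_LENGTH = 32; // SHA-256 produces 32 bytes

    static byte[] seal(byte[] payload) {
        byte[] message = Arrays.copyOf(payload, payload.length + DIGEST_LENGTH);
        System.arraycopy(sha256(payload), 0, message, payload.length, DIGEST_LENGTH);
        return message;
    }

    /** Returns the payload, or throws if the trailing digest does not match. */
    static byte[] open(byte[] message) {
        if (message.length < DIGEST_LENGTH) {
            throw new IllegalArgumentException("Message too short");
        }
        int payloadLength = message.length - DIGEST_LENGTH;
        byte[] payload = Arrays.copyOfRange(message, 0, payloadLength);
        byte[] received = Arrays.copyOfRange(message, payloadLength, message.length);
        if (!MessageDigest.isEqual(received, sha256(payload))) {
            throw new IllegalArgumentException("Hash mismatch");
        }
        return payload;
    }

    private static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```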
