Partial deserialization and serialization in Java?

There are a huge number of libraries and approaches out there to serialize and de-serialize objects in Java.
What I would like to do involves rather large and complex objects which need to get sent back and forth between processing nodes.
However, each node is interested in only one or a few, usually small, parts of the whole object. The processing node processes that part and creates a new part that needs to be spliced into the existing serialized object before it is sent on.
For this, two things would be of high importance:
Being able to deserialize just parts of the serialized object (and thus save parsing/deserialization time, object creation time, memory...) and to add the serialization of some new part to the existing serialized object (again saving time and memory). Skipping the unwanted parts in the serialized version should be extremely fast and efficient, and should ideally be possible in a streaming mode, without the need to keep the whole serialized data in memory at once.
Overall compact and fast serialization and deserialization.
I am pretty flexible as to how much automation I get for actually creating typed objects versus untyped maps and lists: if all else fails I would be able to represent the whole object as a nested data structure of just maps, arrays and the basic datatypes boolean, String and Number.
UPDATE: forgot to mention two additional, rather important requirements:
the solution must be possible with the existing objects, i.e. it is not possible to re-implement the current object using, e.g., a different collections class.
ideally the solution should be based on open-source software because the software I need this for will be published itself as open-source.

It sounds like you're planning a design where a whole bunch of data is sent to a processing node, and that node will only read/modify/write a small part of it, but will then send the whole bundle on to another node.
Why not have the host that has all the data figure out which node needs which data, and only send that data? Then processing can happen in parallel instead of daisy-chained, and your total network traffic will be less than with every node sending a full copy of everything: O(n*m).
It might be worth designing your own message format, potentially based on JSON, binary, or something else.
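As a rough illustration of a custom binary format, here is a minimal sketch in which every part is stored as a name, a length and a payload, so a reader can skip the parts it does not care about and a writer can append a new part without touching the old ones. The class name `SectionedMessage` and the layout itself are assumptions made up for this example, not an existing library or standard.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical container layout, repeated per part:
//   [name (writeUTF)] [payload length (int)] [payload bytes]
final class SectionedMessage {

    /** Appends one named part; the existing parts are copied as opaque bytes. */
    static byte[] append(byte[] message, String name, byte[] payload) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.write(message);
            out.writeUTF(name);
            out.writeInt(payload.length);
            out.write(payload);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the payload of the first part called wanted, or null if absent. */
    static byte[] read(byte[] message, String wanted) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(message))) {
            // available() is reliable here only because the source is in memory
            while (in.available() > 0) {
                String name = in.readUTF();
                int length = in.readInt();
                if (name.equals(wanted)) {
                    byte[] payload = new byte[length];
                    in.readFully(payload);
                    return payload;
                }
                // Unwanted part: skip its bytes without parsing them.
                if (in.skipBytes(length) != length) {
                    throw new EOFException("truncated part: " + name);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The payload of each part can be produced by whatever serializer suits that part; the container only needs the length up front. Replacing an existing part in place is not covered by this sketch.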

Related

GSON - updating .json file

Is there any way to update a value in a JSON file without loading the file into an object, changing the value and saving it again?
It sounds very inefficient.
EDIT:
E.g. I want to add an item into some array

Any solution to this problem requires reading through the entire file in some manner. Any off-the-shelf JSON library such as Gson will create objects to represent the textual JSON elements.
To avoid the "overhead" of creating the objects you could create a special-use parser, but you'll still need to read the entire file. You could reduce the amount of memory used by successively reading one line, processing it and then writing it to the output file. In the case you describe this will mostly be writing the line back out with no changes. This will be more efficient but may take a while to implement and may be buggy until you get all the problems worked out. On top of that, if the JSON format changes you may have to change your code and then go through the process of working out the bugs you may have introduced.
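A minimal sketch of that line-by-line approach, under strong assumptions: the file is pretty-printed, the target array opens on a line that reads exactly `"key": [`, each element sits on its own line, and the closing `]` is on its own line with no multi-line nested arrays in between. The class and method names are made up for the example. It also shows why such a special-use parser is fragile: any change in layout breaks it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class JsonLineSplice {

    /** Copies in to out line by line, adding newItemJson as the last element of the array at key. */
    static void appendToArray(BufferedReader in, Writer out, String key, String newItemJson)
            throws IOException {
        String opening = "\"" + key + "\": [";
        String pending = null;          // last line read but not yet written
        boolean inArray = false;
        boolean arrayHasItems = false;
        String line;
        while ((line = in.readLine()) != null) {
            String trimmed = line.trim();
            if (inArray && trimmed.startsWith("]")) {
                // pending is the last element (or the opening line if the array is empty)
                String indent = line.substring(0, line.indexOf(']')) + "  ";
                out.write(arrayHasItems ? pending + "," : pending);
                out.write('\n');
                out.write(indent + newItemJson);
                out.write('\n');
                inArray = false;
                arrayHasItems = false;
            } else {
                if (pending != null) {
                    out.write(pending);
                    out.write('\n');
                }
                if (inArray) {
                    arrayHasItems = true;
                } else if (trimmed.equals(opening)) {
                    inArray = true;
                }
            }
            pending = line;
        }
        if (pending != null) {
            out.write(pending);
            out.write('\n');
        }
    }

    /** In-memory convenience wrapper, mainly for trying the method out. */
    static String appendToArray(String json, String key, String newItemJson) {
        StringWriter out = new StringWriter();
        try (BufferedReader in = new BufferedReader(new StringReader(json))) {
            appendToArray(in, out, key, newItemJson);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Only two lines are held in memory at any time, but the whole file is still read and rewritten.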
If you don't want to write your own parser but the JSON file is large and you don't want to read it all into memory at once, you might check out Gson streaming: https://sites.google.com/site/gson/streaming.
These days, people often use off-the-shelf libraries because they can get the implementation done faster and with fewer bugs, even though it is less efficient when it executes. Unless the inefficiency has a measurable impact on user experience or noticeably affects execution time/memory utilization this practice is probably acceptable, but ymmv.

Storing Large Amounts of Dictionary-Like Data Within an Application in Java

I fear I may not be truly understanding the utility of database software like MySQL, so perhaps this is an easy question to answer.
I'm writing a program that stores and accesses a bestiary for use in the program. It is a stand-alone application, meaning that it will not connect to the internet or a database (which I am under the impression requires a connection to a server). Currently, I have an enormous .txt file that it parses via a simple pattern (Habitat is on every tenth line, starting with the seventh; name is on every tenth line, starting with the first; etc.). This is prone to parsing errors (problems with reading data that is unrecognizable with the specified encoding, as a lot of the data is copy/pasted by lazy data-entry-ists) and I just feel that parsing a giant .txt file every time I want data is horribly inefficient. Plus, I've never seen a deployed program that had a .txt lying around called "All of our important data.txt".
Are databases the answer? Can they be used simply in basic applications like this one? Writing a class for each animal seems silly. I've heard XML can help, too - but I know virtually nothing about it except that it's a markup language.
In summary, I just don't know how to store large amounts of data within an application. A good analogy would be: How would you store data for a dictionary/encyclopedia application?

So you are saying that a standalone application without internet access cannot have a database connection? Your basic assumption that a database cannot exist in a standalone app is wrong. Today's web applications use browser-assisted SQL databases to store data. All you need to do is experiment rather than speculate. If you need direction, start with the lightweight SQLite.

While databases are undoubtedly a good idea for the kind of application you're describing, I'll throw another suggestion your way, which might suit you if your data doesn't necessarily need to change at all, and there's not a "huge" amount of it.
Java provides the ability to serialise objects, which you could use to persist and retrieve object instance data directly to/from files. Using this simple approach, you could:
Write code to parse your text file into a collection of serialisable application-specific object instances;
Serialise these instances to some file(s) which form part of your application;
De-serialise the objects into memory every time the application is run;
Write your own Java code to search and retrieve data from these objects yourself, for example using ordered collection structures with custom comparators.
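The four steps above could look roughly like this. `Bestiary`, `Beast` and its two fields are hypothetical names chosen for the example; a `TreeMap` with `String.CASE_INSENSITIVE_ORDER` stands in for "an ordered collection with a custom comparator".

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.TreeMap;

final class Bestiary {

    // Hypothetical entry type; a real one would carry every field parsed from the text file.
    static final class Beast implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final String habitat;

        Beast(String name, String habitat) {
            this.name = name;
            this.habitat = habitat;
        }
    }

    /** Serialises the whole map (comparator included) to one file. */
    static void save(Path file, TreeMap<String, Beast> beasts) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(beasts);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** De-serialises the map, e.g. once at application start-up. */
    @SuppressWarnings("unchecked")
    static TreeMap<String, Beast> load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (TreeMap<String, Beast>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Beast class missing from classpath", e);
        }
    }
}
```

After loading, `beasts.get("dragon")` finds an entry stored under "Dragon" because the comparator ignores case, and methods such as `subMap` give simple range queries without any database.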
This approach may suffice if you:
Don't expect your data to change;
Do expect it to always fit within memory on the JVMs you're expecting the application will be run on;
Don't require sophisticated querying abilities.
Even if one or more of the above things do not hold, it may still suit you to try this approach, so that your next step could be to use a so-called object-relational mapping tool like Hibernate or Castor to persist your serialisable data not in a file, but a database (XML or relational). From there, you can use the power of some database to maintain and query your data.

What Is More Expensive In Java? Serialization or Writing To a File?

I have a List of Strings I need to store locally (assume the list can run between 10 items to 100 items). I want to know if I should write the lists into a Flat database or use Serialization to flatten the object containing the list? Which is more expensive (CPU-wise)? What are the conditions that make one more expensive than the other?
Thanks!!

Especially since they are Strings, just write them out one per line to a file. Simple, fast, and far easier to test.
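For what it's worth, with `java.nio.file.Files` this is only a couple of lines. The class name `LineStore` is made up for the example, and the sketch assumes that no string contains a line break.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class LineStore {

    /** Writes one string per line, replacing any existing file content. */
    static void save(Path file, List<String> items) {
        try {
            Files.write(file, items, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the strings back, one per line. */
    static List<String> load(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```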

I have a List of Strings I need to store locally (assume the list can run between 10 items to 100 items).
Assuming that the total length of the strings is small (e.g. less than 10K), the user-space CPU time used to do the saving is likely to be a few milliseconds using either serialization or a flat file. In other words, it will be so fast that the user won't notice the difference.
You should be looking at the other reasons for choosing between the two alternatives (and others):
How easy is it to write the code.
How many extra dependencies does the alternative pull in.
Human readability / editability of the saved data file ... in case you need to do this.
How easy / hard it would be to change the "schema" of the stuff saved to file ... in case you need to do this.
Whether you can update one string without rewriting the whole file ... if this is relevant.
Support for other things such as atomic update, transactions, complex queries, etc ... if these are relevant.
And if, despite what I said above, you still want to know which will be faster (and by how much), then benchmark it. The real world performance will depend on factors that you haven't specified.
Here are a couple of important references on how to write a Java benchmark so that it gives meaningful results.
How NOT to write a Java micro-benchmark
Robust Java benchmarking, Part 1: Issues.
Robust Java benchmarking, Part 2: Statistics and solutions
And you can experiment to answer this part of your question:
What are the conditions that make one more expensive than the other?
(See above)

I am not sure about the expense, but I believe the object representation often contains a whole lot of metadata (and structure), which might result in an object much bigger than the original intended data. An example is storing an XML structure in a DOM object: it takes about 4X the size in memory of the original data.
Based on the above, I think serializing as an object might be more expensive. You may also want to consider how the end product will be consumed. If you want the produced file to be human readable, you will have to write the String data out in a readable text form.

What is an efficient way to read/write a priority queue to a text file?

I have a priority queue class that I implemented in Java as an array of queues. I need a good way (without using Serialization) of recording and storing the contents of the priority queue after each "transaction" or enqueue()/dequeue() of an object from the priority queue. It should serve as a backup in the event that the priority queue needs to be rebuilt by the program from the text file.
Some ideas I had and my problems with each:
After each "transaction", loop through the queues and write each one to a line in the file using delimiters between objects.
-- My problem with this is that it would require dequeueing and re-enqueueing all the objects and this seems highly inefficient.
After each enqueue or dequeue simply write that object or remove that object from the file.
-- My problem with this is: if this is the approach I should be taking, I am having a hard time coming up with a way to easily find and delete the object after being dequeued.
Any hints/tips/suggestions would be greatly appreciated!

To loop through a queue you can just iterate over it. This is non-destructive (but only loosely thread safe)
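For example, a snapshot with one line per priority level can be built by plain iteration. This is only a sketch: a `List` of `java.util.Queue<String>` stands in for your own array of queues, and `QueueSnapshot` and the delimiter are made-up names and choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class QueueSnapshot {

    /** Returns one line per priority level, items joined by the delimiter. */
    static List<String> toLines(List<Queue<String>> levels, String delimiter) {
        List<String> lines = new ArrayList<>();
        for (Queue<String> level : levels) {
            // String.join iterates over the queue; nothing is dequeued.
            lines.add(String.join(delimiter, level));
        }
        return lines;
    }
}
```

The delimiter must not occur inside an item, otherwise the file cannot be parsed back unambiguously.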
Writing the contents of the queue to disk every time is likely to be very slow. For a typical hard drive, a small queue will take about 20 ms to write. i.e. 50 times per second at best. If you use an SSD this will be much faster for a small queue, however you still have to marshal your data even if you don't use Serialisation.
An alternative is to use a JMS server which is designed to support transactions, queues and persistence. A typical JMS server can handle about 10,000 messages per second. There are a number of good free servers available.

I would implement your requirements as a log pattern. At the end of your file, append every enqueue and its priority, append every dequeue. If your messaging server crashes, you can replay the log file and you'll end up with the appropriate state.
Obviously, your log file will grow huge over time. To combat this, you'll want to rotate log files every so often. To do this, serialize your queue at a point in time, and then begin logging in a new file. You can even accomplish this without locking the state (freezing queue requests) by simultaneously logging transactions to the old and new logs while a snapshot of the data structure is written to disk. When the snapshot is complete, write a pointer indicating that to disk and you can delete your old log.
Write time and space are O(n); replays should be rare and are relatively fast.
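A minimal sketch of the log-and-replay idea, leaving out rotation and snapshots. Assumptions: `java.util.PriorityQueue` stands in for your own priority queue class, a `List<String>` stands in for the append-only file, a lower number means higher priority, and the `ENQ`/`DEQ` line format is invented for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class LoggedPriorityQueue {

    // Hypothetical entry: lower priority number is served first.
    static final class Entry {
        final int priority;
        final String item;

        Entry(int priority, String item) {
            this.priority = priority;
            this.item = item;
        }
    }

    private final PriorityQueue<Entry> queue =
            new PriorityQueue<>(Comparator.comparingInt((Entry e) -> e.priority));
    private final List<String> log;     // stands in for an append-only text file

    LoggedPriorityQueue(List<String> log) {
        this.log = log;
    }

    void enqueue(int priority, String item) {
        log.add("ENQ " + priority + " " + item);    // append to the log first
        queue.add(new Entry(priority, item));
    }

    /** Returns the next item, or null if the queue is empty. */
    String dequeue() {
        Entry next = queue.poll();
        if (next == null) {
            return null;
        }
        log.add("DEQ");
        return next.item;
    }

    /** Rebuilds the queue state by re-applying every logged operation in order. */
    static LoggedPriorityQueue replay(List<String> oldLog) {
        LoggedPriorityQueue rebuilt = new LoggedPriorityQueue(new ArrayList<>());
        for (String line : oldLog) {
            if (line.startsWith("ENQ ")) {
                String[] parts = line.split(" ", 3);    // item may contain spaces
                rebuilt.enqueue(Integer.parseInt(parts[1]), parts[2]);
            } else if (line.equals("DEQ")) {
                rebuilt.dequeue();
            }
        }
        return rebuilt;
    }
}
```

Because a dequeue is logged without naming the object, nothing ever has to be found and deleted in the file; replaying the same operations in the same order reproduces the same state.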

To find objects easily in the second approach, I have a couple of suggestions:
You can use your priority function to keep objects sorted in the file.
To manage newly added objects at different positions, keep some space between every inserted object in the text file, and when an object is inserted, you can use some pointer-like behavior to specify the offset or something else which can be easily managed.
Use a buffer, since writing content every time can be very slow.
Deletion will be trivial if you use your priority function carefully.
Also, sorting in small buckets pointed to by pointers will be very fast, and you can always use a garbage-collection type of behavior by compacting all the objects after some time.

One more suggestion (to consider if using exactly one file is not a must):
If your number of objects is not very large, store each object in a separate file. Of course, you will need to make a unique identifier for each object, and you can use this identifier as the file name too. This way, you always add or delete a single file based on the identifier stored in the object. If the objects are of various classes that can't be modified, you can simply store a hashmap that maps identifiers to objects. So before you add an object to a queue, you create an identifier, add the object and the identifier to the map as a pair, and write a new file named after the identifier and containing the object. I leave what to do on delete and reload as an exercise, as it is nothing more than practice.
Personally, I favour what was suggested by Robert Harvey in his comment on the question. Consider the use of a database, especially if your project has one already. This will make storing objects and deleting objects easier and faster than locating positions within a file, because even if you find the location of the object in a file, you will most probably need to write the whole file again (only without that object), and that is no different from looping. Using a database, you avoid all of this trouble.
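The one-file-per-object idea above could be sketched like this. `FilePerObjectStore` is a made-up name, a random UUID stands in for the unique identifier, and the object is assumed to already be available as text.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

final class FilePerObjectStore {

    private final Path dir;     // existing directory that holds one file per queued object

    FilePerObjectStore(Path dir) {
        this.dir = dir;
    }

    Path fileFor(String id) {
        return dir.resolve(id + ".txt");
    }

    /** Call on enqueue: writes the object's text to a new file and returns its identifier. */
    String add(String text) {
        String id = UUID.randomUUID().toString();
        try {
            Files.write(fileFor(id), text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return id;
    }

    /** Call on reload: reads one object's text back. */
    String read(String id) {
        try {
            return new String(Files.readAllBytes(fileFor(id)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Call on dequeue: deleting the object is just deleting its file. */
    void remove(String id) {
        try {
            Files.deleteIfExists(fileFor(id));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The priority and ordering of each object would still have to be stored somewhere, for instance inside the file or in its name, which this sketch leaves out.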

What is the purpose of Serialization in Java?

I have read quite a number of articles on Serialization and how it is so nice and great but none of the arguments were convincing enough. I am wondering if someone can really tell me what is it that we can really achieve by serializing a class?

Let's define serialization first, then we can talk about why it's so useful.
Serialization is simply turning an existing object into a byte array. This byte array represents the class of the object, the version of the object, and the internal state of the object. This byte array can then be used between JVMs running the same code to transmit/read the object.
Why would we want to do this?
There are several reasons:
Communication: If you have two machines that are running the same code, and they need to communicate, an easy way is for one machine to build an object with information that it would like to transmit, and then serialize that object to the other machine. It's not the best method for communication, but it gets the job done.
Persistence: If you want to store the state of a particular operation in a database, it can be easily serialized to a byte array, and stored in the database for later retrieval.
Deep Copy: If you need an exact replica of an Object, and don't want to go to the trouble of writing your own specialized clone() method, simply serializing the object to a byte array, and then de-serializing it to another object achieves this goal.
Caching: Really just an application of the above, but sometimes an object takes 10 minutes to build, but would only take 10 seconds to de-serialize. So, rather than hold onto the giant object in memory, just cache it out to a file via serialization, and read it in later when it's needed.
Cross JVM Synchronization: Serialization works across different JVMs that may be running on different architectures.
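To make the byte-array and deep-copy points concrete, here is a small sketch using the standard `ObjectOutputStream` and `ObjectInputStream`. The helper class `DeepCopy` is a name invented for the example, and it only works when every object in the graph implements `Serializable`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class DeepCopy {

    /** Serializes the object to a byte array and reads a fresh, independent copy back. */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T copy(T original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The same byte array could just as well be sent over a socket or stored in a database column, which is what the Communication and Persistence points describe.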

While you're running your application, all of its objects are stored in memory (RAM). When you exit, that memory gets reclaimed by the operating system, and your program essentially 'forgets' everything that happened while it was running. Serialization remedies this by letting your application save objects to disk so it can read them back the next time it starts. If your application is going to provide any way of saving/sharing a previous state, you'll need some form of serialization.

I can share my story and I hope it will give some ideas why serialization is necessary. However, the answers to your question are already remarkably detailed.
I had several projects that needed to load and read a bunch of text files. The files contained stop words, biomedical verbs, biomedical abbreviations, words semantically connected to each other, etc. The contents of these files are simple: words!
Now for each project, I needed to read the words from each of these files and put them into different arrays; as the contents of the file never changed, it became a common, however redundant, task after the first project.
So, what I did is that I created one object to read each of these files and to populate individual arrays (instance variables of the objects). Then I serialized the objects and then for the later projects, I simply deserialized them. I didn't have to read the files and populate the arrays again and again.

In essence:
Serialization is the process of converting a set of object instances that contain references to each other into a linear stream of bytes, which can then be sent through a socket, stored to a file, or simply manipulated as a stream of data
See uses from Wiki:
Serialization has a number of advantages. It provides:
a method of persisting objects which is more convenient than writing their properties to a text file on disk, and re-assembling them by reading this back in.
a method of issuing remote procedure calls, e.g., as in SOAP
a method for distributing objects, especially in software componentry such as COM, CORBA, etc.
a method for detecting changes in time-varying data.

The most obvious is that you can transmit the serialized class over a network, and the recipient can construct a duplicate of the original instance. Likewise, you can save a serialized structure to a file system.
Also, note that serialization is recursive, so you can serialize an entire heterogeneous data structure in one swell foop, if desired.
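A small sketch of what "recursive" means in practice: writing one object writes everything reachable from it, and references between the objects (even cycles) survive the round trip. `GraphRoundTrip` and `Node` are names made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class GraphRoundTrip {

    // Hypothetical node type that can point at another node, including back at itself.
    static final class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        final String label;
        Node next;

        Node(String label) {
            this.label = label;
        }
    }

    /** Serializes root plus everything reachable from it, then reads the graph back. */
    static Node roundTrip(Node root) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(root);      // one call; referenced nodes follow automatically
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Node) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If `a.next = b` and `b.next = a`, the copy returned by `roundTrip(a)` is a new pair of nodes that point at each other in the same way, not at the originals.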

Serialized objects maintain state in space, they can be transferred over the network, file system, etc... and time, they can outlive the JVM that created them.
Sometimes this is useful.

I use serialized objects to standardize the arguments I pass to functions or class constructors. Passing one serialized bean is much cleaner than a long list of arguments. The result is code that is easier to read and debug.

For the simple purpose of learning (notice, I said learning, I did not say best, or even good, but just for the sake of understanding stuff), you could save your data to a text file on the computer, then have a program that reads that info, and based on the file, you could have your program respond differently. If you were more advanced, it wouldn't necessarily have to be a txt file, but something else.
Serializing on the other hand, puts things directly into computer language. It's like you're telling a Spanish computer something in Spanish, rather than telling it something in French, forcing it to learn French, then save things into its native Spanish by translating everything. Not the most tech-intensive answer, I'm just trying to create an understandable example in a common language format.
Serialization can also be faster than parsing a text file, since the object's state is written and read back directly instead of being translated to and from text, though how much faster depends on the data, so measure it. It also means less file-processing code from a programmer's point of view.

A classic example of serialization in daily life is the "Save Game" option in computer games. When the player decides to save his progress in the game, the application writes the state of the game to a file via serialization, and when the player chooses "Load Game", the serialized file is read and the game state is re-created.
