Is there any way to update a value in a JSON file without loading the file into an object, changing the value and saving it again?
It sounds very inefficient.
EDIT:
E.g. I want to add an item into some array
Any solution to this problem requires reading through the entire file in some manner. Any off-the-shelf JSON library such as Gson will create objects to represent the textual JSON elements.
To avoid the "overhead" of creating the objects you could write a special-purpose parser, but you'll still need to read the entire file. You could reduce the amount of memory used by successively reading one line, processing it and then writing it to the output file. In the case you describe, this will mostly mean writing each line back out with no changes. This will be more efficient, but it may take a while to implement and may be buggy until you get all the problems worked out. On top of that, if the JSON format changes you may have to change your code and then go through the process of working out any bugs you introduced.
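A minimal sketch of the line-by-line approach, using only the JDK. It rests on loud assumptions that real JSON does not guarantee: the file is pretty-printed, the target array holds one scalar element per line, its opening line is exactly `"key": [`, and its closing bracket sits on its own line. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class LineStreamingAppend {
    /**
     * Copies 'in' to 'out' one line at a time and appends newItem as the last
     * element of the array opened by the line  "arrayKey": [
     * Only one line is held back in memory at any time.
     */
    static void appendToArray(Path in, Path out, String arrayKey, String newItem) throws IOException {
        String opener = "\"" + arrayKey + "\": [";
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            boolean inArray = false;
            String previous = null; // one line of look-behind, written once we know what follows it
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (inArray && (trimmed.equals("]") || trimmed.equals("],"))) {
                    // 'previous' is either the opening line (empty array) or the last element.
                    boolean emptyArray = previous.trim().endsWith("[");
                    writer.write(emptyArray ? previous : previous + ",");
                    writer.newLine();
                    String indent = line.substring(0, line.indexOf(']')) + "  ";
                    previous = indent + newItem; // the new element is now the held-back line
                    inArray = false;
                } else if (trimmed.equals(opener)) {
                    inArray = true;
                }
                if (previous != null) {
                    writer.write(previous);
                    writer.newLine();
                }
                previous = line;
            }
            if (previous != null) {
                writer.write(previous);
                writer.newLine();
            }
        }
    }
}
```

The number of special cases even in this toy version (empty arrays, trailing commas, indentation) illustrates why a hand-rolled parser tends to be buggy at first.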
If you don't want to write your own parser but the JSON file is large and you don't want to read it all into memory at once, you might check out Gson streaming: https://sites.google.com/site/gson/streaming.
These days, people often use off-the-shelf libraries because they can get the implementation done faster and with fewer bugs, even though it is less efficient when it executes. Unless the inefficiency has a measurable impact on user experience or noticeably affects execution time/memory utilization this practice is probably acceptable, but ymmv.
I am making an application in Java which uses files to store information with serialization. The trouble I ran into is that every time I update one of the classes being stored, I obviously get an InvalidClassException. The process I am following for now is that I just rm all the files and rebuild them. Obviously that's tedious with 5 users, and I couldn't continue doing it with 10. What's the standard best practice when updating serialized objects so that the information in the files is not lost?
Mostly?
Stop using java's baked-in serialization. It sucks. This isn't just an opinion - the OpenJDK engineers themselves routinely raise fairly serious eyebrows when the topic of java's baked-in serialization mechanism (ObjectInputStream / ObjectOutputStream) comes up. In particular:
It is binary.
It is effectively unspecified; you will never be reading or writing it with anything other than java code.
Assuming multiple versions are involved (and they are - that's what your question is all about), it is extremely convoluted and requires advanced java knowledge to try to spin together some tests to ensure that things are backwards/forwards compatible as desired.
The format is not particularly efficient even though it is binary.
The API is weirdly un-java-like (with structural typing even, that's.. bizarre).
So what should I do?
You use an explicit serializer: A library that you include which does the serialization. There are many options. You can use GSON or Jackson to turn your object into a JSON string, and then store that. JSON is textual, fairly easy to read, and can be read and modified by just about any language. Because you 'control' what happens, it's a lot simpler to tweak the format and define what is supposed to happen (e.g. if you add a new field in the code, you can specify what the default should be in your Jackson or GSON annotations, and that's the value that you get when you read in a file written with a version of your class that didn't have that field).
JSON is not efficient on disk at all, but it's trivial to wrap your writes and reads with GZIPInputStream / GZIPOutputStream if that's an issue.
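As a rough sketch of that wrapping, using only java.util.zip from the JDK: the JSON text is assumed to have been produced already by whichever library you picked, and the class name `GzipJsonStore` is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipJsonStore {
    /** Writes the JSON text to 'file', gzip-compressed. */
    static void write(Path file, String json) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8)) {
            writer.write(json);
        }
    }

    /** Reads a gzip-compressed file back into JSON text. */
    static String read(Path file) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(file))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

For large files you would hand the wrapped stream straight to the JSON library instead of building the whole string in memory.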
An alternative is protobuf. It is more effort but you end up with a binary data format that is fairly compact even if not compressed, and can still be read and written to from many, many languages, and which also parses way faster (this is irrelevant, computers are so fast, the bottleneck will be network or disk, but, if you're reading this stuff on battery-powered raspberry pis or what not, it matters).
I really want to stick with java's baked-in serialization
Read the docs, then. The specific part you want here is what serialVersionUID is all about, but there are so many warts and caveats that you should most definitely not just put an svuid in and move on with life - you'll run into the next weird bug in about 5 seconds. Read it all, experiment, attempt to understand it fully.
Then give up, realize it's a mess and ridiculously complicated to test properly, and use one of the above options.
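For reference, this is roughly what the serialVersionUID mechanism mentioned above looks like. It is a sketch only: `User` and `SerialVersionDemo` are hypothetical names, and a single source file cannot show two versions of a class side by side, so the comments describe what happens to files written by an older version.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SerialVersionDemo {
    static final class User implements Serializable {
        // Declared explicitly: without it the JVM computes an id from the class
        // shape, so adding a field changes the id and old files fail with
        // InvalidClassException.
        private static final long serialVersionUID = 1L;

        final String name;
        // Imagine this field was added in a later version. With the id unchanged,
        // files written before it existed still load, and the field gets its
        // default value (0).
        int loginCount;

        User(String name) {
            this.name = name;
        }
    }

    static byte[] toBytes(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    static Object fromBytes(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }
}
```

Keeping the id fixed only covers compatible changes such as adding a field; changing a field's type or moving the class still breaks old files, which is part of the mess described above.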
I'm creating a task to parse two large XML files and find a 1-1 relation between elements. I cannot keep a whole file in memory, and I have to "jump" around in the file to check up to n^2 combinations.
I am wondering what approach I could take to navigate between nodes without killing my machine. I did some reading on StAX and I liked the idea, but the cursor moves one way only and I will have to go back to check different possibilities.
Could you suggest any other possibility? I need one that allows commercial use.
I'd probably consider reading the first file into some sort of structured cache and then read the 2nd XML document, referencing against this cache (the cache could actually be a DB - it doesn't need to be in memory).
Otherwise there's no real solution (that I know of) unless you could read the whole file into memory. This ought to perform better too rather than going back and forth across the DOM of an XML document.
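A sketch of the cache idea using StAX from the JDK, under assumptions the question does not confirm: elements are matched on a single key attribute, and the set of keys from the first file fits in memory (otherwise the `Set` would be replaced by a DB table, as suggested above). The element and attribute names are passed in; all class and method names are hypothetical.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class XmlKeyJoin {
    /** Streams the document forward once and reports the key attribute of every matching element. */
    static void forEachKey(Reader xml, String element, String keyAttr, Consumer<String> sink)
            throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(element)) {
                    String key = reader.getAttributeValue(null, keyAttr);
                    if (key != null) {
                        sink.accept(key);
                    }
                }
            }
        } finally {
            reader.close();
        }
    }

    /** Pass 1 builds the cache from the first file; pass 2 streams the second file against it. */
    static List<String> matchingKeys(Reader first, Reader second, String element, String keyAttr)
            throws XMLStreamException {
        Set<String> cache = new HashSet<>();
        forEachKey(first, element, keyAttr, cache::add);

        List<String> matches = new ArrayList<>();
        forEachKey(second, element, keyAttr, key -> {
            if (cache.contains(key)) {
                matches.add(key);
            }
        });
        return matches;
    }
}
```

Each file is read exactly once, so the forward-only cursor is no longer a limitation, and the hash lookup replaces the n^2 comparison with work that is roughly linear in the number of elements.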
One solution would be an XML database. These usually have good join optimizers so as well as saving memory they may be able to avoid the O(n^2) elapsed time.
Another solution would be XSLT, using xsl:key to do "manual" optimization of the join logic.
If you explain the logic in more detail there may turn out to be other solutions using XSLT 3.0 streaming.
I need to parse (and transform and write) a large binary file (larger than memory) in Java. I also need to do so as efficiently as possible in a single thread. And, finally, the format being read is very structured, so it would be good to have some kind of parser library (so that the code is close to the complex specification).
The amount of lookahead needed for parsing should be small, if that matters.
So my questions are:
How important is nio v io for a single threaded, high volume application?
Are there any good parser libraries for binary data?
How well do parsers support streaming transformations (I want to be able to stream the data being parsed to some output during parsing - I don't want to have to construct an entire parse tree in memory before writing things out)?
On the nio front my suspicion is that nio isn't going to help much, as I am likely disk limited (and since it's a single thread, there's no loss in simply blocking). Also, I suspect io-based parsers are more common.
Let me try to explain if and how Preon addresses all of the concerns you mention:
I need to parse (and transform and write) a large binary file (larger
than memory) in Java.
That's exactly why Preon was created. You want to be able to process the entire file, without loading it into memory entirely. Still, the program model gives you a pointer to a data structure that appears to be in memory entirely. However, Preon will try to load data as lazily as it can.
To explain what that means, imagine that somewhere in your data structure, you have a collection of things that are encoded in a binary representation with a constant size; say that every element will be encoded in 20 bytes. Then Preon will first of all not load that collection in memory at all, and if you're grabbing data beyond that collection, it will never touch that region of your encoded representation at all. However, if you would pick the 300th element of that collection, it would (instead of decoding all elements up to the 300th element), calculate the offset for that element, and jump there immediately.
From the outside, it is as though you have a reference to a list that is fully populated. From the inside, it only goes out to grab an element of the list if you ask for it. (And forget about it immediately afterward, unless you instruct Preon to do things differently.)
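To make the offset arithmetic concrete, here is a small sketch in plain java.nio. It is not Preon's API or implementation, only an illustration of jumping straight to a fixed-size element; the 20-byte record size comes from the example above, and the class name and the `collectionStart` parameter are hypothetical.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class FixedSizeRecords {
    static final int RECORD_SIZE = 20; // every element is encoded in 20 bytes

    /** Byte offset of the element at zero-based 'index', given where the collection starts. */
    static long offsetOf(long collectionStart, int index) {
        return collectionStart + (long) index * RECORD_SIZE;
    }

    /** Reads exactly one element without touching the ones before it. */
    static byte[] readRecord(Path file, long collectionStart, int index) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(RECORD_SIZE);
            long position = offsetOf(collectionStart, index);
            while (buffer.hasRemaining()) {
                // Positional read: does not move the channel's own position.
                int read = channel.read(buffer, position + buffer.position());
                if (read < 0) {
                    throw new EOFException("record " + index + " is beyond the end of the file");
                }
            }
            return buffer.array();
        }
    }
}
```

The 300th element has index 299, so with the collection starting at byte 0 it begins at byte 299 * 20 = 5980.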
I also need to do so as efficiently as possible in a single thread.
I'm not sure what you mean by efficiently. It could mean efficiently in terms of memory consumption, or efficiently in terms of disk IO, or perhaps you mean it should be really fast. I think it's fair to say that Preon aims to strike a balance between an easy programming model, memory use and a number of other concerns. If you really need to traverse all data in a sequential way, then perhaps there are ways that are more efficient in terms of computational resources, but I think that would come at the cost of "ease of programming".
And, finally, the format being read is very structured, so it would be
good to have some kind of parser library (so that the code is close to
the complex specification).
The way I implemented support for Java byte code, is to just read the byte code specification, and then map all of the structures they mention in there directly to Java classes with annotations. I think Preon comes pretty close to what you're looking for.
You might also want to check out preon-emitter, since it allows you to generate annotated hexdumps (such as in this example of the hexdump of a Java class file) of your data, a capability that I haven't seen in any other library. (Hint: make sure you hover with your mouse over the hex numbers.)
The same goes for the documentation it generates. The aim has always been to make sure it creates documentation that could be posted to Wikipedia, just like that. It may not be perfect yet, but I'm not unhappy with what it's currently capable of doing. (For an example: this is the documentation generated for Java's class file specification.)
The amount of lookahead needed for parsing should be small, if that matters.
Okay, that's good. In fact, that's even vital for Preon. Preon doesn't support lookahead. It does support looking back though. (That is, sometimes part of the encoding mechanism is driven by data that was read before. Preon allows you to declare dependencies that point back to data read before.)
Are there any good parser libraries for binary data?
Preon! ;-)
How well do parsers support streaming transformations (I want to be
able to stream the data being parsed to some output during parsing - I
don't want to have to construct an entire parse tree in memory before
writing things out)?
As I outlined above, Preon does not construct the entire data structure in memory before you can start processing it. So, in that sense, you're good. However, there is nothing in Preon supporting transformations as first-class citizens, and its support for encoding is limited.
On the nio front my suspicion is that nio isn't going to help much, as
I am likely disk limited (and since it's a single thread, there's no
loss in simply blocking). Also, I suspect io-based parsers are more
common.
Preon uses NIO, but only its support for memory-mapped files.
On NIO vs IO you are right, going with IO should be the right choice - less complexity, stream oriented etc.
For a binary parsing library, check out Preon.
Using a Memory Mapped File you can read through it without worrying about your memory and it's fast.
I think you are correct re NIO vs IO unless you have little endian data as NIO can read little endian natively.
I am not aware of any fast binary parsers; generally you want to call NIO or IO directly.
Memory mapped files can help with writing from a single thread as you don't have to flush it as you write. (But it can be more cumbersome to use)
You can stream the data how you like; I don't foresee any problems.
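A minimal sketch combining the memory-mapped file and little-endian points above. The file layout (a plain sequence of 32-bit little-endian integers) and the class name are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedLittleEndian {
    /** Sums every 32-bit little-endian integer in the file by reading through a memory mapping. */
    static long sumInts(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The OS pages the data in on demand; nothing is copied onto the Java heap up front.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            buffer.order(ByteOrder.LITTLE_ENDIAN); // the default order is big-endian
            long sum = 0;
            while (buffer.remaining() >= Integer.BYTES) {
                sum += buffer.getInt();
            }
            return sum;
        }
    }
}
```

A single mapping cannot exceed Integer.MAX_VALUE bytes, so a file larger than about 2 GB has to be mapped in several windows, each with its own position and size.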
I have a List of Strings I need to store locally (assume the list can run between 10 items to 100 items). I want to know if I should write the lists into a Flat database or use Serialization to flatten the object containing the list? Which is more expensive (CPU-wise)? What are the conditions that make one more expensive than the other?
Thanks!!
Especially since they are Strings, just write them out one per line to a file. Simple, fast, and far easier to test.
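A sketch of the one-string-per-line approach with java.nio.file. It assumes that no string contains a line break; if that can happen, the strings need escaping first. The class name is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class LineListStore {
    /** Writes each string on its own line, replacing any existing file content. */
    static void save(Path file, List<String> items) throws IOException {
        Files.write(file, items, StandardCharsets.UTF_8);
    }

    /** Reads the file back, one list entry per line. */
    static List<String> load(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }
}
```

The resulting file can be inspected or edited with any text editor, which is what makes this approach easy to test.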
I have a List of Strings I need to store locally (assume the list can run between 10 items to 100 items).
Assuming that the total length of the strings is small (e.g. less than 10K), the user-space CPU time used to do the saving is likely to be a few milliseconds using either serialization or a flat file. In other words, it will be so fast that the user won't notice the difference.
You should be looking at the other reasons for choosing between the two alternatives (and others):
How easy is it to write the code.
How many extra dependencies does the alternative pull in.
Human readability / editability of the saved data file ... in case you need to do this.
How easy / hard it would be to change the "schema" of the stuff saved to file ... in case you need to do this.
Whether you can update one string without rewriting the whole file ... if this is relevant.
Support for other things such as atomic update, transactions, complex queries, etc ... if these are relevant.
And if, despite what I said above, you still want to know which will be faster (and by how much), then benchmark it. The real world performance will depend on factors that you haven't specified.
Here are a couple of important references on how to write a Java benchmark so that it gives meaningful results.
How NOT to write a Java micro-benchmark
Robust Java benchmarking, Part 1: Issues.
Robust Java benchmarking, Part 2: Statistics and solutions
And you can experiment to answer this part of your question:
What are the conditions that make one more expensive than the other?
(See above)
I am not sure about the expense, but an object representation often carries a lot of metadata (and structure), which can make the object considerably larger than the data it was meant to hold. For example, when you store an XML structure in a DOM object, it takes about 4X the size of the original data in memory.
Based on the above, I think serializing as an object might be more expensive. You may also want to consider how the end product will be consumed. If you want the produced file to be human readable, you will have to write the String data out in a readable text form.
I need to serialize a huge amount of data (around 2gigs) of small objects into a single file in order to be processed later by another Java process. Performance is kind of important. Can anyone suggest a good method to achieve this?
Have you taken a look at google's protocol buffers? Sounds like a use case for it.
I don't know why Java Serialization got voted down, it's a perfectly viable mechanism.
It's not clear from the original post, but is all 2G of data in the heap at the same time? Or are you dumping something else?
Out of the box, Serialization isn't the "perfect" solution, but if you implement Externalizable on your objects, Serialization can work just fine. Serialization's big expense is figuring out what to write and how to write it. By implementing Externalizable, you take those decisions out of its hands, thus gaining quite a boost in performance, and a space savings.
While I/O is a primary cost of writing large amounts of data, the incidental costs of converting the data can also be very expensive. For example, you don't want to convert all of your numbers to text and then back again; better to store them in a more native format if possible. ObjectOutputStream and ObjectInputStream have methods to write and read Java's primitive types.
If all of your data is designed to be loaded into a single structure, you could simply do ObjectOutputStream.writeObject(yourBigDatastructure), after you've implemented Externalizable.
However, you could also iterate over your structure and call writeObject on the individual objects.
Either way, you're going to need some "objectToFile" routine, perhaps several. And that's effectively what Externalizable provides, as well as a framework to walk your structure.
The other issue, of course, is versioning, etc. But since you implement all of the serialization routines yourself, you have full control over that as well.
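A minimal sketch of the Externalizable approach described above, writing the objects one at a time. `Point` is a hypothetical stand-in for one of the small objects; the sketch serializes to a byte array only to stay self-contained, and a real program would wrap a buffered file stream instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

final class ExternalizableDemo {
    public static final class Point implements Externalizable {
        private int x;
        private double weight;

        public Point() { } // Externalizable requires a public no-arg constructor

        public Point(int x, double weight) {
            this.x = x;
            this.weight = weight;
        }

        public int x() { return x; }
        public double weight() { return weight; }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            // You decide exactly what is written: primitives in native binary form.
            out.writeInt(x);
            out.writeDouble(weight);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            // Must read the fields in the same order they were written.
            x = in.readInt();
            weight = in.readDouble();
        }
    }

    /** Writes a count followed by each object individually. */
    static byte[] write(Point... points) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeInt(points.length);
            for (Point p : points) {
                out.writeObject(p);
            }
        }
        return bytes.toByteArray();
    }

    static Point[] read(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            Point[] points = new Point[in.readInt()];
            for (int i = 0; i < points.length; i++) {
                points[i] = (Point) in.readObject();
            }
            return points;
        }
    }
}
```

For versioning, one option is to write a format version number as the first value in writeExternal and branch on it in readExternal.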
The simplest approach that comes immediately to mind is using a memory-mapped buffer from NIO (java.nio.MappedByteBuffer). Use a single buffer (approximately) corresponding to the size of one object and flush/append it to the output file when necessary. Memory-mapped buffers are very efficient.
Have you tried java serialization? You would write them out using an ObjectOutputStream and read 'em back in using an ObjectInputStream. Of course the classes would have to be Serializable. It would be the low effort solution and, because the objects are stored in binary, it would be compact and fast.
I developed JOAFIP as a database alternative.
Apache Avro might also be useful. It's designed to be language independent and has bindings for the popular languages.
Check it out.
protocol buffers : makes sense. here's an excerpt from their wiki : http://code.google.com/apis/protocolbuffers/docs/javatutorial.html
Getting More Speed
By default, the protocol buffer compiler tries to generate smaller files by using reflection to implement most functionality (e.g. parsing and serialization). However, the compiler can also generate code optimized explicitly for your message types, often providing an order of magnitude performance boost, but also doubling the size of the code. If profiling shows that your application is spending a lot of time in the protocol buffer library, you should try changing the optimization mode. Simply add the following line to your .proto file:
option optimize_for = SPEED;
Re-run the protocol compiler, and it will generate extremely fast parsing, serialization, and other code.
You should probably consider a database solution--all databases do is optimize their information, and if you use Hibernate, you keep your object model as is and don't really even think about your DB (I believe that's why it's called hibernate, just store your data off, then bring it back)
If performance is very important, then you need to write it yourself. You should use a compact binary format, because with 2 GB the disk I/O operations matter a lot. If you use a human-readable format like XML or another text format, you increase the size of the data by a factor of 2 or more.
Depending on the data, it can be sped up if you compress the data on the fly with a low compression level.
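A sketch of a hand-written compact binary format compressed on the fly at the fastest Deflater level. The record layout (a count followed by 32-bit integers) and the class name are invented for illustration; whether compression actually helps depends on the data and on how disk-bound the job is, so measure it.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class FastCompressedRecords {
    /** Writes the values in binary form through a low-effort (BEST_SPEED) compressor. */
    static byte[] write(int[] values) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new DeflaterOutputStream(sink, deflater)))) {
            out.writeInt(values.length);
            for (int value : values) {
                out.writeInt(value); // 4 bytes each, no text conversion
            }
        } finally {
            deflater.end(); // a caller-supplied Deflater is not released by the stream
        }
        return sink.toByteArray();
    }

    static int[] read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new InflaterInputStream(new ByteArrayInputStream(data))))) {
            int[] values = new int[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readInt();
            }
            return values;
        }
    }
}
```

In a real program the ByteArrayOutputStream would be replaced by a file stream, so the 2 GB never has to sit in memory at once.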
Java serialization is a total no-go, because on reading Java checks for every object whether it is a reference to an existing object.