I've seen several posts and topics regarding marshaling and serialization and I'm looking to gain some additional understanding/clarify my thought process.
I read What is the difference between Serialization and Marshaling? and many of the responses suggest the two are synonymous in a sense. But I think there may be some differences, which I'm trying to clarify.
My understanding is that java serialization takes an object and makes it into a binary stream which can then be deserialized, as shown in the following example http://www.tutorialspoint.com/java/java_serialization.htm
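A minimal round trip along the lines of that tutorial might look like this (the Employee class and its fields are just an illustrative example, not from the tutorial itself):

```java
import java.io.*;

public class SerializationDemo {
    // A simple serializable bean (hypothetical example class)
    static class Employee implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        int id;
        Employee(String name, int id) { this.name = name; this.id = id; }
    }

    // Serialize to a byte[] and read it straight back
    static Employee roundTrip(Employee original)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(original);           // object -> binary stream
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (Employee) in.readObject();   // binary stream -> object
        }
    }

    public static void main(String[] args) throws Exception {
        Employee copy = roundTrip(new Employee("Reyan Ali", 101));
        System.out.println(copy.name + " " + copy.id); // prints "Reyan Ali 101"
    }
}
```

The byte array could just as well be a file or a socket; the stream classes don't care where the bytes go.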
For marshaling/demarshaling, I've seen classes get converted into an xml representation of the bean and have the information passed between a client and server, then recreated on the other end.
Based on the above my question(s) are:
Does serialization always go to binary format? If so, do we have to worry about different machine architectures like big-endian vs. little-endian, or does Java handle this for us?
If we represent our data over the wire as xml or json, is this always referred to as marshaling/demarshaling?
If the above bullets are true, then is there an advantage to one over the other?
I think it is just a matter of taste and context.
Most of the time, either term means that you want to turn an object into a stream of 0s and 1s.
But sometimes a specification might attach a slightly different meaning to it.
See the java case on wikipedia.
http://en.wikipedia.org/wiki/Marshalling_(computer_science)
Related
I'm coming to Java from a PHP background, and am surprised to see that JSON to object conversion is so constrained. In all the Jackson tutorials I came across, it looks like the object to be read needs to be pre-defined. Thus, if my data is in, say, JSON API format, I need to write boilerplate code to strip out everything except the "data" part, and then somehow convert all the strings into objects one by one.
I really miss PHP's json_decode function, which will read any JSON and give you a PHP object to play with. It also builds the necessary structure into the object, adding arrays and sub-objects as needed. Of course I understand that Java is a compiled language, but I'm wondering how this can be made easier.
As a strongly typed language Java often has less of these "just give it to me"-type of functionalities, but that doesn't mean they don't exist. Even Jackson can deserialize JSON without a predefined schema, giving you Maps and Lists instead of domain objects.
Just remember that if you're working on "real" projects, there are plenty of advantages from having the schemas defined. They weren't invented to annoy you, but to make sure that you can trust your data being in the correct form (or find out early if it's not).
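As a sketch of that schema-less mode (assuming Jackson's jackson-databind is on the classpath; the JSON content here is made up for illustration):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;
import java.util.Map;

public class UntypedJson {
    // Without a target class, Jackson maps JSON objects to Maps
    // and JSON arrays to Lists
    static Map<?, ?> parse(String json) throws Exception {
        return new ObjectMapper().readValue(json, Map.class);
    }

    public static void main(String[] args) throws Exception {
        String json = "{\"data\":{\"name\":\"Ada\",\"tags\":[\"a\",\"b\"]}}";
        Map<?, ?> root = parse(json);
        Map<?, ?> data = (Map<?, ?>) root.get("data");
        List<?> tags = (List<?>) data.get("tags");
        System.out.println(data.get("name") + " has " + tags.size() + " tags");
        // prints "Ada has 2 tags"
    }
}
```

This is close in spirit to PHP's json_decode: you get generic containers and cast as you go, trading compile-time safety for convenience.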
Is an object's implementation of the Serializable interface in any way related to that object's ability to be serialized into JSON or XML?
Is there a name for the text format that Java serialization uses?
If not, should we not use the word "serialization" to describe exporting an object to JSON or XML, to avoid confusion?
At AO, what uses are typical for each of these three serialization methods?
I know that JAXB is usually used to convert XML to Java, not the other way around, but I heard that the reverse is possible too.
Serialization simply refers to exporting an object from a process-specific in-memory format to an inter-process format that can be read and understood by a different process or application. It may be text, or it may be binary; it doesn't matter. It's all serialization. The reverse process (reading and parsing a serialized inter-process format back into an in-memory, in-process format) is called deserialization.
In that sense, serializing an object into an ObjectStream is just as much serialization as serializing it to JSON or XML. ObjectStream serialization is very difficult for non-Java consumers (including humans; it is not "human-readable") to understand or parse, but it is used because it can be done on pretty much any object without any special markup.
JSON/XML on the other hand require extra work to tell the parser how to map them to and from JSON/XML, but are very portable - pretty much every language can understand JSON/XML, including humans - it is "human-readable".
One purpose of serialization of Java objects is being able to write them to a (binary) file from which some Java program can read them back, getting the same objects into its memory. This usage is usually limited to Java applications, writing and reading, although some non-Java app might be written to understand the binary format.
Another frequently used serialization of Java objects is to write them to a text (or binary) file from which some (note the absence of: Java) program can read and reconstruct an object or data structure equivalent to the POJO. This, of course, also works in the reverse direction. (I'm adding "binary", because there are some binary formats not defined by Java that are architecture-independent, e.g., ASN.1.)
And, yes, JAXB works either way, but there are some difficulties if the XML is rather "outlandish", i.e., far away from what JAXB can handle or handle easily. But if you can design either the XML Schema or the Java classes, it works very well. JAXB being part of the JDK, you might prefer using it over other serializations if you need to go from Java to X or back. There are other language bindings for XML.
I've not managed to find any mention of a limit to xml tag length on the web. I'm looking to build XML Schemas that act as a specification for 3rd parties to send data to us.
The Schema (and the data) are supposed to conform to our custom ontology/data dictionary thingy which is hierarchical and user-customizable.
The natural mapping is for nodes in the hierarchy to be used to name types and tags in the XSD/XML. However, because leaf node names in the ontology do not have to be unique, I am considering encoding the full path of nodes in the hierarchy as the tag name, suitably mangled for XML lexical rules.
So if my ontology has multiple 'lisa' nodes meaning different things as they are at different places in the hierarchy I could use the full path to the nodes to generate different XML types/tag names, so you can have
<abe_homer_lisa> simpsons lisa ... </abe_homer_lisa>
<applei_appleii_lisa> ... apple lisa </applei_appleii_lisa>
<mona_lisa> and paintings </mona_lisa>
... data for any of the different 'lisa' types in the same file without ambiguity.
I can't find anything on the web that specifies a maximum tag length (or a minimum supported tag length for standards-compliant engines). (Good summary of the lexical rules for XML here)
The same thing was asked about attribute length and if the standard specifies no limit for attributes then I doubt there's one for tags, but there may be a practical limit.
I suspect even a practical limit would be vastly bigger than my needs (I would expect things to be smaller than 255 chars most of the time); basically if the Java XML processors, standard ETL tools and the common XSLT processors can all handle tags much bigger than this then it won't be an issue.
I think you're unlikely to find tools that can't handle names of say 1K characters, at which point you're hitting serious performance and usability problems rather than hard limits.
But your design is wrong. XML is hierarchic, take advantage of the fact rather than trying to fight it.
There is no limit to tag name lengths that I know of, but there can be implementation limits depending on the tool that parses the XML, even if the XML specification does not mention any.
On the other hand, why not use XML's native, inherently hierarchical structure? Why encode everything as <abe_homer_lisa> instead of encoding it as:
<abe>
  <homer>
    <lisa>simpsons lisa</lisa>
  </homer>
</abe>
<applei>
  <appleii>
    <lisa> ... apple lisa </lisa>
  </appleii>
</applei>
I would strongly suggest using an established XML mechanism to distinguish elements, namely namespaces. That way you would have, e.g.,
<lisa xmlns="http://example.com/simpsons">..</lisa>
<lisa xmlns="http://example.com/apple">...</lisa>
Both the W3C schema language as well as XSLT and XPath fully support namespaces.
Based on the comments of Michael Kay (something of an expert on XML) and Mihai Stancu above I'd say the answer to my original question was:
No official limit
Tools likely to support 1000+ chars as an absolute minimum
Likely to hit problems in performance [given an XML tool processing those files would have to do lots of string indexing and comparison on very long strings] and usability way before then
XML namespaces and/or using the structure of the document tree to provide discriminating context would probably be better ways of "uniquifying" tag names
I was after an answer to that very specific question about legal tag length, and since I found the same question asked about attribute length but not tags I thought it might be worth having "an" answer around in case someone else googles it. Thanks to all respondents. Valid points about whether my design was sensible; I'll explain the rationale elsewhere.
Thanks to those who pointed out there might be more sensible ways to address the underlying problem (ensuring types/tag names in an XML schema are unique).
Re using a hierarchy of nodes to provide the context:
I agree this would generally be appropriate. However (I didn't really explain my precise problem domain in the q) in this particular case, the user-configurable grouping of items in the tree-structure data dictionary I have to deal with is pretty arbitrary and has almost nothing to do with relationships in the data that the dictionary describes.
So in the
<abe>
<homer>
<lisa>lisa1</lisa>
</homer>
</abe>
example should another lisa node be under the same homer node, or a different one? Should the homers be under the same abe node or not? In the case of the data in question, the distinction is more or less meaningless: it would be like grouping data according to the page of an index it happened to be referenced on in a particular book. I suppose I could just make an arbitrary call and lock it down in the XSD.
If using something like XSL to extract data then it wouldn't matter, //abe/homer/lisa would get all of the lisa nodes irrespective of how they were grouped together. In practice someone is likely to be generating these from CSV files or whatever so I'd prefer as flat a structure as possible.
Ditto for namespaces: although they're designed for this very purpose (providing context for a name and ensuring that accidental clashes do not cause ambiguity when different types of data are bundled together in a file), in practice they'd add an extra layer of complexity to whoever generates the data from source systems.
In my precise circumstances, I expect name clashes in this arbitrary grouping to be pretty unlikely (and reflect poor usage), and hence just need reasonable handling, without imposing an undue penalty on the majority case.
Contrary to conventional wisdom, I would strongly advise against using the so-called XML Namespaces mechanism. Over the longer haul, it will cause you pain. Just say no to namespaces. You do not need them.
Your intuition that elements can be distinguished by their context - represented, in this case, by their "paths" - is correct. However, your idea of encoding the entire path into the name of an element may not be optimal. Consider instead using the simple name, along with an attribute to hold the context or path. (Name this attribute 'context' or 'path' or anything more evocative!) This will be enough to distinguish the meanings.[*]
For varying content models, you can use a variant of the same technique. Give each different type a circumstantially convenient name, and record the "real" name in another attribute named, say 'ontology'.
As for your question, the XML spec does not place any inherent limitation on the length of names, although for purely technical reasons you may find a limit of 65536 characters quoted in some places. That same "limitation" may also apply to the length of attribute value literals. At an average of 20 characters per atomic name, 20 levels of hierarchy would still amount to fewer than 500 bytes for a path, so you probably have little to worry about.
[*] Note: this technique is actually very old, but almost completely forgotten in XML mindspace. In HTML, for example, there is a single element type named INPUT to cover all sorts of GUI controls, and yet there is no confusion, thanks to the 'type' attribute.
I'm looking for some info on the best approach to serialize a graph of objects based on the following (Java):
Two objects of the same class with equal state must serialize to binary-equal (bit-by-bit identical) output. (Must not depend on JVM field ordering.)
Collections are only modeled with arrays (no Collection classes).
All instances are immutable
Serialization format should be in byte[] format instead of text based.
I am in control of all the classes in the graph.
I don't want to put an empty constructor in the classes just to support serialization.
I have looked at implementing a solution based on my own traversal and on Objenesis, but my problem does not seem that unique. Better to check for an existing/complete solution first.
Updated details:
First, thanks for your help!
Objects must serialize to exactly the same bit order based on the objects state. This is important since the binary content will be digitally signed. Reconstruction of the serialized format will be based on the state of the object and not that the original bits are stored.
Interoperability between different technologies is important. I do see the software running on, e.g., .NET in the future. No Java flavour in the serialized format.
Note on comments of immutability: The values of the arrays are copied from the argument to the inner fields in the constructor. Less important.
Best regards,
Niclas Lindberg
You could write the data yourself, using reflection or hand-coded methods. I use methods that look hand-coded, except they are generated (the performance of hand-coded methods, with the convenience of not having to rewrite the code when it changes).
Often developers talk about the built-in Java serialization, but you can have a custom serialization that does whatever you want, any way you want.
To give you a more detailed answer, it would depend on what you want to do exactly.
BTW: You can serialize your data into a byte[] and still make it human-readable/text-like/editable in a text editor. All you have to do is use a binary format that looks like text. ;)
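For instance, a byte[] whose contents happen to be UTF-8 text stays editable in any text editor; the key/value layout below is just an invented illustrative format:

```java
import java.nio.charset.StandardCharsets;

public class TextLikeBinary {
    // Encode a record as UTF-8 text bytes: still a byte[] to the program,
    // but readable and editable in any text editor.
    static byte[] encode(String name, int age) {
        return ("name=" + name + "\nage=" + age + "\n")
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] record = encode("Peter", 34);
        // Decoding it back is just a String constructor call
        System.out.print(new String(record, StandardCharsets.UTF_8));
    }
}
```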
Maybe you want to familiarize yourself with the serialization frameworks available for Java. A good starting point is the thrift-protobuf-compare project, whose name is misleading: it compares the performance of more than 10 ways of serializing data using Java.
It seems that the hardest constraint you have is interoperability between different technologies. I know that Google's Protocol Buffers and Thrift deliver here. Avro might also fit.
The important thing to know about serialization is that it is not guaranteed to be consistent across multiple versions of Java. It's not meant as a way to store data on a disk or anywhere permanent.
It's used internally to send objects from one JVM to another during RMI or some other network protocol. These are the types of applications you should use serialization for. If this describes your problem - short-term communication between two different JVMs - then you should try to get serialization going.
If you're looking for a way to store the data more permanently or you will need the data to survive in forward versions of Java, then you should find your own solution. Given your requirements, you should create some sort of method of converting each object into a byte stream yourself and reading it back into objects. You will then be responsible for making sure the format is forward compatible with future objects and features.
I highly recommend Chapter 11 of Effective Java by Joshua Bloch.
Is the Externalizable interface what you're looking for ? You fully control the way your objects are persisted and you do that the OO-style, with methods that are inherited and all (unlike the private read-/write-Object methods used with Serializable). But still, you cannot get rid of the no-arg accessible constructor requirement.
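A sketch of what Externalizable usage could look like (the Point class and its fields are invented for illustration):

```java
import java.io.*;

// Sketch: Externalizable gives full control over the wire format,
// but the class still needs a public no-arg constructor.
public class Point implements Externalizable {
    int x, y;

    public Point() {}                          // required for deserialization
    public Point(int x, int y) { this.x = x; this.y = y; }

    @Override public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(x);                       // we decide exactly what is written
        out.writeInt(y);
    }

    @Override public void readExternal(ObjectInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }

    static Point roundTrip(Point p) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(p);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (Point) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Point p = roundTrip(new Point(3, 4));
        System.out.println(p.x + "," + p.y); // prints "3,4"
    }
}
```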
The only way you would get this is:
A/ Use UTF-8 text, i.e., XML or JSON, with binary turned to Base64 (the HTTP/XML-safe variety).
B/ Enforce UTF-8 binary ordering of all data.
C/ Pack the contents, stripping all unescaped whitespace.
D/ Hash the content and provide that hash in a positionally standard location in the file.
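One way steps A-D might be sketched, assuming SHA-256 as the hash and an invented three-field object state (the canonicalization rules here are purely illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Base64;

public class CanonicalHash {
    // Hypothetical helper: emit the object's state as one canonical UTF-8
    // string in a fixed field order, binary fields as Base64 (steps A-C).
    static byte[] canonicalBytes(String name, int id, byte[] payload) {
        String canonical = "id=" + id
                + "\nname=" + name
                + "\npayload=" + Base64.getEncoder().encodeToString(payload);
        return canonical.getBytes(StandardCharsets.UTF_8);
    }

    // Step D: hash the canonical content; this hash is what gets signed.
    static byte[] digest(byte[] canonical) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(canonical);
    }

    public static void main(String[] args) throws Exception {
        byte[] a = digest(canonicalBytes("niclas", 42, new byte[]{1, 2, 3}));
        byte[] b = digest(canonicalBytes("niclas", 42, new byte[]{1, 2, 3}));
        // Same state -> same canonical bytes -> same hash, on any JVM/platform
        System.out.println(Arrays.equals(a, b)); // prints "true"
    }
}
```

The point is that signature stability comes from the canonical byte layout being defined by the object's state, not by any JVM internals.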
I've been lately trying to learn more and generally test Java's serialization for both work and personal projects and I must say that the more I know about it, the less I like it. This may be caused by misinformation though so that's why I'm asking these two things from you all:
1: On byte level, how does serialization know how to match serialized values with some class?
One of my problems right here is that I made a small test with an ArrayList containing the values "one", "two", "three". After serialization the byte array took 78 bytes, which seems an awful lot for such a small amount of information (19+3+3+4 bytes). Granted, there's bound to be some overhead, but this leads to my second question:
2: Can serialization be considered a good method for persisting objects at all? Now obviously if I used some homemade XML format, the persistence data would be something like this
<object class="java.util.ArrayList">
  <!-- Object array inside ArrayList is called elementData -->
  <field name="elementData">
    <value>One</value>
    <value>Two</value>
    <value>Three</value>
  </field>
</object>
which, like XML in general, is a bit bloated and takes 138 bytes(without whitespaces, that is). The same in JSON could be
{
"java.util.ArrayList": {
"elementData": [
"one",
"two",
"three"
]
}
}
which is 75 bytes so already slightly smaller than Java's serialization. With these text-based formats it's of course obvious that there has to be a way to represent your basic data as text, numbers or any combination of both.
So to recap, how does serialization work on byte/bit level, when it should be used and when it shouldn't be used and what are real benefits of serialization besides that it comes standard in Java?
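To reproduce the size measurement from the question (the exact count can vary slightly by JVM version, so no fixed number is asserted here):

```java
import java.io.*;
import java.util.*;

public class SizeCheck {
    // Serialize an object and return the number of bytes produced
    static int serializedSize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        List<String> list = new ArrayList<>(Arrays.asList("one", "two", "three"));
        // Most of these bytes are the stream header and the class descriptor
        // for ArrayList, not the three short strings themselves.
        System.out.println(serializedSize(list));
    }
}
```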
I would personally try to avoid Java's "built-in" serialization:
It's not portable to other platforms
It's not hugely efficient
It's fragile - getting it to cope with multiple versions of a class is somewhat tricky. Even changing compilers can break serialization unless you're careful.
For details of what the actual bytes mean, see the Java Object Serialization Specification.
There are various alternatives, such as:
XML and JSON, as you've shown (various XML flavours, of course)
YAML
Facebook's Thrift (RPC as well as serialization)
Google Protocol Buffers
Hessian (web services as well as serialization)
Apache Avro
Your own custom format
(Disclaimer: I work for Google, and I'm doing a port of Protocol Buffers to C# as my 20% project, so clearly I think that's a good bit of technology :)
Cross-platform formats are almost always more restrictive than platform-specific formats for obvious reasons - Protocol Buffers has a pretty limited set of native types, for example - but the interoperability can be incredibly useful. You also need to consider the impact of versioning, with backward and forward compatibility, etc. The text formats are generally hand-editable, but tend to be less efficient in both space and time.
Basically, you need to look at your requirements carefully.
The main advantage of serialization is that it is extremely easy to use, relatively fast, and preserves actual Java object graphs.
But you have to realize that it's not really meant to be used for storing data, but mainly as a way for different JVM instances to communicate over a network using the RMI protocol.
See the Java Object Serialization Stream Protocol for a description of the file format and grammar used for serialized objects.
Personally, I think the built-in serialization is acceptable for persisting short-lived data (e.g., storing the state of a session object between two HTTP requests) that is not relevant outside your application.
For data that has a longer lifetime or should be used outside your application, I'd persist into a database or at least use a more commonly used format...
How does Java's built-in serialization work?
Whenever we want to serialize an object, we implement the java.io.Serializable interface. The interface does not have any methods to implement; we implement it only to indicate something to the compiler or JVM (it is known as a marker interface). When the JVM sees that a class is Serializable, it reads and writes the class's fields automatically via reflection. If you want to control the process yourself, you can define the following two (optional) private methods in the class:
private void writeObject(java.io.ObjectOutputStream stream)
        throws IOException {
    stream.writeObject(name);    // object property
    stream.writeObject(address); // object property
}

private void readObject(java.io.ObjectInputStream stream)
        throws IOException, ClassNotFoundException {
    name = (String) stream.readObject();    // object property
    address = (String) stream.readObject(); // object property
}
When it should be used instead of some other persistence technique?
The built-in serialization is useful when both sender and receiver are Java. If you want to avoid the problems described above, use XML or JSON with the help of frameworks.
I bumped into this dilemma about a month ago (see the question I asked).
The main lesson I learned from it is to use Java serialization only when necessary and there's no other option. Like Jon said, it has its downsides, while other serialization techniques are much easier, faster and more portable.
Serializing means that you write the structured data in your classes as a flat sequence of bytes in order to save it.
You should generally use techniques other than the built-in Java method. It is made to work out of the box, but if the contents or field order of your serialized classes change in the future, you will get into trouble because you will no longer be able to load them correctly.
The advantage of Java Object Serialization (JOS) is that it just works. There are also tools out there that do the same as JOS, but use an XML format instead of a binary format.
About the length: JOS writes some class information at the start, instead of as part of each instance - e.g. the full field names are recorded once, and an index into that list of names is used for instances of the class. This makes the output longer if you write only one instance of the class, but is more efficient if you write several (different) instances of it. It's not clear to me if your example actually uses a class, but this is the general reason why JOS is longer than one would expect.
BTW: this is incidental, but I don't think JSON records class names (as you have in your example), and so it might not do what you need.
The reason why storing a tiny amount of information in serialized form is relatively large is that it stores information about the classes of the objects it is serialising. If you store a duplicate of your list, you'll see that the file hasn't grown by much. Store the same object twice and the difference is tiny.
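A quick way to observe this sharing effect, assuming the same ArrayList example as in the question: writing the same object twice into one stream only adds a small back-reference the second time.

```java
import java.io.*;
import java.util.*;

public class SharedRefDemo {
    // Serialize a sequence of objects into one stream, return total bytes
    static int sizeOf(Object... objects) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            for (Object o : objects) {
                oos.writeObject(o);
            }
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        List<String> list = new ArrayList<>(Arrays.asList("one", "two", "three"));
        int once = sizeOf(list);
        int twice = sizeOf(list, list); // 2nd write is just a back-reference
        System.out.println(twice - once); // only a few extra bytes
    }
}
```

The stream records the class descriptor and the object data once, then refers back to them by handle on subsequent writes of the same instance.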
The important pros are: relatively easy to use, quite fast and can evolve (just like XML). However, the data is rather opaque, it is Java-only, tightly couples data to classes and untrusted data can easily cause DoS. You should think about the serialised form, rather than just slapping implements Serializable everywhere.
If you don't have too much data, you can save objects into a java.util.Properties object. An example of a key/value pair would be user_1234_firstname = Peter. Using reflection to save and load objects can make things easier.
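A small sketch of that Properties approach (the key names follow the example's convention and are otherwise made up):

```java
import java.io.*;
import java.util.Properties;

public class PropsDemo {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("user_1234_firstname", "Peter");
        props.setProperty("user_1234_lastname", "Parker");

        // store() writes the familiar key=value text format
        StringWriter out = new StringWriter();
        props.store(out, "user data");

        // load() reads it back
        Properties loaded = new Properties();
        loaded.load(new StringReader(out.toString()));
        System.out.println(loaded.getProperty("user_1234_firstname"));
        // prints "Peter"
    }
}
```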