I know deserialization can be vulnerable when an object is serialized with the standard "Serializable" interface (refer to this). But does this vulnerability also apply when an object is serialized to XML or JSON? And if it does, how does that happen?
I can't really see how that could happen, so I would appreciate some examples.
Thanks in advance.
That depends quite specifically on the serialization library you use to deserialize objects, and often on the parameters used, so it's hard to provide a single answer.
As to "is it possible", yes, it's possible. Here's a sample exploit for XStream, for example:
http://blog.diniscruz.com/2013/12/xstream-remote-code-execution-exploit.html
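In a nutshell, that class of exploit works because an unconfigured (or old) XStream instance will instantiate whatever classes the incoming XML names. A minimal sketch of the difference between naive and allow-listed use (Invoice is a made-up domain class; allowTypes is available from XStream 1.4.7 onwards):

import com.thoughtworks.xstream.XStream;

class XStreamSketch {
    static class Invoice {        // stand-in domain class for this sketch
        String id;
    }

    // Dangerous with older / unconfigured XStream: the parser instantiates whatever
    // class the XML element names, which is the behaviour the linked exploit abuses.
    static Object unsafeRead(String untrustedXml) {
        return new XStream().fromXML(untrustedXml);
    }

    // Safer: allow-list only the types you actually expect.
    static Invoice saferRead(String untrustedXml) {
        XStream xstream = new XStream();
        xstream.allowTypes(new Class[] { Invoice.class });
        return (Invoice) xstream.fromXML(untrustedXml);
    }
}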
A general chat around the topic follows:
A good defence against bad data is to use a serialisation technology that allows one to write a full specification. By full specification, I mean not only the structure / content of objects, but that every value field's valid range can be specified, and every list/array length specified.
There aren't many that do this. ASN.1, XSD (for XML), and, AFAIK, JSON Schema can all have value and size constraints. Interestingly, there is a formally defined translation between ASN.1 and XSD schemas.
It's then down to whether or not the tools you use actually do anything with these. Most of the ASN.1 tools I've seen do this very well, and will also tell you if you're trying to serialise an object that doesn't conform to the schema. The idea is that bad data is rejected as it's read (so you never get an invalid object in memory), and you can never accidentally send / write bad data yourself, even if you wanted to.
Some XSD tools do constraints checking. I think xsd2code++ does. AFAIK xsd.exe from Microsoft does not.
I'm not so familiar with the land of JSON, but as far as I can tell one tends to read in whole objects and then compare them to the schema (which strikes me as being "too late"), rather than have some autogenerated code read the data and check it for you as it does so. When serialising objects it's up to the programmer to compare the result to the schema.
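For illustration, here's what that "read the whole thing, then check it" flow typically looks like in Java, assuming the networknt json-schema-validator library (the exact factory/validate API varies a little between versions):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.networknt.schema.JsonSchema;
import com.networknt.schema.JsonSchemaFactory;
import com.networknt.schema.SpecVersion;
import com.networknt.schema.ValidationMessage;

import java.util.Set;

class JsonValidationExample {
    // A schema with value and size constraints: name at most 100 chars, age in 0..150.
    private static final String SCHEMA =
        "{ \"type\": \"object\","
        + "  \"properties\": {"
        + "    \"name\": { \"type\": \"string\", \"maxLength\": 100 },"
        + "    \"age\":  { \"type\": \"integer\", \"minimum\": 0, \"maximum\": 150 }"
        + "  },"
        + "  \"required\": [\"name\", \"age\"] }";

    static Set<ValidationMessage> check(String json) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode data = mapper.readTree(json);            // the whole document is parsed first...
        JsonSchema schema = JsonSchemaFactory
            .getInstance(SpecVersion.VersionFlag.V7)
            .getSchema(SCHEMA);
        return schema.validate(data);                     // ...and only then checked against the schema
    }
}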
In contrast, technologies like Google Protocol Buffers don't let you do constraints checking at all. With GPB the best you can do is comment the .proto file and hope developers read it.
The code-first approach (directly writing serialisable classes in C# / Java) can do constraint checks, but only if you write the checking code yourself.
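In other words, the "schema" in code-first land is whatever your constructors and setters enforce. A minimal, hand-rolled sketch (the field and its valid range are made up):

// Code-first constraint checking: the class simply refuses to hold bad data.
public final class Temperature {
    private final int celsius;

    public Temperature(int celsius) {
        if (celsius < -90 || celsius > 60) {   // valid range, enforced on construction
            throw new IllegalArgumentException("temperature out of range: " + celsius);
        }
        this.celsius = celsius;
    }

    public int celsius() {
        return celsius;
    }
}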
Useful Old Technology
Of all the serialisations I've ever used, by far the most rigorous has been ASN.1 (using decent ASN.1 tools). It's old and very telecommunications-ish (late 1980s, from the ITU; if you have trouble sleeping, go read one of their standards). However, despite its age it's still bang up to date, continually evolving.
For example, since its original days it has grown several surprisingly modern wire formats: XML and JSON. Yes, that's right; you can have an ASN.1 schema that gets compiled to code (C++, Java, C#) that will serialise to XML or JSON data formats (as well as ASN.1's more traditional binary formats like BER, uPER, etc.).
The combination of constraint rigour and data-format flexibility is surprisingly useful; you can receive an ultra-compact, bit-packed uPER message from a radio, have it constraints-checked as you read it, and then pass it on elsewhere as JSON/XML, all without having to write any code by hand.
When it comes to complex systems integration problems, I've not found anything to beat it.
Related
I'm working on a proprietary TCP protocol. This protocol sends and receives messages with a specific sequence of bytes.
I have to be compliant with this protocol, and I can't change it.
So my input / output looks something like this:
\x01\x08\x00\x01\x00\x00\x01\xFF
\x01 - Message type
\x00\x01 - Length
\x00\x00\x01 - Transaction
\xFF - Body
The sequence of fields is important, and I want only the values of the fields in my serialization, nothing about the structure of the class.
I'm working on a Java controller that uses this protocol, and I thought I could define the message structures in specific classes and serialize/deserialize them, but I was naive.
First of all I tried ObjectOutputStream, but it outputs the entire structure of the object, when I need only the values in a specific order.
Someone has already faced this problem:
Java - Object to Fixed Byte Array
and solved it with a dedicated Marshaller.
But I was searching for a more flexible solution.
For text serialization and deserialization I've found:
http://jeyben.github.io/fixedformat4j/
which defines the schema of the line with annotations. But it outputs a String, not a byte[], so 1 is output as "1", which is represented differently depending on the encoding, and often with more bytes.
What I was searching for is something that, given the order of my class properties, will convert each property into bytes (based on its internal representation) and append them to a byte[].
Do you know some library used for that purpose?
Or a simple way to do that, without coding a serialization algorithm for each of my entities?
Serialization just isn't easy; it sounds from your question like you feel you can just invoke something and out rolls compact, simple, versionable, universal data you can then put on the wire. What you need to fix is to scratch the word 'just' from that sentence. You're going to have to invest some time and care.
As you've figured out already, Java's baked-in serialization has a ton of downsides. Don't use it.
There are various serializers. The popular ones are things like GSON or Jackson, which let you serialize Java objects into JSON. This isn't particularly efficient, and it is string based. These sound like crucial downsides, but they really aren't; see below.
You can also spend a little more time specifying the exact format and use protobuf, which lets you write a quite lean and simple data protocol (and protobuf is available for many languages, in case you eventually want to write a participant in this protocol in something other than Java).
So, those are the good options: Go to JSON via Jackson or GSON, or, use protobuf.
But JSON is a string.
You can turn a string to bytes trivially using str.getBytes(StandardCharsets.UTF_8). This cannot fail due to charset encoding differences (as long as you also 'decode' in the same fashion: turn the bytes into a string with new String(theBytes, StandardCharsets.UTF_8)). UTF-8 is guaranteed to be available on all JVMs; if it is not there, your JVM is as broken as a JVM that is missing the String class - not something to worry about.
But JSON is inefficient.
Zip it up, of course. You can trivially wrap an InputStream and an OutputStream so that gzip compression is applied, which is simple, available on just about every platform, and fast (it's not the most efficient cutting-edge compression algorithm, but usually squeezing the last few bytes out is not worth it) - and zipped-up JSON can often be more efficient than carefully hand-rolled protobuf, even.
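For what it's worth, the whole pipeline is a few lines; a minimal sketch using Jackson (any JSON mapper will do the same job):

import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

class WireFormat {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Serialize any Jackson-friendly object to gzipped JSON bytes, ready for the wire.
    static byte[] toGzippedJson(Object message) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (OutputStream gzip = new GZIPOutputStream(buffer)) {
            MAPPER.writeValue(gzip, message);   // JSON text is written straight into the gzip stream
        }
        return buffer.toByteArray();
    }
}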
The one downside is that it's 'slow', but on modern hardware, note that the overhead of encrypting and decrypting this data (which you should obviously be doing!!) is usually multiple orders of magnitude more involved. A modern CPU is simply very, very fast - creating JSON and zipping it up is going to take 1% of CPU or less even if you are shipping the collected works of Shakespeare every second.
If an Arduino running on batteries needs to process this data, go with uncompressed, unencrypted protobuf-based data. If you are Facebook writing the WhatsApp protocol, the IaaS costs saved by not having to unzip and decode JSON are tiny and pale in comparison to what you spend just running the servers, but at that scale it's worth the development effort.
In just about every other case, just toss gzipped JSON on the line.
I am writing an application that stores references for books, journals, sites, and so on. I mean, I have already done most of it.
What I want is a suggestion on the best way to implement the above specs.
What text-file format should I use to store the library? Not the file type, but the format. I am using a simple text file at the moment, but I am planning to implement a format like the one below.
<book><Art of Human Hacking><New York><2011><1>
<journal><Achieving Maximum Speed In 802.11n><London><2009><260-265>
The first tags, <book> and <journal>, are type identifiers. I have used an ArrayList. Should I use a multi-dimensional ArrayList and store each item like below?
[[Art of Human Hacking,New York,2011,1][Achieving Maximum Speed In 802.11n,London,2009,260-265]]
I have used StringTokenizer, but it cannot handle values that contain spaces. How do I fix this?
I have already implemented all features, including listing all, listing unique, searching, editing, removing, and adding. But everything only works for content without spaces.
You should use an existing serializer instead of writing your own, unless the project forbids it.
For compatibility and human readability, CSV would be your best bet. Use an existing CSV parser to get your escaping correct (not that hard to write yourself, but difficult enough to warrant using an existing parser to avoid bugs). Google turned up: http://sourceforge.net/projects/javacsv/
If human editing is not a priority, then JSON is also a good format (it is human readable, but not quite as simple as CSV, and won't display in Excel, for instance).
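If you go the JSON route, a library like Gson will handle the quoting and escaping for you; a minimal sketch (the Reference fields are hypothetical, modelled on your example records):

import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

import java.lang.reflect.Type;
import java.util.List;

class LibraryStore {
    // Hypothetical record type; one entry per book/journal.
    static class Reference {
        String type;      // "book" or "journal"
        String title;
        String city;
        int year;
        String pages;
    }

    private static final Gson GSON = new Gson();
    private static final Type LIST_TYPE = new TypeToken<List<Reference>>() {}.getType();

    static String save(List<Reference> library) {
        return GSON.toJson(library);           // write this string to your text file
    }

    static List<Reference> load(String json) {
        return GSON.fromJson(json, LIST_TYPE); // escaping and spaces inside values are handled for you
    }
}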
Then you have binary protocols, such as native Java serialization, or even Google protocol buffers. These may be faster, but obviously lose the ability to view the data store and would probably complicate your debugging.
I have a large data structure which I'm serializing. At certain times I need to edit values in the data structure, but just to change a small value I have to re-serialize the whole thing instead of updating the changed value in the file. I've heard of Google Protocol Buffers. Will using them solve my problem of rewriting the file? Are they a better option for me than Java serialization?
Protocol buffers are themselves a serialization format, so they won't fundamentally change the picture (you'll still need to re-serialize after you change a value).
Google's docs claim that protocol buffers are more compact and faster to parse than XML (which seems plausible); I don't know how they compare to native Java serialization.
Advantages of protocol buffers might be portability (if programs written in other languages need to read the file) and upgradability (you can add new fields to the data structure without breaking the file format).
A couple of points
There is an editor for Protocol Buffers binary format (http://code.google.com/p/protobufeditor/)
Protocol buffers has a text format that looks like:
# Textual representation of a protocol buffer.
# This is *not* the binary format used on the wire.
person {
name: "John Doe"
email: "jdoe@example.com"
}
See:
Discussion: http://groups.google.com/group/protobuf/browse_thread/thread/04fc478088137bf3
Class: http://code.google.com/apis/protocolbuffers/docs/reference/java/com/google/protobuf/TextForm
Having said that, I would use a technology (JSON, XML, etc.) that is already in use, unless one of the following applies:
You need the performance of protocol buffers
You already / plan to use protocol buffers
If you care about performance, don't use a text format for your data. If you want to modify the data without deserializing, you'll want to use a fixed record data format. You'll probably have to invent this manually. Then seek to the correct position in the file and rewrite just the changed field. You might look at DataOutputStream to get started or instead use a database such as HSQLDB to store and edit your data.
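To make that concrete, here is a minimal sketch of the seek-and-rewrite idea using RandomAccessFile, with a made-up fixed record layout (4-byte id, 4-byte value, 4-byte flags):

import java.io.IOException;
import java.io.RandomAccessFile;

class FixedRecordFile {
    // Hypothetical layout: each record is exactly 12 bytes (int id, int value, int flags).
    private static final int RECORD_SIZE = 12;

    // Overwrite just the 'value' field of record n without touching the rest of the file.
    static void updateValue(RandomAccessFile file, int recordIndex, int newValue) throws IOException {
        long offset = (long) recordIndex * RECORD_SIZE + 4;  // skip the 4-byte id field
        file.seek(offset);
        file.writeInt(newValue);
    }
}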
Thinking about this more: unless your objects are very simple, I think a database would be a better way to go.
More info on DataOutputStream:
http://download.oracle.com/javase/tutorial/essential/io/datastreams.html
Java Databases:
http://java-source.net/open-source/database-engines
You need a serialization format that can be modified directly, for example XML or JSON. Google Protocol Buffers is a binary format -- as is Java serialization -- and thus cannot be modified directly...
I'm using json to pass data between the browser and a java server.
I'm using Json-lib to convert between java objects and json.
I'd like to strip out suspicious-looking stuff (e.g. "doSomethingNasty()") from the user input while converting from JSON to Java.
I can imagine several points at which I could do this:
I could examine the raw json string and strip out funny-looking stuff
I could look for a way to intercept every json value on its way into the java object, and look for funny stuff there.
I could traverse my new Java objects immediately after reconstitution from JSON, look for any fields that are Strings, and strip stuff out there.
What's the best approach? Are there any technologies built for this task that I can tack on to what I have already?
I suggest approach 3: traverse the reconstructed Java objects immediately upon arrival, and before any other logic can act on them. Build the most restrictive validation you can get away with (that is, do white-listing).
You can probably do this in a single depth-first traversal of the object hierarchy that you retrieve from Json-lib. In general, you will not want to accept any JSON functions in your objects, and you will want to make sure that all values (numbers, strings, depth of object tree, ...) are within expected ranges. Yes, it is a huge hassle to do this well, but believe me, the alternative to good user-input validation is much, much worse. It may be a good idea to add logging for whenever you chop things out, to diagnose any possible bugs in your validation code.
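A flat, minimal sketch of that traversal (a real version would recurse into nested objects and collections, and the allowed-character pattern here is just an example white-list):

import java.lang.reflect.Field;
import java.util.regex.Pattern;

class InputSanitizer {
    // White-list: only letters, digits, spaces and a little punctuation, up to 200 chars.
    private static final Pattern ALLOWED = Pattern.compile("[\\p{Alnum} .,@'-]{0,200}");

    // Walk the declared String fields of a freshly reconstituted object and reject anything suspicious.
    static void validate(Object bean) throws IllegalAccessException {
        for (Field field : bean.getClass().getDeclaredFields()) {
            if (field.getType() != String.class) {
                continue;
            }
            field.setAccessible(true);
            String value = (String) field.get(bean);
            if (value != null && !ALLOWED.matcher(value).matches()) {
                throw new IllegalArgumentException("rejected value in field " + field.getName());
            }
        }
    }
}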
As I understand it, you need to validate the JSON data coming into your application.
If you want to do white-listing ("you know the data you expect, and nothing else is valid"), then it makes sense to validate your Java objects once they are created (make sure not to send the Java object to the DB or back to the UI in any way before validation is done).
If you want to do black-listing of characters (you know some of the threat characters you want to avoid), then you can look directly at the JSON string, as this validation would not change much over time, and even if it does, you only need to update one common place. For white-listing, it would depend on your business logic.
I'm new to Castor and data binding in general. I'm working on an application that, in part, needs to take data off of a socket and unmarshall the data to make POJOs. Now, I've got the socket stuff down, and I've even generated and compiled java files thanks to Ant and Castor.
Here's the problem: the data stream that I'll receive could be one of about 9 different objects. That is, I receive a stream of text (XML) that represents an object with stuff that I'll operate on; again, depending on the object type. If it were just one object, it'd be easy: call the unmarshal commands on it and go on my merry way. But, since it could be one of many kinds of objects, how do I know what to unmarshal? I read up on mapping, but either I didn't get it, or it seems like a static mapping, not a dynamic one.
Any help out there?
You are right, Castor expects a static mapping. But you can work with that. You can write some code that will modify the incoming xml so that, on your side, Castor can use one schema, and on your clients' side they don't have to change their schemas.
Change the schema that Castor expects to something with a common root element, with, under that, your nine different alternatives for your different objects (I think you can restrict it so the schema will allow only one of the nine; if that doesn't work out, you could just make all the sub-elements optional).
Then you can write code that modifies the incoming xml to wrap your incoming xml with that common root-element, then feeds the wrapped xml into a stream that gets read by the Castor unmarshaller.
There are at least 3 different ways to implement the xml-wrapping part: SAX, XSLT, and XML libraries (like JDOM, DOM4J, and XOM--I prefer XOM but any of them will work).
The SAX way is probably best if you're already familiar with SAX, or if one of the other ways has worked but come up short on performance. If I had to implement that, I would create an XMLFilter that takes XML in and writes XML out, stack that on top of another piece that writes XML to an OutputStream, and write a wrapper method around the unmarshalling code that feeds the incoming stream to the XMLReader, copies the OutputStream to another InputStream (an easy way is to use commons-io), and feeds the new InputStream to the Castor unmarshaller.
With XSLT there is no fooling with SAX. Although XSLT has a reputation for pain, it seems to me like this might be a relatively straightforward transformation, but I haven't taken a stab at it. It has been a long time since I used XSLT for anything. I am not sure about performance either, though I wouldn't write it off out of hand.
Using XOM or JDOM or DOM4J to wrap the XML is also possible, and the learning curve is a lot lower than for SAX or XSLT. The downside is the whole XML document tends to get sucked into memory at once so if you deal with big enough documents you could run out of memory.
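If the messages are small and arrive as complete strings, there is a fourth, cruder option: plain string concatenation. A minimal sketch, where Envelope stands in for whatever class Castor generated for your common root element (and assuming Castor's static Unmarshaller.unmarshal(Class, Reader)):

import org.exolab.castor.xml.Unmarshaller;

import java.io.StringReader;

class WrappingReader {
    static class Envelope { }   // stub standing in for the Castor-generated wrapper class

    static Envelope read(String incomingXml) throws Exception {
        // Strip an XML declaration if one is present, then wrap in the common root element.
        String body = incomingXml.replaceFirst("^<\\?xml[^>]*\\?>", "");
        String wrapped = "<envelope>" + body + "</envelope>";
        return (Envelope) Unmarshaller.unmarshal(Envelope.class, new StringReader(wrapped));
    }
}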
I have a similar thing in JiBX, where all of the incoming message objects implement a base interface that has a field denoting the message type.
The text/XML is unmarshalled into the base interface, and I then use the command pattern to call the respective business logic depending on the message type defined in the base interface.
Not sure if this is possible using Castor, but take a look at JiBX, as the performance is fantastic.
http://jibx.sourceforge.net/
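For what it's worth, the command-pattern dispatch described above boils down to a map from message type to handler; a bare-bones sketch (all names made up):

import java.util.HashMap;
import java.util.Map;

// Base interface every incoming message maps to; it carries the type discriminator.
interface BaseMessage {
    String messageType();
}

interface MessageHandler {
    void handle(BaseMessage message);
}

class MessageDispatcher {
    private final Map<String, MessageHandler> handlers = new HashMap<>();

    void register(String messageType, MessageHandler handler) {
        handlers.put(messageType, handler);
    }

    // Called after JiBX (or Castor) has produced the object from the XML.
    void dispatch(BaseMessage message) {
        MessageHandler handler = handlers.get(message.messageType());
        if (handler == null) {
            throw new IllegalArgumentException("no handler for " + message.messageType());
        }
        handler.handle(message);
    }
}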
I appreciate your insights. You both have given me some good information to go on and new knowledge that I didn't have. In the end, I got the process to work via a hack. I grab the text stream, parse out the root tag of the message, and then switch on it to determine the right object to create. I'm unmarshalling all of my objects independently and everyone is happy on our end.
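For reference, the root-tag sniffing part of that hack can be done with plain StAX rather than hand parsing; a minimal sketch:

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

import java.io.StringReader;

class RootTagSniffer {
    // Peek at the root element name without parsing the whole document,
    // then switch on it to pick the right unmarshalling target.
    static String rootElement(String xml) throws Exception {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            reader.nextTag();          // advances to the first START_ELEMENT
            return reader.getLocalName();
        } finally {
            reader.close();
        }
    }
}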