I'm using JSON to pass data between the browser and a Java server.
I'm using Json-lib to convert between Java objects and JSON.
I'd like to strip out suspicious-looking stuff (e.g. "doSomethingNasty()") from the user input while converting from JSON to Java.
I can imagine several points at which I could do this:
1. I could examine the raw JSON string and strip out funny-looking stuff.
2. I could look for a way to intercept every JSON value on its way into the Java object, and look for funny stuff there.
3. I could traverse my new Java objects immediately after reconstitution from JSON, look for any fields that are Strings, and strip stuff out there.
What's the best approach? Are there any technologies built for this task that I can tack onto what I have already?
I suggest approach 3: traverse the reconstructed Java objects immediately upon arrival, and before any other logic can act on them. Build the most restrictive validation you can get away with (that is, do white-listing).
You can probably do this in a single depth-first traversal of the object hierarchy that you retrieve from Json-lib. In general, you will not want to accept any JSON functions in your objects, and you will want to make sure that all values (numbers, strings, depth of object tree, ...) are within expected ranges. Yes, it is a huge hassle to do this well, but believe me, the alternative to good user-input validation is much, much worse. It may be a good idea to add logging for whenever you chop things out, to diagnose any possible bugs in your validation code.
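A minimal sketch of that depth-first traversal might look like the following. Json-lib's JSONObject and JSONArray behave like Map and List, so plain Map/List/String/Number types are used here for illustration; the whitelist pattern and depth limit are assumptions you would tighten to whatever your application actually expects.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Depth-first, whitelist-based validation of a reconstituted object graph.
public class InputValidator {

    // Assumed whitelist: letters, digits, spaces, and basic punctuation only.
    private static final Pattern SAFE_STRING =
            Pattern.compile("[\\p{Alnum} .,!?'-]*");
    private static final int MAX_DEPTH = 10;

    public static boolean isValid(Object node) {
        return isValid(node, 0);
    }

    private static boolean isValid(Object node, int depth) {
        if (depth > MAX_DEPTH) {
            return false; // reject suspiciously deep structures
        }
        if (node == null || node instanceof Number || node instanceof Boolean) {
            return true;
        }
        if (node instanceof String) {
            return SAFE_STRING.matcher((String) node).matches();
        }
        if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                if (!isValid(e.getKey(), depth + 1)
                        || !isValid(e.getValue(), depth + 1)) {
                    return false;
                }
            }
            return true;
        }
        if (node instanceof List) {
            for (Object item : (List<?>) node) {
                if (!isValid(item, depth + 1)) {
                    return false;
                }
            }
            return true;
        }
        return false; // unknown node type: reject, since we are whitelisting
    }
}
```

In a real application you would return (or log) *what* failed rather than a bare boolean, per the suggestion above about logging whatever gets chopped out.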
As I understand it, you need to validate the JSON data coming into your application.
If you want to do whitelisting ("you know the data you expect, and nothing else is valid"), then it makes sense to validate your Java objects once they are created (make sure not to send the Java object to the DB or back to the UI in any way before validation is done).
If you want to do blacklisting of characters (you know some of the threat characters you want to avoid), then you can look directly at the JSON string, as this validation would not change much over time, and even if it does, you only need to enhance one common place. For whitelisting, it would depend on your business logic.
Related
I know deserialization can be vulnerable when an object is serialized with the standard "Serializable" interface (refer to this). But does this vulnerability apply when an object is serialized to XML or JSON? And if it does, how does that happen?
I can't really see how that could happen, so I would appreciate some examples.
Thanks in advance.
That quite specifically depends on the serialization library that you use to deserialize objects and often the parameters used, so it's hard to provide a single answer.
As to "is it possible", yes, it's possible. Here's a sample exploit for XStream, for example:
http://blog.diniscruz.com/2013/12/xstream-remote-code-execution-exploit.html
A general chat around the topic follows:
A good defence against bad data is to use a serialisation technology that allows one to write a full specification. By full specification, I mean not only the structure / content of objects, but that every value field's valid range can be specified, and every list/array length specified.
There are not many that do this. ASN.1, XSD (XML), and AFAIK JSON Schema can all have value and size constraints. Interestingly, there is a formally defined translation between ASN.1 and XSD schemas.
It's then down to whether or not the tools you use actually do anything with these. Most of the ASN.1 tools I've seen do this very well, and will also tell you if you're trying to serialise an object that doesn't conform to the schema. The idea is that bad data is rejected as it's read (so you never get an invalid object in memory) and you can never accidentally send / write bad data yourself, even if you wanted to.
Some XSD tools do constraints checking. I think xsd2code++ does. AFAIK xsd.exe from Microsoft does not.
I'm not so familiar with the land of JSON, but as far as I can tell one tends to read in whole objects and then compare them to the schema (which strikes me as being "too late"), rather than have some autogenerated code read the data and check it for you as it does so. When serialising objects it's up to the programmer to compare the result to the schema.
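For reference, the kind of value and size constraints discussed above can be expressed in a JSON Schema roughly like this (an illustrative sketch; the property names are made up):

```json
{
  "type": "object",
  "properties": {
    "age":  { "type": "integer", "minimum": 0, "maximum": 150 },
    "name": { "type": "string", "maxLength": 100 },
    "tags": {
      "type": "array",
      "items": { "type": "string", "maxLength": 30 },
      "maxItems": 10
    }
  },
  "required": ["name"]
}
```

Whether such a schema is enforced while reading, or only checked after the whole object is in memory, depends on the validator you pair with it.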
In contrast, technologies like Google Protocol Buffers don't let you do constraints checking at all. With GPB the best you can do is comment the .proto file and hope developers read it.
The code first approach, directly writing serialisable classes in C# / Java can do constraints checks, but only if you write the code yourself.
Useful Old Technology
Of all the serialisations I've ever used, by far the most rigorous has been ASN.1 (using decent ASN.1 tools). It's old and very telecommunications-ish (late 1980s, from the ITU; if you have trouble sleeping, go read one of their standards). However, despite its age it's still bang up to date, continually evolving.
For example, since its original days it has grown several surprisingly modern wire formats: XML and JSON. Yes, that's right; you can have an ASN.1 schema that gets compiled to code (C++, Java, C#) that will serialise to XML or JSON data formats (as well as ASN.1's more traditional binary formats like BER, uPER, etc).
The constraints rigour and the data format flexibility is surprisingly useful; you can receive some ultra-compact bit encoded uPER message from a radio, have it constraints checked as you read it, and then pass it on elsewhere as JSON/XML, all without having to write any code by hand.
When it comes to complex systems integration problems, I've not found anything to beat it.
I am making an auto chat client like Cleverbot for school. I have everything working, but I need a way to make a knowledge base of responses. I was going to make a matrix with all the responses that I need the bot to say, but I think it would be hard to edit the code every time I want to add a response to the bot. This is the code that I have for the knowledge base matrix:
`String[][] Database={
{"hi","hello","howdy","hey"},//possible user input
{"hi","hello","hey"},//response
{"how are you","how r u","how r you","how are u"},
{"good","doing well"}};`
How would I make a matrix like this from a text file? Is there a better way than reading from a text file to deal with this?
You could...
Use a properties file
The properties file is something that can easily be read into (and written out from, though you're not interested in that) Java. The class java.util.Properties makes this easier: it's fairly simple to load the file, and then you access it like a Map.
hello.input=hi,hello,howdy,hey
hello.output=hi,hello,hey
Note the matching formats there. This has its own set of problems and challenges to work with, but it lets you easily pull things in to and out of properties files.
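A minimal sketch of that approach follows. It loads the two properties shown above (from an inline string here; in a real application you would use a FileReader) and splits the comma-separated values into arrays:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Load input/output word lists from a properties file and split them
// into arrays on the comma, matching the hello.input / hello.output format.
public class ResponseStore {

    public static String[] inputsFor(Properties props, String topic) {
        return props.getProperty(topic + ".input", "").split(",");
    }

    public static String[] outputsFor(Properties props, String topic) {
        return props.getProperty(topic + ".output", "").split(",");
    }

    public static void main(String[] args) throws IOException {
        String data = "hello.input=hi,hello,howdy,hey\n"
                    + "hello.output=hi,hello,hey\n";
        Properties props = new Properties();
        props.load(new StringReader(data));

        String[] inputs = inputsFor(props, "hello");
        String[] outputs = outputsFor(props, "hello");
        System.out.println(inputs.length + " inputs, " + outputs.length + " outputs");
        // prints "4 inputs, 3 outputs"
    }
}
```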
Store it in JSON
Lots of things use JSON for a serialization format. And thus, there are lots of libraries that you can use to read and store from it. It would again make some things easier and have its own set of challenges.
{
"greeting":{
"input":["hi","hello","howdy","hey"],
"output":["hi","hello","hey"]
}
}
Something like that. And then again, you read this and store it into your data structures. You could store JSON in a number of places such as document databases (like couch) which would make for easy updates, changes, and access... given you're running that database.
Which brings us to...
Embedded databases
There are lots of databases that you can stick right in your application and access like a database: nice queries, proper relationships between objects. There are lots of advantages to using a database when you actually want a database, rather than cobbling strings together and doing all the work yourself.
Custom serialization
You could create a class (instead of a 2d array) and then store the data in a class (in which it might be a 2d array, but that's an implementation detail). At this point, you could implement Serializable and write the writeObject and readObject methods and store the data somehow in a file which you could then read back into the object directly. If you have the administration ability of adding new things as part of this application (or another that uses the same class) you could forgo the human readable aspect of it and use the admin tool (that you write) to update the object.
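A minimal sketch of that idea, using default serialization (writeObject/readObject can be overridden later if a custom on-disk layout is needed; the class and method names are illustrative):

```java
import java.io.*;

// The knowledge base lives in a Serializable class (backed by a 2D array
// as an implementation detail), so it can be written to and read from a
// file directly.
public class KnowledgeBase implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String[][] entries;

    public KnowledgeBase(String[][] entries) {
        this.entries = entries;
    }

    public String[][] getEntries() {
        return entries;
    }

    public void save(File file) throws IOException {
        try (ObjectOutputStream out =
                new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(this);
        }
    }

    public static KnowledgeBase load(File file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(new FileInputStream(file))) {
            return (KnowledgeBase) in.readObject();
        }
    }
}
```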
Lots of others
This is just the tip of the iceberg. There are lots of ways to go about this.
P.S.
Please change the name of the variable from Database to something in lower case that better describes it, such as input2output or the like. Capitalized names are typically reserved for class names (unless it's all upper case, in which case it's a final static field).
A common solution would be to dump the data in to a properties file, and then load it with the standard Properties.load(...) method.
Once you have your data like that, you can then access the data by a map-like interface.
You could find different ways of storing the data in the file like:
userinput=hi,hello,howdy,hey
response=hi,hello,hey
...
Then, when you read the file, you can split the values on the comma:
String[] expectHello = properties.getProperty("userinput").split(",");
Any advice on how to support repeated messages? Specifically, if these messages are all one type. In JSON, these would essentially be an array. In my case, I do not care about indexing, but that is not to say an array type would not be useful for protobuf. I have considered the approaches below, but I don't like the trade-offs. It isn't clear from reading the Google documentation which approach is meant to be used for collections.
Use any existing message and just have a bunch of empty fields
You can use an existing type and include only the desired collection of repeated messages. So if a user message type has a repeated photo message type, send an empty user with nothing but the photo collection field.
Create a wrapper type
This is what #1 does, but instead of using an existing type, create a new one. This is a little cleaner because it is explicit and doesn't use empty fields, and it still has message typing. In the photo case, this would be an ArrayOfPhotos message with only a repeated photo field.
Use delimited stream
Not too sure about this method as I haven't tried it, but protobuf supports delimited streams. This seems cool, but I would imagine it has the downside of weaker typing: streams could contain a grab bag of different message types.
It does seem beneficial, though, that this option requires no extra message types.
In the photo case, this would be delimited photo messages, but again, it seems like you could throw user messages in as well.
It sounds like you're trying to ask what to do when your top-level data is an array rather than a record. (It isn't totally clear from your question whether you're asking about the top level, but otherwise I don't understand the problem.)
The questions to ask yourself are:
Is there any chance that some day you'll want to add some auxiliary data to this message which is not attached to any one of the objects? For instance, maybe your list of photos will some day have an album name attached. In this case, you certainly want to use your solution #2, since it gives you the flexibility to add other fields later without messing up some existing message type.
Will it be a problem for the client or the server to hold the entire data set in memory and parse/serialize it all at once? (That is unavoidable for a single message.) For example, if you're sending 1GB of photos, you probably want each end to be able to handle one or just a few photos at a time. In this case you certainly want solution #3.
I would not advise using solution #1 in any case.
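For the record, the wrapper type of solution #2 might look something like this (the message and field names are illustrative, not from the question):

```proto
syntax = "proto2";

message Photo {
  optional string url = 1;
  optional int32 width = 2;
  optional int32 height = 3;
}

// The wrapper type from solution #2: nothing but the repeated field,
// with room to add album-level fields (e.g. an album name) later
// without disturbing existing message types.
message PhotoSet {
  repeated Photo photo = 1;
}
```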
I am writing an application that stores references for books, journals, sites, and so on. I have already done most of it.
What I want is a suggestion regarding what is the best way to implement above specs?
What format of text file should I use to store the library? Not the file type, but the format. I am using a simple text file at the moment, but I am planning to implement a format like the one below.
<book><Art of Human Hacking><New York><2011><1>
<journal><Achieving Maximum Speed In 802.11n><London><2009><260-265>
The first tags, <book> and <journal>, are type identifiers. I have used an ArrayList. Should I use a multi-dimensional ArrayList and store each item like below?
[[Art of Human Hacking,New York,2011,1][Achieving Maximum Speed In 802.11n,London,2009,260-265]]
I have used StringTokenizer, but I cannot handle values that contain spaces. How do I fix this?
I have already implemented all the features, including listing all, listing unique, searching, editing, removing, and adding, but everything only works for content without spaces.
You should use an existing serializer instead of writing your own, unless the project forbids it.
For compatibility and human readability, CSV would be your best bet. Use an existing CSV parser to get your escaping correct (it's not that hard to write yourself, but difficult enough to warrant using an existing parser to avoid bugs). Google turned up: http://sourceforge.net/projects/javacsv/
If human editing is not a priority, then JSON is also a good format (it is human readable, but not quite as simple as CSV, and won't display in Excel, for instance).
Then you have binary protocols, such as native Java serialization, or even Google protocol buffers. These may be faster, but obviously lose the ability to view the data store and would probably complicate your debugging.
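To show why escaping is the fiddly part of CSV (and why an existing parser is worth using), a minimal sketch of the RFC 4180-style quoting rule for writing one record might look like this; the class and method names are illustrative:

```java
// Fields containing a comma, quote, or newline are wrapped in quotes,
// and embedded quotes are doubled. A real CSV library also handles the
// reading side and its edge cases.
public class CsvEscape {

    public static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static String toRecord(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(escape(fields[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A book reference from the question, written as one CSV record.
        System.out.println(
            toRecord("book", "Art of Human Hacking", "New York", "2011", "1"));
        // prints "book,Art of Human Hacking,New York,2011,1"
    }
}
```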
I'm developing a Java package that makes basic HTTP requests (GET, POST, PUT, and DELETE). Right now, I'm having it just print the output of the request. I would like to store it in a field, but I'm not sure if String supports large amounts of text. Is there a data type for large amounts of text, or is there a reasonable alternative to it? Right now, because I'm just printing it, I can't do anything with the data that is returned (like parse it, if it's JSON).
Any ideas would be helpful.
Edit: The code is online on GitHub.
Strings can hold up to 2^31 - 1 characters, so I suspect they are big enough (data from this SO question).
I see that you use a BufferedReader in your code. You can leave the data in the reader and pass that reader to your JSON parser, for instance. That would be more efficient than first creating a String out of it.
If you are performing a single set of operations on the data, you can stream it through a pipeline and not even store the entire data in memory at any time. It can also boost performance as work can begin upon the first character rather than after the last is received. Check out CharSequence.