thrift character encoding, perl to java

thrift character encoding, perl to java - java

I have a complex situation that I'm trying to deal with involving character encoding.
I have a perl program which is communicating with a java endpoint via thrift, the java is then using the data to make a request to a legacy php service. It's ugly, but part of a migration plan so needs to work for a short while.
In perl a thrift object is created where some of the fields of the thrift object are json encoded strings.
The problem is that when perl makes the request to java, one of the strings is as follows (this is from data:dumper and is subsequently json encoded and added to thrift):
'offer_message' => "<<>>
&&
\x{c3}\x{82}\x{c2}\x{a9}©
<script>alert(\"XSS\");</script>
https://url.com/imghp?hl=uk",
However, when this data is received on the java side the sequence \x{c3}\x{82}\x{c2}\x{a9} has been converted so in java we receive the following:
<<>>\\n&&\\nÃ�Â�Ã�Â©©\\n<script>alert(\"XSS\");</script>\\nhttps://www.google.com.ua/imghp?hl=uk
The problem is that if I pass the second string to the legacy php program, it fails, if I pass the string taken from the dump of the perl hash, it succeeds. So my assumption is that I need to convert the received string to another encoding (correct me if I'm wrong, I'm not sure that this is the right solution).
I've tried taking the parameters received in java and converting them to every encoding I can think of, however it doesn't work. So for example:
byte[] utf8 = templateParams.getBytes("UTF8");
normallisedTemplateParams = new String(utf8, "UTF8");
I've been varying the encoding schemes in the hope I find something that works.
What is the correct way to solve this? For a short time this messy solution is my only option while other re-engineering is happening.

The problem in the end difficult to diagnose but simple to resolve. It turned out that the package I was using to convert in Java was using java's default encoding of UTF-16. I had to modify the package and force it to use UTF-8. After that, everything worked.

Related

Rules regarding auto encoding of messages into base64 while submitting to SQS

I am developing an application in which clients (written in multiple languages - Go, C++, Python, C#, Java, Perl and possibly more in the future) submit protobuf (and in some cases, JSON) messages to SQS. At the other end, the messages are read and decoded by Python and Go clients - depending on the message type. Boto seems to automatically encode the messages into base64, but other language libraries don't seem to do so. Or maybe there are some other rules?
Boto does have an option to submit raw messages.
What is the expected behavior here? Am I supposed to encode messages into base64 on my own - which makes boto an odd case - or am I missing something?
This has caused some subtle bugs in my application because an of extra layer of base64 encoding or decoding. As far as I know, there is no idiomatic way to detect whether a message is base64 encoded or not. The best option is to try to decode and see if it throws an exception - something I don't really like.
I tried to look for some documentation, but couldn't find anything with clear guidelines. Maybe I was looking at the wrong places?
Thanks in advance for any pointers.

You probably want to encode your messages as something because SQS does not accept every possible byte combination in message payload, at the API. Only valid UTF-8, tab, newline, and carriage return are supported.
Important
The following list shows the characters (in Unicode) allowed in your message, according to the W3C XML specification. For more information, go to http://www.w3.org/TR/REC-xml/#charsets If you send any characters not included in the list, your request will be rejected.
#x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF]
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_SendMessage.html
The base64 alphabet clearly falls in this range, making it impossible for a message with base64 encoding to be rejected as invalid. Of course, it also bloats your payload, since base64 expands every 3 bytes of the original message into 4 bytes of output (64 symbols limits each output byte to carrying 6 bits of usable information, 3 x 8 → 4 x 6).
Presumably boto automatically base64-encodes and decodes messages for you in order to be "helpful."
But there is no reason why base64 has to be used at all.
An example that comes to mind... valid JSON would also comply with the restricted character ranges supported by SQS payloads. (Theoretically, I guess, JSON could be argued not to be an "encoding," but that would be a bit pedantic).
There is no clean way to determine whether a message needs to be decoded more than once, other than the sketchy one you proposed, but the argument could be made that if you are in a situation where the need to decode is ambiguous, then that should be eliminated.
If boto's behavior weren't documented and there were no way to make it behave otherwise, I'd say it is wrong behavior. But, as it is, I'll have to relent a bit and say it's just unusual.

Efficient transcoding of Jackson parsed JSON

I'm using Jackson streaming API to deserialise a quite large JSON (on the order of megabytes) into POJO. It's working fine, but I'd like to optimize it (both memory and processing wise, code runs on Android).
The main problem I'd like to optimize away is converting a large number of strings from UTF-8 to ISO-8859-1. Currently I use:
String result = new String(parser.getText().getBytes("ISO-8859-1"));
As I understand it, parser originally copies token content into String (getText()), then creates a byte array from it (getBytes()), which is then used to create a final String in desired encoding. Way too much allocations and copying.
Ideal solution would be if getText() would accept the encoding parameter and just give me the final string, but that's not the case.
Any other ideas, or flaws in my thinking?

You can use:
parser.getBinaryValue() (present on version 2.4 of Jackson)
or you can implement an ObjectCodec (with a method readValue(...) that knows converting bytes to String in ISO8859-1) and set it using parser.setCodec().
If you have control over the json generation, avoid using a charset different than UTF-8.

Is it possible to serialize an object into a plain text string and send it as a string then convert it back?

As the title says. I'm sending a message, from my server, into a proxy which is outside of my control which then sends it onto my application. All I can do is send and receive strings. Is it possible to serialize to a plain string and send in this way without an input/output stream as you would normally have?
TIA
A little more info:
public class myClass implements java.io.Serializable {
int h = "ccc";
int i = "bbbb";
String myString = "aaaa";
}
I have this class, for example. Now I want to serialize it and send it as a string inside my HTTPpost and send to the proxy, can't do anything about this stage:
HttpPost post = new HttpPost("http://www.myURL.com/send.php?msg="+msg);
Then receive the msg as a string on the other side and convert it back.
Is that easily done without to many other library?

Yes.
This is done every day using JSON and XML, just to name a few formats of strings that are easily formatted and parsed. (Read about JAXRS to know about a way to use JSON formatted strings to do this and do the transfers. Or, read about JAXB which will format as XML but doesn't halp with the communication of the strings.)
You can do it in CSV format.
You can do it in fixed with fields of characters.
Morse code isn't much of a different concept only it starts with strings and converts to short and long beeps.
The way it works is this:
There is some code to which you pass an object and it returns a string in a known format.
You send the string to the other server somehow. Some ways to send strings have limits on the length.
The other server receives the string.
Using its knowledge of the format, that other server parses out the string contents and uses it.
Some notes:
If both servers use Java (or C# or Python or PHP or whatever) the formatting and parsing become symetrical. You start with a Java object of some type and end up with a Java object in the other JVM of the same type. But that is not a given. You can store values in a custom POJO in one server and a Map in the other.
If you write code to format and parse, it seems really easy as long as the contents are simple and you don't run afoul of transmission rules. For example, if you send in the query part of an HTTP get, you can't have any ampersand characters in the string.
If you use an existing library, you take advantage of everyone else's acquired knowlege of how to do this without error.
If you use a standard format for the string, it is easy to explain what's going on to someone else. If your project works, a third server might want to be in the communication loop and if it's controlled by someone else ...
Formatting is easier than parsing. There are lots of pitfalls that other people have already solved. If you are doing this to learn ways not to do things and improve your own knowledge base, by all means, do it yourself. If you want rock solid performance, use an existing and standard library and format.

Take a look at XStream. It serializes into XML, and is very simple to use.
Take a look at there Two Minute Tutorial

Yes it is possible. You can use ajax to to serialize the string to a json object and have it back to the server using an ajax.post event (javascript event).

How do I convert a Java Hashtable to an NSDictionary (obj-C)?

At the server end (GAE), I've got a java Hashtable.
At the client end (iPhone), I'm trying to create an NSDictionary.
myHashTable.toString() gets me something that looks darned-close-to-but-not-quite-the-same-as [myDictionary description]. If they were the same, I could write the string to a file and do:
NSDictionary *dict = [NSDictionary dictionaryWithContentsOfFile:tmpFile];
I could write a little parser in obj-C to deal with myHashtable.toString(), but I'm sort-of hoping that there's a shortcut already built into something, somewhere -- I just can't seem to find it.
(So, being a geek, I'll spend far longer searching the web for a shortcut than it would take me to write & debug the parser... ;)
Anyway -- hints?
Thanks!

I would convert the Hashtable into something JSON-like and take it on the iPhone side.
Hashtable.toString() is not ideal, it will have problem with spaces, comma and quotation marks.
For JSON-to-NSDictionary, you can find the json-framework tools under http://www.json.org/

As j-16 SDiZ mentioned, you need to serialize your hashtable. It can be to json, xml or some other format. Once serialized, you need to deserialize them into an NSDictionary. JSON is probably the easiest format to do this with plenty of libraries for both Objective-C and Java. http://json.org has a list of libraries.

Developing a (file) exchange format for java

I want to come up with a binary format for passing data between application instances in a form of POFs (Plain Old Files ;)).
Prerequisites:
should be cross-platform
information to be persisted includes a single POJO & arbitrary byte[]s (files actually, the POJO stores it's names in a String[])
only sequential access is required
should be a way to check data consistency
should be small and fast
should prevent an average user with archiver + notepad from modifying the data
Currently I'm using DeflaterOutputStream + OutputStreamWriter together with InflaterInputStream + InputStreamReader to save/restore objects serialized with XStream, one object per file. Readers/Writers use UTF8.
Now, need to extend this to support the previously described.
My idea of format:
{serialized to XML object}
{delimiter}
{String file name}{delimiter}{byte[] file data}
{delimiter}
{another String file name}{delimiter}{another byte[] file data}
...
{delimiter}
{delimiter}
{MD5 hash for the entire file}
Does this look sane?
What would you use for a delimiter and how would you determine it?
The right way to calculate MD5 in this case?
What would you suggest to read on the subject?
TIA.

It looks INsane.
why invent a new file format?
why try to prevent only stupid users from changing file?
why use a binary format ( hard to compress ) ?
why use a format that cannot be parsed while being received? (receiver has to receive entire file before being able to act on the file. )
XML is already a serialization format that is compressable. So you are serializing a serialized format.

Would serialization of the model (if you are into MVC) not be another way? I'd prefer to use things in the language (or standard libraries) rather then roll my own if possible. The only issue I can see with that is that the file size may be larger than you want.

1) Does this look sane?
It looks fairly sane. However, if you are going to invent your own format rather than just using Java serialization then you should have a good reason. Do you have any good reasons (they do exist in some cases)? One of the standard reasons for using XStream is to make the result human readable, which a binary format immediately loses. Do you have a good reason for a binary format rather than a human readable one? See this question for why human readable is good (and bad).
Wouldn't it be easier just to put everything in a signed jar. There are already standard Java libraries and tools to do this, and you get compression and verification provided.
2) What would you use for a delimiter and how determine it?
Rather than a delimiter I'd explicitly store the length of each block before the block. It's just as easy, and prevents you having to escape the delimiter if it comes up on its own.
3) The right way to calculate MD5 in this case?
There is example code here which looks sensible.
4) What would you suggest to read on the subject?
On the subject of serialization? I'd read about the Java serialization, JSON, and XStream serialization so I understood the pros and cons of each, especially the benefits of human readable files. I'd also look at a classic file format, for example from Microsoft, to understand possible design decisions from back in the days that every byte mattered, and how these have been extended. For example: The WAV file format.

Let's see this should be pretty straightforward.
Prerequisites:
0. should be cross-platform
1. information to be persisted includes a single POJO & arbitrary byte[]s (files actually, the POJO stores it's names in a String[])
2. only sequential access is required
3. should be a way to check data consistency
4. should be small and fast
5. should prevent an average user with archiver + notepad from modifying the data
Well guess what, you pretty much have it already, it's built-in the platform already:Object Serialization
If you need to reduce the amount of data sent in the wire and provide a custom serialization ( for instance you can sent only 1,2,3 for a given object without using the attribute name or nothing similar, and read them in the same sequence, ) you can use this somehow "Hidden feature"
If you really need it in "text plain" you can also encode it, it takes almost the same amount of bytes.
For instance this bean:
import java.io.*;
public class SimpleBean implements Serializable {
private String website = "http://stackoverflow.com";
public String toString() {
return website;
}
}
Could be represented like this:
rO0ABXNyAApTaW1wbGVCZWFuPB4W2ZRCqRICAAFMAAd3ZWJzaXRldAASTGphdmEvbGFuZy9TdHJpbmc7eHB0ABhodHRwOi8vc3RhY2tvdmVyZmxvdy5jb20=
See this answer
Additionally, if you need a sounded protocol you can also check to Protobuf, Google's internal exchange format.

You could use a zip (rar / 7z / tar.gz / ...) library. Many exists, most are well tested and it'll likely save you some time.
Possibly not as much fun though.

I agree in that it doesn't really sound like you need a new format, or a binary one.
If you truly want a binary format, why not consider one of these first:
Binary XML (fast infoset, Bnux)
Hessian
google packet buffers
But besides that, many textual formats should work just fine (or perhaps better) too; easier to debug, extensive tool support, compresses to about same size as binary (binary compresses poorly, and information theory suggests that for same effective information, same compression rate is achieved -- and this has been true in my testing).
So perhaps also consider:
Json works well; binary support via base64 (with, say, http://jackson.codehaus.org/)
XML not too bad either; efficient streaming parsers, some with base64 support (http://woodstox.codehaus.org/, "typed access API" under 'org.codehaus.stax2.typed.TypedXMLStreamReader').
So it kind of sounds like you just want to build something of your own. Nothing wrong with that, as a hobby, but if so you need to consider it as such.
It likely is not a requirement for the system you are building.

Perhaps you could explain how this is better than using an existing file format such as JAR.
Most standard files formats of this type just use CRC as its faster to calculate. MD5 is more appropriate if you want to prevent deliberate modification.

Bencode could be the way to go.
Here's an excellent implementation by Daniel Spiewak.
Unfortunately, bencode spec doesn't support utf8 which is a showstopper for me.
Might come to this later but currently xml seems like a better choice (with blobs serialized as a Map).

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.