Soap body is utf-8 encoded twice - java

We use a web service which expects UTF-8. The framework we use on the client is Apache Axis2. We call the web service and the SOAP body contains strings in UTF-8. The problem is that the body seems to be "double encoded". For example, take the character 'å'. Its UTF-8 representation is C3 A5, but our logs show that the (double) encoded value sent is C3 83 C2 A5.
Has anyone experienced similar problems?

It's not entirely clear how you're calling the web service. Does the method in the web service just take a string? If so, what does your string look like in Java? All strings in Java are UTF-16 encoded - if you're converting the UTF-8 binary representation into a string by taking each byte and turning it into a character, then that's the problem.
If you could show what the method you're calling looks like, and how you're calling it, that would help a lot.
For what it's worth, I've used Axis with non-ASCII strings with no problem in the past. I strongly suspect this is a problem with how you're using it rather than with Axis itself, although I'm willing to be proved wrong :)
EDIT: Based on your comment, it sounds like you've got problems receiving the HTML form data, before you hit the web service. If the user has typed "å" into the form, then that's what you should see when you debug in Eclipse. If you're putting bad data into your web service, it's no wonder you're getting bad data out at the other end. I suggest you run Wireshark to see exactly what the browser is sending you, both in terms of the raw bytes and also what character encoding it's specifying. My guess is that your web server is treating it as ISO-8859-1 but it's actually UTF-8.
Once you've got the string correctly from the form, I suspect you'll find there are no problems at all in passing it on to the web service.
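The byte pattern from the question can be reproduced in a few lines, which may help confirm the diagnosis. This is only a sketch: the class and method names are made up for illustration, and it assumes the faulty step is a UTF-8 byte sequence being read as ISO-8859-1 (one character per byte) before being encoded to UTF-8 a second time.

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {

    // Formats bytes as space-separated upper-case hex, e.g. "C3 A5".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Correct path: the Java string is encoded to UTF-8 exactly once.
    static byte[] encodeOnce(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Faulty path (assumed cause): the UTF-8 bytes are misread as ISO-8859-1,
    // giving one char per byte, and that string is encoded to UTF-8 again.
    static byte[] encodeTwice(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u00E5"; // the character from the question, written as an escape
        System.out.println(hex(encodeOnce(text)));  // C3 A5
        System.out.println(hex(encodeTwice(text))); // C3 83 C2 A5
    }
}
```

If the string already shows two garbage characters in the debugger before it reaches Axis, the second encoding step is doing its job correctly and the misread happened earlier.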

Related

Java changes Cyrillic to unicode like \uXXXX

I am making an application in Java that will log into my school diary using its web API, so that I can make my own UI. As the title says, at some point Java changes the Cyrillic to Unicode escapes like \uXXXX. Here is the code on the Russian Stack Overflow: https://ru.stackoverflow.com/questions/1452959/%d0%a1%d0%b5%d1%80%d0%b2%d0%b5%d1%80-%d0%be%d1%82%d0%b2%d0%b5%d1%80%d0%b3%d0%b0%d0%b5%d1%82-%d0%b7%d0%b0%d0%bf%d1%80%d0%be%d1%81. Try translating it to understand more. When I send my request to https://httpbin.org/post instead of my LOGIN_URL with Cyrillic symbols, it returns them transformed; if I send a request with ASCII symbols, I get them back unchanged. In the linked post I mentioned a Python project which does exactly what I want, and when I modify it to send its request to httpbin, the Cyrillic symbols are returned intact! What do I do to fix my Java code? P.S. I have currently switched to okhttp3 from Apache HttpClient (same problem), but I can go back.
Well, I solved my problem. It consisted not in the character encoding, but in the absence of two http headers, namely
httpPost.setHeader("X-Requested-With", "XMLHttpRequest");
httpPost.setHeader("Referer", Constants.BASE_URL);
(added to login request)

Encoding goes wrong in the transport of a SOAP message

Context
I have a SOAP web service which is served by a JBOSS EAP instance and is called via a SOAP UI client.
In the result returned by this web service there may be an XML string returned like this by the web service:
The same string will be rendered as follows in the SOAP UI client:
As you can observe, during the transport of this message some characters (specifically <) have been encoded to &lt;: this is normal, as the encoder wants to avoid the string being interpreted as markup when it is just output to be returned as is.
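For reference, the expected escaping can be seen with the StAX classes that ship with the JDK. This is only an illustrative sketch: the <result> wrapper element and the class and method names are hypothetical, and it says nothing about how JBoss itself serializes the response.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class XmlEscapeDemo {

    // Writes the payload as character data inside a hypothetical <result>
    // element; the writer escapes markup characters such as '<'.
    static String wrap(String payload) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("result");
            writer.writeCharacters(payload);
            writer.writeEndElement();
            writer.flush();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses the wrapped document and returns the unescaped text content.
    static String unwrap(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            reader.nextTag(); // move to the <result> start tag
            String text = reader.getElementText();
            reader.close();
            return text;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String payload = "<calculationPeriod>...some stuff</calculationPeriod>";
        String wrapped = wrap(payload);
        System.out.println(wrapped);                          // every '<' of the payload appears as &lt;
        System.out.println(unwrap(wrapped).equals(payload));  // true
    }
}
```

A correctly escaped message round-trips to the original string regardless of its length, so any length-dependent difference points at a layer between the serializer and the client.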
Problem
What we have observed is that when the string is too long, the encoding simply goes wrong. I've tried to analyze and understand it, and this is all I could find:
Towards the end of the string, some < characters are left as such and are not converted into &lt;
Very weirdly, an XML tag that is well formed on the server side:
<calculationPeriod>
...some stuff
</calculationPeriod>
... has its second c converted into <, which completely breaks the XML:
<cal<ulationPeriod>
...some stuff
</calculationPeriod>
My question
Honestly, I have no idea how to debug this issue further. All I can see is that:
When inside the web service (stack that I control), the response is normally formed and encoded in XML using the open tag <.
Once out in the SOAP UI client (all across the stack there are generic JBOSS calls and RMI invocations), the message gets corrupted like this.
It is important to remark that this only happens when the string is particularly long. I have one output with length 8192 characters (before encoding) that goes fine, while the other output having length 9567 characters (before encoding) goes wrong and is the subject of this question.
Apologies :)
I'm sorry that I can't provide a reproducible use case, and that the title says everything and nothing about the question.
I'm open to providing any additional information to those who may help, and to rephrasing the question once I get a clearer picture of what the problem is.
I have of course searched the web a lot but can't find anything similar; I am probably not searching with the right keywords.

Issue with soap call and content length

I have written a Java Swing app that sends SOAP requests based on this code here
Overall it is working great; however, I have just started testing it with Chinese characters in the soap:Body, and this causes an error where I get a 400 response from the web server:
s.AddParameter("xml", "班");
Using Wireshark I eventually tracked it down to the constructed Content-Length value being incorrect when sending these Chinese characters (and, I assume, any other multibyte character).
I have proven this by overriding the content length generation, simply changing the code to this:
out.println("Content-Length: " + String.valueOf(postData.length()+2));
Obviously this is not a solution, as it only covers my very isolated test case of sending a single character, but I believe the issue is that postData.length() is calculated first, and then on posting the data my 班 character is converted to \347\217\255, throwing the Content-Length off and causing the request to fail.
So I am asking for advice on how to resolve this issue.
Is it possible for me to encode the value first, obtain the content length, and suppress the encoding on the post? I am unsure what is actually encoding it; the PrintWriter, I assume?
Regards.
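One way to do what the question asks is to encode the body to bytes first and take the length of that byte array, since Content-Length counts bytes while String.length() counts UTF-16 code units. The sketch below assumes the server expects UTF-8 (which matches the \347\217\255 bytes seen on the wire); the class and method names are hypothetical and not part of the code the question is based on.

```java
import java.nio.charset.StandardCharsets;

public class ContentLengthDemo {

    // Encodes the request body once, up front, so the bytes that are
    // counted are exactly the bytes that will be sent.
    static byte[] encodeBody(String postData) {
        return postData.getBytes(StandardCharsets.UTF_8);
    }

    // Builds the header from the encoded length, not from String.length().
    static String contentLengthHeader(byte[] encodedBody) {
        return "Content-Length: " + encodedBody.length;
    }

    public static void main(String[] args) {
        String postData = "\u73ED"; // the single Chinese character from the question
        byte[] payload = encodeBody(postData);

        System.out.println(postData.length());            // 1
        System.out.println(contentLengthHeader(payload)); // Content-Length: 3
    }
}
```

To keep the header and the body consistent, the same byte array would then be written to the socket's OutputStream directly rather than passed through a PrintWriter, which would encode the string again with whatever charset it was constructed with.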

Does javascript by default have support for UTF-8

I want to know whether JavaScript is UTF-8 compliant by default. If not, how can it be made UTF-8 compliant?
As shown in the screenshot, I entered Greek characters (αβγδεζηθ) into an input of type email. They display correctly when entered, but after the request is sent to Java to save the value in the database, the field shows unknown characters.
Is this a JavaScript issue or a Java issue?
JavaScript supports Unicode.
Proof:
゚ω゚ノ=/`m´)ノ~┻━┻//*´∇`*/['_'];o=(゚ー゚)=_=3;c=(゚Θ゚)=(゚ー゚)-(゚ー゚);(゚Д゚)=(゚Θ゚)=(o^_^o)/(o^_^o);(゚Д゚)={゚Θ゚:'_',゚ω゚ノ:((゚ω゚ノ==3)+'_')[゚Θ゚],゚ー゚ノ:(゚ω゚ノ+'_')[o^_^o-(゚Θ゚)],゚Д゚ノ:((゚ー゚==3)+'_')[゚ー゚]};(゚Д゚)[゚Θ゚]=((゚ω゚ノ==3)+'_')[c^_^o];(゚Д゚)['c']=((゚Д゚)+'_')[(゚ー゚)+(゚ー゚)-(゚Θ゚)];(゚Д゚)['o']=((゚Д゚)+'_')[゚Θ゚];(゚o゚)=(゚Д゚)['c']+(゚Д゚)['o']+(゚ω゚ノ+'_')[゚Θ゚]+((゚ω゚ノ==3)+'_')[゚ー゚]+((゚Д゚)+'_')[(゚ー゚)+(゚ー゚)]+((゚ー゚==3)+'_')[゚Θ゚]+((゚ー゚==3)+'_')[(゚ー゚)-(゚Θ゚)]+(゚Д゚)['c']+((゚Д゚)+'_')[(゚ー゚)+(゚ー゚)]+(゚Д゚)['o']+((゚ー゚==3)+'_')[゚Θ゚];(゚Д゚)['_']=(o^_^o)[゚o゚][゚o゚];(゚ε゚)=((゚ー゚==3)+'_')[゚Θ゚]+(゚Д゚).゚Д゚ノ+((゚Д゚)+'_')[(゚ー゚)+(゚ー゚)]+((゚ー゚==3)+'_')[o^_^o-゚Θ゚]+((゚ー゚==3)+'_')[゚Θ゚]+(゚ω゚ノ+'_')[゚Θ゚];(゚ー゚)+=(゚Θ゚);(゚Д゚)[゚ε゚]='\\';(゚Д゚).゚Θ゚ノ=(゚Д゚+゚ー゚)[o^_^o-(゚Θ゚)];(o゚ー゚o)=(゚ω゚ノ+'_')[c^_^o];(゚Д゚)[゚o゚]='\"';(゚Д゚)['_']((゚Д゚)['_'](゚ε゚+(゚Д゚)[゚o゚]+(゚Д゚)[゚ε゚]+(゚Θ゚)+(゚ー゚)+(゚Θ゚)+(゚Д゚)[゚ε゚]+(゚Θ゚)+((゚ー゚)+(゚Θ゚))+(゚ー゚)+(゚Д゚)[゚ε゚]+(゚Θ゚)+(゚ー゚)+((゚ー゚)+(゚Θ゚))+(゚Д゚)[゚ε゚]+(゚Θ゚)+((o^_^o)+(o^_^o))+((o^_^o)-(゚Θ゚))+(゚Д゚)[゚ε゚]+(゚Θ゚)+((o^_^o)+(o^_^o))+(゚ー゚)+(゚Д゚)[゚ε゚]+((゚ー゚)+(゚Θ゚))+(c^_^o)+(゚Д゚)[゚ε゚]+(゚ー゚)+((o^_^o)-(゚Θ゚))+(゚Д゚)[゚ε゚]+(゚Θ゚)+(゚Θ゚)+(c^_^o)+(゚Д゚)[゚ε゚]+(゚Θ゚)+(゚ー゚)+((゚ー゚)+(゚Θ゚))+(゚Д゚)[゚ε゚]+(゚Θ゚)+((゚ー゚)+(゚Θ゚))+(゚ー゚)+(゚Д゚)[゚ε゚]+(゚Θ゚)+((゚ー゚)+(゚Θ゚))+(゚ー゚)+(゚Д゚)[゚ε゚]+(゚Θ゚)+((゚ー゚)+(゚Θ゚))+((゚ー゚)+(o^_^o))+(゚Д゚)[゚ε゚]+(゚ー゚)+(c^_^o)+(゚Д゚)[゚ε゚]+(゚Θ゚)+((o^_^o)-(゚Θ゚))+((゚ー゚)+(o^_^o))+(゚Д゚)[゚ε゚]+(゚Θ゚)+((゚ー゚)+(゚Θ゚))+((゚ー゚)+(o^_^o))+(゚Д゚)[゚ε゚]+(゚Θ゚)+((o^_^o)+(o^_^o))+((o^_^o)-(゚Θ゚))+(゚Д゚)[゚ε゚]+(゚Θ゚)+((゚ー゚)+(゚Θ゚))+(゚ー゚)+(゚Д゚)[゚ε゚]+(゚Θ゚)+(゚ー゚)+(゚ー゚)+(゚Д゚)[゚ε゚]+(゚ー゚)+((o^_^o)-(゚Θ゚))+(゚Д゚)[゚ε゚]+((゚ー゚)+(゚Θ゚))+(゚Θ゚)+(゚Д゚)[゚o゚])(゚Θ゚))('_');
JsFiddle
The above code proves that JS does indeed support Unicode. It alerts "Hello World" and is taken from here
So it is not JS's fault. The fault lies somewhere else, or the system is unable to display those characters.
JavaScript is definitely UTF-8 compliant.
However, there are many points of failure for UTF-8. You will need to make sure that your server-side code can handle UTF-8. Then check that your database columns use UTF-8 as their character set. Then ensure that the connection between your server-side code and your database is using UTF-8.
The "conversion" of characters to ?s suggests to me that it's the database that's missing encoding information.
JavaScript (the language) comes with Unicode support (all strings are inherently Unicode), but not with default UTF-8 support. UTF-8 handling is part of the APIs and their implementation. For example, in a browser the DOM does support it.
When you are sending data from a browser to a server, a) you need to set the correct headers b) send the correct data and the browser will do as expected. Then, c) the server needs to understand this request and use the correct encoding for deserialising the data. If something does not work, ask a question that shows your code.

Handle multiple language encoding

In my application, I read tweets from Twitter, but the tweets are not language restricted. So when I try to send a response for a Chinese/Japanese tweet, the content is not displayed correctly. I currently set
response.setContentType("text/html;charset=UTF-8");
before sending the response.
How can we handle multiple languages?
I can see the message sent:
{"lastPost":{"lastUpdate":"毋成金口","pubDate":"Fri Aug 12 00:39:09 UTC 2011","message_id":101814948329562112}
This is a JSON string that is added to the response.
On my client (an iPhone), the lastPost is displayed as "????"
Telling the browser that the page is UTF-8 is a good thing, but useless unless you make sure that you are actually writing only UTF-8 in the page.
To make sure this happens:
1. Whenever you read, from Twitter or whatever, always require UTF-8 data, and make sure you are receiving UTF-8 bytes.
2. When you create a string from raw bytes, Java by default uses the "platform default encoding", which could be anything. Bytes-to-string conversion happens when creating a new String from a byte array or when using a Reader. Both of these allow you to explicitly define the encoding you expect the bytes to be in. Once point 1 is checked and you are receiving UTF-8 bytes, make sure that everywhere in your application you specify UTF-8 when converting bytes to strings.
3. When using a Writer to convert strings to bytes, sent for example to the browser (the servlet writer), the same rules apply: be explicit and always specify UTF-8.
4. If you store stuff in databases, then you have two encoding problems. The first is which encoding your database uses when talking to your application (connection encoding); the second is which encoding the database actually stores strings in (storage encoding). Usually, you can specify only the connection encoding from Java, while the storage encoding is specified in the database when it is created (search for "collation" if you are using MySQL).
Detecting where a string that is supposed to be UTF-8 gets re-encoded badly is a hard task. 99% of the time, it is being converted to ISO-Latin or a similar encoding somewhere, which causes special characters like à or ì to appear as two characters of garbage. Often debugging is the only way to find out where this happens.
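The advice about Readers and Writers can be sketched as follows. This is a minimal illustration using in-memory streams, with made-up class and method names; it is not the asker's code. It also shows one way question marks can arise: characters that a charset such as ISO-8859-1 cannot represent are replaced with '?' when written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    // String -> bytes through a Writer with an explicitly chosen charset.
    static byte[] toBytes(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Bytes -> String through a Reader with an explicitly chosen charset.
    static String fromBytes(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u91D1\u53E3"; // two CJK characters, written as escapes

        // UTF-8 on both sides: the text survives the round trip.
        String ok = fromBytes(toBytes(text, StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println(ok.equals(text)); // true

        // ISO-8859-1 cannot represent these characters, so each becomes '?'.
        String lossy = fromBytes(toBytes(text, StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1);
        System.out.println(lossy); // ??
    }
}
```

Passing the charset at every conversion point keeps the behaviour independent of the platform default encoding, which is the point the list above makes.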
the problem was with the client encoding.. it was set to ISO-
