Unicode char appears as ?? on debian


I'm parsing the JSON received from a Minecraft server's ping request. The code works fine on Windows and gives the following output:
§4§l> §f§l> §4§l> §7-=[ §5§lMythCraft §6§lNetwork §7]=- §4§l> §f§l> §4§l> §7-=[ §b§lFaction 1 Has Reset §e➸ §c§lFresh Map! §7]=-
However, on my Debian VPS the following is outputted instead:
??4??l> ??f??l> ??4??l> ??7-=[ ??5??lMythCraft ??6??lNetwork ??7]=- ??4??l> ??f??l> ??4??l> ??7-=[ ??b??lFaction 1 Has Reset ??e??? ??c??lFresh Map! ??7]=-
I would assume that this is an encoding issue. Am I correct? How can I fix it?
The ping code is here.

I figured it out.
The default charset on my machine wasn't UTF-8, so it couldn't decode the characters correctly and replaced them with a ? instead.
To fix it, I changed the definition of the String json to be this:
String json = new String(in, Charset.forName("UTF-8"));
That way the returned String is processed in UTF-8 instead of the default encoding.
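A minimal sketch of why the explicit charset matters. The class name DecodeDemo and the three sample bytes are made up for illustration; they stand in for the bytes read from the server. US-ASCII is used here as an example of a non-UTF-8 default, which is an assumption about the VPS rather than something the question confirms.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Decode raw bytes with an explicitly chosen charset instead of the platform default.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The section sign U+00A7 is two bytes in UTF-8 (0xC2 0xA7), followed by ASCII '4'.
        byte[] raw = {(byte) 0xC2, (byte) 0xA7, '4'};

        // Decoded as UTF-8, the two bytes collapse into one character: length 2.
        System.out.println(decode(raw, StandardCharsets.UTF_8).length());

        // Decoded as US-ASCII, each non-ASCII byte becomes a replacement character: length 3.
        System.out.println(decode(raw, StandardCharsets.US_ASCII).length());
    }
}
```

The two replacement characters per section sign match the doubled ?? in the Debian output. StandardCharsets.UTF_8 is equivalent to Charset.forName("UTF-8") and avoids the lookup by name.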

Related

How to get UTF-8 conversion for a string

Frédéric in Java is converted to FrÃ©dÃ©ric.
However, I need to pass the proper string to my client.
How can I achieve this in Java?
I tried:
String a = "FrÃ©dÃ©ric";
String b = new String(a.getBytes(), "UTF-8");
However, string b contains the same value as a.
I expect the string to be able to store the value Frédéric.
How do I pass this value properly to the client?

If I understand the question correctly, you're looking for a function that will repair strings that have been damaged by others' encoding mistakes?
Here's one that seems to work on the example you gave:
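static String fix(String badInput) {
    // Recover the original bytes by undoing the mistaken cp1252 decode.
    byte[] bytes = badInput.getBytes(Charset.forName("cp1252"));
    // Decode those bytes as what they really were: UTF-8.
    return new String(bytes, Charset.forName("UTF-8"));
}
fix("FrÃ©dÃ©ric").equals("Frédéric") // true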

The answer is quite complicated. See http://www.joelonsoftware.com/articles/Unicode.html for a basic understanding.
My first suggestion would be to save your Java file as UTF-8. The default for Eclipse on Windows is cp1252, which might be your problem. Hope this helps.

Find your language code here and use that.
String a = new String(yourString.getBytes(), YOUR_ENCODING);
You can also try:
String a = URLEncoder.encode(yourString, HTTP.YOUR_ENCODING);

If System.out.println("Frédéric") shows the garbled output on the console, the most likely cause is that the encoding of your source file (which seems to be UTF-8) is not the one the compiler assumes. By default javac uses the platform encoding, so probably some flavor of ISO-8859. Try compiling with javac -encoding UTF-8 (or set the corresponding property in your build environment) and you should be OK.
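One way to check whether the source-file encoding is the culprit is to write the literal with Unicode escapes, which the compiler reads the same way under any -encoding setting. The sketch below is only an illustration; the class name SourceEncodingCheck and the helper utf8Hex are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class SourceEncodingCheck {
    // Written with Unicode escapes, so the literal does not depend on the source-file encoding.
    static final String NAME = "Fr\u00e9d\u00e9ric";

    // Hex dump of the UTF-8 bytes; the output is plain ASCII, so any console can show it.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(NAME.length());  // 8 when the string is intact
        System.out.println(utf8Hex(NAME));  // each accented e shows up as C3 A9
    }
}
```

If the same dump of a literal typed directly as Frédéric shows more than 8 characters, the compiler misread the source file, and javac -encoding UTF-8 is the likely fix.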
If you are sending this to some other piece of client software it's most likely an encoding issue on the client-side.

URL decode ü -> Ã¼

I have a problem where decoding a URL produces the wrong characters. The request URL contains %C3%BC for the letter 'ü'. The server-side decoding should turn it back into ü, but instead it produces Ã¼.
Decoding is done like this:
decoded = URLDecoder.decode(value, "UTF-8");
Here value contains '%C3%BC' and decoded should contain 'ü', but that's where the problem is. What's going wrong here? I use this method in more than one application and it works fine in all other cases...

I don't have enough reputation yet to comment, so I'll have to make this as close to an answer as possible.
If you're using a servlet, and "value" is something that you got from calling getParameter() on the servlet, then it has already been decoded (rightly or wrongly) by the servlet container. (Tomcat?)
Likewise if it's part of the path. Your servlet container probably decoded it assuming that the percent-encoded bytes were ISO-8859-1, which is the default setting for Tomcat. See the documentation for the URIEncoding attribute of the Connector element in Tomcat's server.xml file, if that's the app server you're using. If you set it to UTF-8, Tomcat will assume that percent-encoded bytes represent UTF-8 text.
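If changing the connector setting is not an option, a common workaround is to undo the container's ISO-8859-1 decode by hand. This is a sketch under the assumption that the container really did use ISO-8859-1; the class and method names are hypothetical, and the string literal stands in for what getParameter() returned.

```java
import java.nio.charset.StandardCharsets;

class ParameterRepair {
    // Undo an ISO-8859-1 decode of bytes that were really UTF-8.
    // ISO-8859-1 maps every byte to exactly one char, so getting the bytes back is lossless.
    static String redecodeAsUtf8(String misdecoded) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00C3 U+00BC is what %C3%BC looks like after an ISO-8859-1 decode.
        String fromContainer = "\u00c3\u00bc";
        String repaired = redecodeAsUtf8(fromContainer);
        System.out.println(repaired.length());         // 1
        System.out.println((int) repaired.charAt(0));  // 252, i.e. U+00FC
    }
}
```

Applying this to a value that was already decoded correctly would corrupt it, so fixing URIEncoding at the source is the cleaner route.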

You are probably outputting the value wrong. First, decoded.length() (presumably 1) gives a fair indication; you could also dump the characters with Arrays.toString(decoded.toCharArray()).
In the IDE console under Windows, you could see a mess like that when the console uses a Windows single-byte ANSI encoding.
For the rest take care of:
String s;
byte[] b;
s.getBytes() -> s.getBytes(StandardCharsets.UTF_8)
s.getBytes("Cp1252") // Windows Latin-1
new String(b) -> new String(b, StandardCharsets.UTF_8)
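To tell a bad decode from a bad console, it can help to print code points instead of the characters themselves. A small sketch, with a hypothetical class name and helper:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class CodePointDump {
    // Render each char as U+XXXX so the result does not depend on the console encoding.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String decoded = URLDecoder.decode("%C3%BC", "UTF-8");
        System.out.println(decoded.length());  // 1 when the decode is correct
        System.out.println(dump(decoded));     // U+00FC when the decode is correct
    }
}
```

A dump of U+00C3 U+00BC instead would mean the string itself is wrong, not just its display.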

iOS Receipt validation IllegalArgumentException

I use code similar to what is shown in this question:
Java and AppStore receipt verification
But I still end up getting
{"status":21002, "exception":"java.lang.IllegalArgumentException"}
Could it be a problem with the Base64 encoding? Do I have to convert the Base64-encoded string into hex or something else?
What I post is similar to the following:
{"receipt-data" : "eyJzaWduYXR1cmUiOiJBbjNJVER0VVNmZWNhaGMxR.....

The problem was the Base64 encoding inside Java. When I do the encoding on the iOS side and have the server pass that value along in the request without any further encoding in Java, it works.

I had a similar problem and was receiving the java.lang.IllegalArgumentException from Apple when trying to validate a receipt on my server. The problem was that my Base64 encoding logic was inserting line breaks into the encoded string. Once I updated my code to ensure no line breaks were inserted into the encoded string, I was able to successfully verify my receipts against Apple's servers.
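For reference, a sketch of the difference using java.util.Base64 (Java 8 and later), which may not be the library either poster used. The class name ReceiptEncoding and the 200 zero bytes are placeholders for real receipt data.

```java
import java.util.Base64;

class ReceiptEncoding {
    // The basic encoder produces one continuous line with no separators.
    static String encodeSingleLine(byte[] receipt) {
        return Base64.getEncoder().encodeToString(receipt);
    }

    // The MIME encoder wraps at 76 characters with CRLF, which breaks a JSON string value.
    static String encodeMime(byte[] receipt) {
        return Base64.getMimeEncoder().encodeToString(receipt);
    }

    public static void main(String[] args) {
        byte[] placeholderReceipt = new byte[200]; // stand-in for the real receipt bytes
        System.out.println(encodeSingleLine(placeholderReceipt).contains("\n")); // false
        System.out.println(encodeMime(placeholderReceipt).contains("\r\n"));     // true
    }
}
```

Other Base64 libraries often have a similar line-wrapping option that is on by default, so it is worth checking the encoder's settings before blaming the receipt itself.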

Broken UTF-8 URI Encoding in JSPs

I've got a strange issue with wrong URI encoding and would appreciate any help!
The project uses JSPs, servlets, jQuery, and Tomcat 6.
The charset in the JSPs is set to UTF-8, all Tomcat connectors use URIEncoding=UTF-8, and I also use a character encoding filter as described here.
I also set the contentType in the meta tag, and my browser detects it correctly.
In Ajax calls with jQuery I use encodeURIComponent() on the terms I want to use as URL parameters and then serialize the whole parameter set with $.param(). In the called servlet these parameters are decoded correctly with java.net.URLDecoder.decode(term, "UTF-8").
In some places I generate URLs for href attributes from a parameter map in the JSPs. Each parameter value is encoded with java.net.URLEncoder.encode(value, "UTF-8") on the JSP side, but decoding it the same way as before then results in broken special characters. Instead, I have to encode it as "ISO-8859-2" in the JSP, which is then decoded correctly as "UTF-8" in the servlet.
An example to clarify:
The term "überfall" is URI-encoded via JavaScript (%C3%BCberfall) and sent to the servlet for decoding and processing, which works. After passing it back to a JSP I encode it as UTF-8 and build the URL, which results for instance in:
Click here
However, clicking this link sends the parameter as "%C3%83%C2%BCberfall" to the servlet, which decodes it to "Ã¼berfall". The same happens when no encoding takes place.
When using "ISO-8859-2" for encoding I get:
Click here
When clicking this link I can observe in Wireshark that %C3%BCberfall is sent as parameter which decodes again to "überfall"!
Can anyone tell me what I'm missing?
EDIT:
While observing the Network Tab in Firebug I realized that by using
$.param({term : encodeURIComponent(term)});
the term is UTF-8 encoded twice, resulting in "%25C3%25BCberfall", i.e. the percent symbols are also percent-encoded. Analogously, it works for me if I call encode(term, "UTF-8") twice on each value from the parameter map.
Encoding once and not decoding the String results in "Ã¼berfall" again.

What encoding is Java using internally? Did you start your application with
-Dfile.encoding=utf-8
Please clarify where the "parameter map in the JSPs" is defined. Does it come from some persistent datastorage or are the strings given in your code as literals?
Some thoughts on what is going on, which might help:
Ã¼ is what comes out when a UTF-8 encoded ü is read as ISO-8859-1, so that each byte is decoded on its own. %C3%BC is the URI-encoded representation of the two UTF-8 bytes of ü. I think this is what's happening:
%C3%BC gets wrongly decoded to → Ã¼ which gets encoded to → %C3%83%C2%BC which then gets decoded again to → Ã¼ so you end up with Ã¼berfall.
So I guess you are using the wrong encoding when decoding a URI-encoded string. This might have something to do with the internal encoding used by Java/the JVM:
By default, the JRE 7 installer installs a European languages version if it recognizes that the host operating system only supports European languages.
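The decode, encode, decode chain described above can be reproduced in a few lines. This is only a sketch of the suspected mechanism, not the application's actual code path; the class name and the two wrapper methods are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class DoubleEncodingChain {
    // Thin wrappers that turn the checked exception into an unchecked one.
    static String decode(String s, String charsetName) {
        try {
            return URLDecoder.decode(s, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    static String encode(String s, String charsetName) {
        try {
            return URLEncoder.encode(s, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Step 1: the UTF-8 bytes are wrongly read as ISO-8859-1, giving two chars instead of one.
        String wronglyDecoded = decode("%C3%BCberfall", "ISO-8859-1");
        // Step 2: the two wrong chars are encoded as UTF-8, giving four percent-encoded bytes.
        String reEncoded = encode(wronglyDecoded, "UTF-8");
        System.out.println(reEncoded); // %C3%83%C2%BCberfall
        // Step 3: decoding as UTF-8 now faithfully returns the two wrong chars.
        System.out.println(decode(reEncoded, "UTF-8").equals(wronglyDecoded)); // true
    }
}
```

The intermediate value matches the "%C3%83%C2%BCberfall" parameter observed in the question, which supports the idea that one decode step is using ISO-8859-1.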

I think I have now definitely fixed the problem.
Following Jontro's comment I encoded all URL parameter values once and removed the manual servlet-side decoding.
Sending an ü showed up correctly as %C3%BC in Firebug's Network tab, but it still arrived as Ã¼ in the servlet.
Java was definitely set to "UTF-8" internal encoding with the -Dfile.encoding parameter.
I traced the problem to the request.getParameter() method as shown below. request.getQueryString() was OK, but extracting the actual parameters failed:
request.getCharacterEncoding() => UTF-8
request.getContentType() => null
request.getQueryString() => from=0&resultCount=10&sortAsc=true&searchType=quick&term=%C3%BC
request.getParameter("term") => Ã¼
Charset.defaultCharset() => UTF-8
OutputStreamWriter.getEncoding() => UTF8
new String(request.getParameter("term").getBytes(), "UTF-8") => Ã¼
System.getProperty("file.encoding") => UTF-8
By looking into the sources of Tomcat and Coyote, which implement request.getParameter(), I found the problem: the URIEncoding from the connector was always null, and in that case it defaults to org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING, which is "ISO-8859-1", as Wolfram said.
Long story short: my mistake was editing the server.xml in Tomcat's conf directory, which is loaded into Eclipse only ONCE, when a new server is created in the Servers view! After that, a separate server.xml in the Servers project has to be edited. Once I did that, the connector setting was loaded correctly and everything works as it should.
Thanks for the comments! Hope this helps someone...

Converting from Java String to Windows-1252 Format

I want to send a URL request, but the parameter values in the URL can contain French characters (e.g. è). How do I convert from a Java String to Windows-1252 format (which supports the French characters)?
I am currently doing this:
String encodedURL = new String (unencodedUrl.getBytes("UTF-8"), "Windows-1252");
However, it makes:
param=Stationnement extèrieur into param=Stationnement extÃ©rieur .
How do I fix this? Any suggestions?
Edit for further clarification:
The user chooses values from a drop down. When the language is French, the values from the drop down sometimes include French characters, like 'è'. When I send this request to the server, it fails, saying it is unable to decipher the request. I have to figure out how to send the 'è' as a different format (preferably Windows-1252) that supports French characters. I have chosen to send as Windows-1252. The server will accept this format. I don't want to replace each character, because I could miss a special character, and then the server will throw an exception.

Use URLEncoder to encode parameter values as application/x-www-form-urlencoded data:
String param = "param="
    + URLEncoder.encode("Stationnement ext\u00e8rieur", "cp1252");
See here for an expanded explanation.

Try using
String encodedURL = new String (unencodedUrl.getBytes("UTF-8"), Charset.forName("Windows-1252"));

As per McDowell's suggestion, I tried encoding with:
URLEncoder.encode("stringValueWithFrenchCharacters", "cp1252") but it didn't work perfectly. I replaced "cp1252" with HTTP.ISO_8859_1 because I believe Android does not support Windows-1252 yet. It does allow ISO_8859_1, and after reading here, this supports MOST of the French characters, with the exception of 'Œ', 'œ', and 'Ÿ'.
So doing this made it work:
URLEncoder.encode(frenchString, HTTP.ISO_8859_1);
Works perfectly!
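A sketch of how the two charsets compare on a desktop JVM, where windows-1252 is available (Android may differ, as noted above). The class name and helpers are hypothetical, and the sample string is the one from the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FrenchParamEncoding {
    // Percent-encode a parameter value using the bytes of the named charset.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // True when every character of the value has a byte representation in the charset.
    static boolean representable(String value, Charset charset) {
        return charset.newEncoder().canEncode(value);
    }

    public static void main(String[] args) {
        String value = "Stationnement ext\u00e8rieur";
        // U+00E8 is the single byte 0xE8 in both charsets, so the output is identical.
        System.out.println(encode(value, "windows-1252")); // Stationnement+ext%E8rieur
        System.out.println(encode(value, "ISO-8859-1"));   // Stationnement+ext%E8rieur

        // The oe ligature (U+0153) exists in windows-1252 but not in ISO-8859-1.
        System.out.println(representable("\u0153", Charset.forName("windows-1252"))); // true
        System.out.println(representable("\u0153", StandardCharsets.ISO_8859_1));     // false
    }
}
```

Checking representable() before encoding is one way to catch the few characters that ISO-8859-1 cannot carry instead of sending a substituted byte to the server.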
