Java Character Encoding on Google App Engine

I am running a GWT application on Google App Engine which passes text input from the GUI via GWT-RPC/Servlet to an API. But umlauts such as ä, ö, and ü are misinterpreted by the API, which shows only a ? in their place.
I am pretty sure that the problem is the default character encoding on Google App Engine, which is US-ASCII: US-ASCII cannot represent umlauts.
Using umlauts with the API from JUnit tests on my local machine works; the default character encoding there is UTF-8.
The problem does not come from GWT or the encoding of any HTML file; I passed a constant Java String containing umlauts from within the application to the API, and the problem only appears when the application is deployed on Google App Engine.
Is there any way to change the character encoding on Google App Engine? Or does anyone know another solution to my problem?
Funnily enough, storing umlauts from the GUI in the GAE Datastore and bringing them back to the GUI works.

I was having the same problem: the default charset of a web application deployed to Google App Engine was set to US-ASCII, but I needed it to be UTF-8.
After a bit of head scratching, I found that adding:
<system-properties>
    <property name="appengine.file.encoding" value="UTF-8" />
</system-properties>
to appengine-web.xml correctly sets the charset to UTF-8. More details can be found on Google Issue Tracker - Setting of default encoding.
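To verify the setting after deployment, a quick sanity check (a hypothetical snippet you could drop into any servlet) is to log the JVM defaults:

import java.nio.charset.Charset;

// After deploying with the system property above, both lines should report UTF-8.
System.out.println("defaultCharset = " + Charset.defaultCharset());
System.out.println("file.encoding  = " + System.getProperty("file.encoding"));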

Workaround (safe)
I wrote this class to encode Unicode strings as ASCII strings, replacing every character that is not in the ASCII table with its numeric code, preceded and followed by a marker character: AsciiEncoder.encode(yourUtfString).
The string can then be decoded back, where Unicode is supported, with AsciiEncoder.decode(yourAsciiEncodedUtfString).
package <your_package>;

import java.util.ArrayList;

/**
 * Created by Micha F. aka Peracutor.
 * 04.06.2017
 */
public class AsciiEncoder {

    // Use whatever ASCII char you like as the marker
    // (ideally one that rarely occurs in regular text).
    public static final char MARK = '%';

    public static String encode(String s) {
        // Buffer for 10 special characters (4 additional chars for
        // every special char that gets replaced).
        StringBuilder result = new StringBuilder(s.length() + 4 * 10);
        for (char c : s.toCharArray()) {
            if ((int) c > 127 || c == MARK) {
                // Replace the char with MARK + its numeric value + MARK.
                result.append(MARK).append((int) c).append(MARK);
            } else {
                result.append(c);
            }
        }
        return result.toString();
    }

    public static String decode(String s) {
        int lastMark = -1;
        ArrayList<Character> chars = new ArrayList<>();
        try {
            // Collect every encoded character; the loop terminates via the
            // IndexOutOfBoundsException thrown once no marker pair is left.
            //noinspection InfiniteLoopStatement
            while (true) {
                String charString = s.substring(
                        lastMark = s.indexOf(MARK, lastMark + 1) + 1,
                        lastMark = s.indexOf(MARK, lastMark));
                chars.add((char) Integer.parseInt(charString));
            }
        } catch (IndexOutOfBoundsException | NumberFormatException ignored) {}
        // Replace each MARK-<number>-MARK sequence with its character.
        for (char c : chars) {
            s = s.replace("" + MARK + ((int) c) + MARK, String.valueOf(c));
        }
        return s;
    }
}
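For example, a quick round trip (the commented values are what the class above produces):

String original = "Grüße";                      // contains ü (252) and ß (223)
String ascii = AsciiEncoder.encode(original);   // "Gr%252%%223%e"
String back = AsciiEncoder.decode(ascii);       // "Grüße" again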
Hope this helps someone.

If you (like me) are using the Java flexible environment on Google App Engine, the default encoding can "simply" be fixed by setting the file.encoding system property through your app.yaml (via an environment variable that is automatically picked up by the runtime) like this:
env_variables:
  JAVA_USER_OPTS: -Dfile.encoding=UTF-8

Related

Emojis and special characters in Discord webhook message from Java not working

What I'm basically trying to accomplish is to copy a user's message from one channel and, using a webhook, write it out exactly as they typed it in another channel. The problem is that emojis come out as '?'s, and many special characters (£ and é, for example) completely break it.
My code looks something like this:
package uniqueimpact.discordbot;

import java.io.IOException;

import net.dv8tion.jda.api.events.message.guild.GuildMessageReceivedEvent;
import net.dv8tion.jda.api.hooks.ListenerAdapter;

public class MessageEvent extends ListenerAdapter {

    private static final String WEBHOOK = "webhook-url";

    @Override
    public void onGuildMessageReceived(GuildMessageReceivedEvent event) {
        if (!event.getAuthor().isBot()) {
            String messageSent = event.getMessage().getContentRaw();
            // Escape backslashes, quotes and newlines for the JSON payload.
            String formattedMessage = "";
            for (int i = 0; i < messageSent.length(); i++) {
                char character = messageSent.charAt(i);
                switch (character) {
                    case '\\':
                        formattedMessage += "\\\\";
                        break;
                    case '\"':
                        formattedMessage += "\\\"";
                        break;
                    case '\n':
                        formattedMessage += "\\n";
                        break;
                    default:
                        formattedMessage += character;
                }
            }
            // DiscordWebhook is the helper class copied from the linked code.
            DiscordWebhook disWebhook = new DiscordWebhook(WEBHOOK);
            disWebhook.setContent(formattedMessage);
            try {
                disWebhook.execute();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
This code simply listens for a message, formats it to escape backslashes, quotes and newlines, and then uses the DiscordWebhook code which I copied to send the message to a webhook.
I'm aware that emojis and these special characters are part of the extended Unicode character set, but I'm not sure what to do with this information. So if anyone knows how I can fix this, that would be much appreciated. :)
I am not sure you even need to escape your special characters, but that's beside the point (actually, my first try would be to send the received String as-is, without any modifications). One of the simpler solutions is to convert your String into Unicode sequences of the form \uXXXX. In that case all your symbols (including emojis) should pass through without a glitch. There is an open-source Java library, MgntUtils, that has a utility for converting Strings to Unicode sequences and vice versa:
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library can be found on Maven Central or on GitHub. It comes as a Maven artifact with sources and javadoc.
Here is the javadoc for the class StringUnicodeEncoderDecoder.
So, I suggest taking your incoming message, converting it to Unicode sequences, and sending those over. The receiving side should display them as symbols. By the way, the same tool can help you diagnose the problem: you can look at what you receive and decode it back to a String.
Disclaimer: The library is written by me
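If you would rather avoid the dependency, the same idea fits in a few lines of plain JDK code (a sketch, not the library's implementation; it escapes every char, including both halves of surrogate pairs, which survive a round trip intact):

// Escape each UTF-16 code unit in "\\uXXXX" form.
static String toUnicodeSequence(String s) {
    StringBuilder sb = new StringBuilder(s.length() * 6);
    for (char c : s.toCharArray()) {
        sb.append(String.format("\\u%04x", (int) c));
    }
    return sb.toString();
}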

How to convert clob to string with encoding in java

We are doing a massive batch of XML processing, and the logic to convert a Clob to a String is shown below.
import java.sql.Clob
import org.apache.commons.io.IOUtils

String extractXml(Clob xmlClob) {
    log.info "DefaultCharset: " + groovy.util.CharsetToolkit.getDefaultSystemCharset()
    String sourceXml
    try {
        sourceXml = new String(IOUtils.toByteArray(xmlClob?.getCharacterStream()), encoding)           // 1. encoding not working
        sourceXml = new String(IOUtils.toByteArray(xmlClob?.getCharacterStream(), encoding), encoding) // 2. encoding working
    } catch (Exception e) {
        ...
    }
    return sourceXml
}
My queries:
a. I am not sure why (1) doesn't work even though I am using getCharacterStream() instead of getAsciiStream(), while (2) seems to work fine. Is that because I am explicitly overriding the system encoding?
b. Solution (2) looks a bit odd, since the encoding is specified twice (once for the byte array and once for the String creation). I am not sure whether there are any performance issues, and I wonder if there is a better way to write this.
c. I thought of dropping the Apache Commons libraries and using a plain JDK solution. The surprising thing is that I did not give any explicit encoding, yet it works perfectly. Is that because it streams characters straight into a string buffer?
/*
 * Works perfectly and returns the encoding correctly.
 */
String extractXmlWithoutApacheCommons(Clob xmlClob) {
    log.info "DefaultCharset: " + groovy.util.CharsetToolkit.getDefaultSystemCharset()
    StringBuffer sb = new StringBuffer((int) xmlClob.length())
    try {
        Reader r = xmlClob.getCharacterStream()
        char[] cbuf = new char[2048]
        int n = 0
        while ((n = r.read(cbuf, 0, cbuf.length)) != -1) {
            if (n > 0) {
                sb.append(cbuf, 0, n)
            }
        }
    } catch (Exception e) {
        ...
    }
    return sb.toString()
}
Can you please shed some light on this?
The Clob already has an encoding. It's whatever you've specified in the database, and once you read it on the Java side it'll be a String (with the implicit UTF-16 representation, not that it matters at all).
Whatever you think you're doing with all those encoding tricks is wrong and useless. You only need to specify an encoding when turning bytes into chars or the other way around. Here you're dealing with chars only (except in your first example, where you for some unknown reason turn them into bytes).
If you want to use IOUtils, then readFully(Reader input, char[] buffer) would be the method to use.
The platform default encoding has no effect in this whole question, since you shouldn't be working with bytes at all.
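A sketch of that char-only approach (assuming commons-io 2.2 or later, where IOUtils.readFully(Reader, char[]) exists):

// Clob.length() is the number of characters, so the buffer fits exactly;
// no bytes and therefore no charset are involved.
char[] buf = new char[(int) xmlClob.length()];
IOUtils.readFully(xmlClob.getCharacterStream(), buf);
String sourceXml = new String(buf);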
Edit:
A slightly more modern way with the standard JDK classes would be to use Reader.read(CharBuffer target) like
CharBuffer cb = CharBuffer.allocate((int) xmlClob.length());
while (r.read(cb) != -1)
    ;
cb.flip(); // rewind; without this, toString() would return the empty remaining region
return cb.toString();
but it doesn't really make a huge difference (it's a bit nicer looking).

How to compare Chinese characters in Java using 'equals()'

I want to compare a string portion (i.e. a character) against a Chinese character. I assumed that due to the Unicode encoding it counts as two characters, so I'm looping through the string in increments of two. Now I've run into a roadblock trying to detect the '兒' character: equals() doesn't match it. What am I missing? This is the code snippet:
for (int CharIndex = 0; CharIndex < tmpChar.length(); CharIndex = CharIndex + 2) {
    // Account for 'r' like in dianr/huir
    if (tmpChar.substring(CharIndex, CharIndex + 2).equals("兒")) {
Also, feel free to suggest a more elegant way to parse this ...
[UPDATE] Some pictures from the debugger, showing that it doesn't match even though it should. I pasted the Chinese character from the spreadsheet I use as input, so I don't think it's a copy-and-paste issue (unless the Unicode gets lost along the way).
Oh dang, apparently it does not work by simply copying and pasting.
Use CharSequence.codePoints(), which returns a stream of the codepoints, rather than having to deal with chars:
tmpChar.codePoints().forEach(c -> {
    if (c == '兒') {
        // ...
    }
});
(Of course, you could have used tmpChar.codePoints().filter(c -> c == '兒').forEach(c -> { /* ... */ })).
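Note that comparing c == '兒' only works because 兒 is a single char in the Basic Multilingual Plane; for characters outside the BMP, derive the target code point first. A small sketch:

int target = "兒".codePointAt(0); // also correct for supplementary characters
boolean found = tmpChar.codePoints().anyMatch(cp -> cp == target);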
Either work with chars, accepting 兒 as a substring:
String s = ...;

if (s.contains("兒")) { ... }

int position = s.indexOf("兒");
if (position != -1) {
    int position2 = position + "兒".length();
    s = s.substring(0, position) + "*" + s.substring(position2);
}

if (s.startsWith("兒", i)) {
    // At position i there is a 兒.
}
Or work with code points, where it would be a single code point. As that is not really easier, variable substrings seem fine.
if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {
is your problem: 兒 is only one UTF-16 code unit, so a two-char substring can never equal it. Many Chinese characters can be represented in a single UTF-16 code unit (Java strings use UTF-16); other characters, however, take two code units.
There are a variety of APIs on the String class for coping.
As offered in another answer, obtaining the IntStream from codepoints allows you to get a 32-bit code point for each character. You can compare that to the code point value for the character you are looking for.
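For instance, an index-based loop that advances by the real width of each character could look like this (a sketch, not the original poster's code):

for (int i = 0; i < tmpChar.length(); ) {
    int cp = tmpChar.codePointAt(i);
    if (cp == "兒".codePointAt(0)) {
        // found one at index i
    }
    i += Character.charCount(cp); // advance by 1 or 2 char positions
}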
Or, you can use the ICU4J library with a richer set of facilities for all of this.

request.getParameterValues() not supporting UTF-8 while request.getParameter() does

An HTML UTF-8 page (<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>) is using a form with single- and multi-valued fields.
A single-valued field sending special characters (such as ä ö ü) works fine using request.getParameter(NAME).
However, if you use a multi-valued field and try to receive the values via request.getParameterValues(MULTI), the special characters are not decoded correctly.
Is this a bug in the servlet spec, specifically in the getParameterValues() method, or am I missing something?
I discovered this issue in a web application running on Tomcat 5 and Java SE 6.
Meanwhile I have a working solution (workaround). However, I would still be interested in why the straightforward approach does not work...
Working solution:
String[] strings = request.getParameterValues("multi");
if (strings != null) {
    if (response.getCharacterEncoding().equals("ISO-8859-1")) {
        for (int i = 0; i < strings.length; i++) {
            strings[i] = URLDecoder.decode(new String(strings[i].getBytes("ISO-8859-1"), "UTF-8"), "UTF-8");
        }
    }
    for (String s : strings) {
        // do whatever you want with correctly encoded special characters
    }
}
Code that does not work (but should?):
String[] strings = request.getParameterValues("multi");
if (strings != null) {
    for (String s : strings) {
        // special characters in variable s do not appear correctly
    }
}
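For what it's worth, the conventional fix (not part of the original post, and dependent on your container configuration) is to declare the request encoding before any parameter is read, for example in a filter; on Tomcat 5, GET parameters additionally need URIEncoding="UTF-8" on the connector:

import java.io.IOException;
import javax.servlet.*;

// Hypothetical filter: forces UTF-8 parameter decoding, provided it runs
// before anything on the request has been read.
public class Utf8Filter implements Filter {
    public void init(FilterConfig config) {}
    public void destroy() {}
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        if (req.getCharacterEncoding() == null) {
            req.setCharacterEncoding("UTF-8");
        }
        chain.doFilter(req, res);
    }
}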

decode character from unicode using java

I was unable to insert a Chinese character into MySQL, so I thought of doing this. I have an Excel sheet containing Chinese characters, like 秀昭 and so on.
I converted them to Unicode representations like \uXXXX using the code below, which I got from SO, and then stored those in MySQL.
private static String escapeNonAscii(String str) {
    List<String> arr = new ArrayList<String>();
    StringBuilder retStr = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        int cp = Character.codePointAt(str, i);
        System.out.println("cp=" + cp);
        int charCount = Character.charCount(cp);
        if (charCount > 1) {
            i += charCount - 1; // skip the low surrogate of a supplementary character
            if (i >= str.length()) {
                throw new IllegalArgumentException("truncated unexpectedly");
            }
        }
        if (cp < 128) {
            retStr.appendCodePoint(cp);
        } else {
            retStr.append(String.format("\\u%x", cp));
            arr.add(String.format("\\\\u%x", cp)); // note: arr is never read
        }
    }
    return retStr.toString();
}
The values have been stored properly. So now I need to display them back. When I tried
System.out.println("\u8BF7\u5728\u6B64\u5904");
It gives me the proper output:
`请在此处`
But when I read from the DB and did
System.out.println(rs.getString(1).trim().toString() + " from DB");
It printed
`\u8BF7\u5728\u6B64\u5904`
What might be the problem? Have I missed anything? Please help.
Unicode escapes are only processed at compile time; a literal \uXXXX sequence read from a database is just plain text. To store and retrieve the data from a database, you only have to consider two things: make sure the data you read has the correct encoding, and set the correct encoding when printing the data.
If you read data on a Windows machine, it is possible you have to use the cp* encodings. Just use an InputStreamReader and set the charset. Now you have the data in the JVM; the internal representation is UTF-16. With a type 4 JDBC driver you do not have to worry about encoding, except that your database needs an encoding capable of storing the data; UTF-8 or Unicode will do the trick. Consult your JDBC documentation for properties to set; sometimes you have to set an encoding explicitly (jdbc:mysql://localhost:3306/?useUnicode=yes&characterEncoding=UTF-8).
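For example, reading an exported text file with an explicit charset rather than the platform default (a sketch; the file name is hypothetical):

import java.io.*;
import java.nio.charset.StandardCharsets;

try (BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream("names.txt"), StandardCharsets.UTF_8))) {
    String line;
    while ((line = in.readLine()) != null) {
        // the chars in line are now correctly decoded
    }
}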
When outputting the data, sometimes the output must have a specific encoding. Normally your JVM runs with the default system charset, but you may need another one, for example when rendering an HTML file.
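Since the database in this question already contains literal \uXXXX text, a small runtime unescaper (a sketch assuming well-formed four-digit sequences like the ones shown above) can turn it back into characters:

// Turn literal "\\uXXXX" sequences (as stored by escapeNonAscii) back into chars.
private static String unescape(String str) {
    StringBuilder out = new StringBuilder(str.length());
    for (int i = 0; i < str.length(); ) {
        if (str.startsWith("\\u", i) && i + 6 <= str.length()) {
            out.append((char) Integer.parseInt(str.substring(i + 2, i + 6), 16));
            i += 6;
        } else {
            out.append(str.charAt(i++));
        }
    }
    return out.toString();
}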
