Storing text on GAE, non-standard unicode characters being changed

I have a servlet on Google App Engine that takes text from the page, stores it as an entity, and later sends it back to the client. When I store the word "You're", it shows up in the GAE localstore as "You're", as expected. When I return it to the client, however, I get "Youâre", and the debug output at times reads "Youâ??re". I am using the Java Text class to store this text.
How can I ensure that any Unicode characters are stored correctly? Client -> server looks fine, since the stored text is unchanged, but server -> client is definitely screwing up. Thanks!

The majority of times I've seen this problem, either the page doesn't declare that it's using UTF-8, via something like
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
or accept-charset isn't set in the form.
Could either of those be the case here?
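For what it's worth, the symptom in the question can be reproduced in plain Java. The sketch below assumes the apostrophe in the stored text was the typographic one (U+2019) and that the response was decoded as ISO-8859-1 somewhere along the way; both are guesses, and MojibakeDemo is just an illustrative name. If that is what is happening, declaring the charset on the servlet response (for example with response.setCharacterEncoding("UTF-8") before getting the writer) would be the usual thing to try.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode as UTF-8, then (wrongly) decode as ISO-8859-1,
    // which is what a client does when the charset is not declared.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the damage: recover the raw bytes and decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: the stored text used the typographic apostrophe U+2019.
        String original = "You\u2019re";
        String garbled = misdecode(original);
        System.out.println(original.length());                // 6
        System.out.println(garbled.length());                 // 8
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

The one apostrophe becomes three characters: a visible "â" followed by two control characters that most browsers do not display, which would match the "Youâre" output.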

Related

Java: how to decode a GET URL parameter received through BeanParam

I receive a GET request at this web service
@GET
@Path("/nnnnnn")
public Response pfpfpfpf(@BeanParam NNNNNN n)
The class NNNNN has:
@QueryParam("parameter")
private String parameter;
That parameter has a getter and a setter.
I send a GET request with a query parameter, and it is bound automatically to my NNNNN object; everything works.
But now I am sending Japanese strings in the query URL. I encode the parameter as UTF-8 before sending, and I have to decode it as UTF-8.
My question is: where should I call URLDecoder? I tried calling it in the getter of that parameter, but it didn't work; I kept getting something like C3%98%C2%B4%C3%98%C2 instead of the Japanese characters.

The solution that works for me is:
On the servlet, I do this:
request.setCharacterEncoding("UTF-8");
and then on the HTML page I had to add this:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
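To see why the charset used for decoding matters, here is a small sketch using java.net.URLEncoder and URLDecoder. QueryParamCodec is a made-up helper, not part of JAX-RS, and the sample value is two kanji rather than the asker's actual data. Decoding with the wrong charset and then encoding again yields an alternating %C3/%C2 sequence much like the one described in the question, which suggests (but does not prove) a double-encoding problem there.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class QueryParamCodec {
    // Percent-encode a value using UTF-8 bytes.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Percent-decode a value, interpreting the bytes in the given charset.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String value = "\u6628\u591C"; // two kanji, written as escapes
        String encoded = encode(value);
        System.out.println(encoded); // %E6%98%A8%E5%A4%9C
        System.out.println(decode(encoded, "UTF-8").equals(value)); // true

        // Wrong charset on decode, then encoding again: double-encoded output.
        String wrong = decode(encoded, "ISO-8859-1");
        System.out.println(encode(wrong)); // %C3%A6%C2%98%C2%A8%C3%A5%C2%A4%C2%9C
    }
}
```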

This is a good question, with the potential to clear up many doubts about how information is processed (encoded and decoded) between systems.
Before I proceed, I should say that you need a fair understanding of charsets, encodings, etc. You may want to read this answer for a quick heads-up.
This has to be looked at from two perspectives: browser and server.
Browser perspective of Encoding
Each browser renders the information/text, and to render it correctly it has to know how to interpret those bits/bytes (see the third bullet of my answer on how the same bits can represent different characters in different encoding schemes).
Browser page encoding
Each browser has a default encoding associated with it. Check this on how to see the browser's default encoding.
If you do not specify any encoding on your HTML page, the browser's default encoding takes effect and the page is rendered according to its rules. So if the default encoding is ASCII and you are using Japanese or Chinese, or characters from a Unicode supplementary plane, you will see garbage.
You can tell the browser not to use its default encoding but to use a specific one to render the website, using <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">.
And this is exactly what you did/found, and it worked because this meta tag essentially overrode the browser's default encoding.
Another way to achieve the same effect is to omit the meta tag and change the browser's default encoding instead. That also works, but it is not recommended; using the Content-Type meta tag in your JSP is.
Try playing around with the browser's default encoding and the meta tag using the simple HTML below.
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
の, は, でした <br></br>
昨夜, 最高
</body>
</html>
Server perspective of Encoding
The server should also know how to interpret the incoming stream of data, which basically means knowing which encoding scheme to use (the server part is tricky because there are several possibilities). Read the excerpt below, from here:
When data that has been entered into HTML forms is submitted, the form
field names and values are encoded and sent to the server in an HTTP
request message using method GET or POST, or, historically, via email.
The encoding used by default is based on a very early version of the
general URI percent-encoding rules, with a number of modifications
such as newline normalization and replacing spaces with "+" instead of
"%20". The MIME type of data encoded this way is
application/x-www-form-urlencoded, and it is currently defined (still
in a very outdated manner) in the HTML and XForms specifications. In
addition, the CGI specification contains rules for how web servers
decode data of this type and make it available to applications.
This again has two parts: how the server should decode the incoming request stream, and how it should encode the outgoing response stream.
There are several ways to do this depending upon the use case, for example:
There are methods like setCharacterEncoding, setContentType, etc. on the HTTP request and response objects, which can be used to set the encoding.
This is exactly what you did in your case: you told the server to use the UTF-8 encoding scheme for decoding the request data, because you are expecting characters outside ASCII (Japanese, in your case). But this is not all; please do read on below.
Set the encoding at server or JVM level, using JVM attributes like -Dfile.encoding=utf8. Read this article on how to set the server encoding.
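A quick way to see what the JVM-level setting does is to print the default charset, as in the sketch below. DefaultCharsetCheck is a hypothetical class name. One caveat, stated as an assumption about your runtime: on Java 18 and later the default is UTF-8 unless configured otherwise, so the first line may print UTF-8 even without the flag. Passing the charset explicitly, as utf8Length does, avoids depending on the default at all.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetCheck {
    // Name of the charset the JVM uses when none is given explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    // Number of bytes the text occupies when encoded explicitly as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + defaultCharsetName());
        // Two kanji, three UTF-8 bytes each, whatever the default charset is.
        System.out.println(utf8Length("\u6628\u591C")); // 6
    }
}
```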
In your case you were fetching the Japanese characters from the query string of the URL, and the query string is part of the HTTP request object, so by using request.setCharacterEncoding("UTF-8"); you were able to get the desired result.
But the same will not work for URL encoding, which is different from request encoding (your case). Consider the example below: in both System.out.println calls you will not see the desired effect even after using request.setCharacterEncoding("UTF-8");, because here you need URL decoding, since the URL will be something like http://localhost:7001/springapp/forms/executorTest/encodingTest/hellothere 昨夜, 最高 and this URL has no query string.
@RequestMapping(value="/encodingTest/{quertStringValue}", method=RequestMethod.GET)
public ModelAndView encodingTest(@PathVariable("quertStringValue") String quertStringValue, ModelMap model, HttpServletRequest request) throws UnsupportedEncodingException {
System.out.println("############### quertStringValue " + quertStringValue);
request.setCharacterEncoding("UTF-8");
System.out.println("############### quertStringValue " + quertStringValue);
return new ModelAndView("ThreadInfo", "ThreadInfo", "####### This is my encoded output " + quertStringValue);
}
Depending upon the framework you are using, you may need additional configuration to specify a character encoding for requests or URLs, so that you can either apply your own encoding if the request does not already specify one, or enforce the encoding in any case. This is useful because current browsers typically do not set a character encoding even if one is specified in the HTML page or form.
In Spring, there is org.springframework.web.filter.CharacterEncodingFilter for configuring request encoding. Read this similar interesting question which is based on this fact.
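For the path case above, where the value is part of the URL path rather than the query string, the decoding step itself can be sketched with java.net.URI, whose getPath() decodes percent-escaped octets as UTF-8. PathSegmentDecoding is a hypothetical helper and says nothing about how Spring resolves path variables internally; it only shows what correct decoding of such a URL should produce.

```java
import java.net.URI;
import java.net.URISyntaxException;

class PathSegmentDecoding {
    // Return the last path segment of the URL with percent-escapes decoded.
    static String lastPathSegment(String url) {
        try {
            // getPath() decodes escaped octets as UTF-8; getRawPath() would not.
            String path = new URI(url).getPath();
            return path.substring(path.lastIndexOf('/') + 1);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The example URL from the text, ending in two percent-encoded kanji.
        String url = "http://localhost:7001/springapp/forms/executorTest/"
                + "encodingTest/%E6%98%A8%E5%A4%9C";
        String segment = lastPathSegment(url);
        System.out.println(segment.length());                // 2
        System.out.println(segment.equals("\u6628\u591C")); // true
    }
}
```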
In a nutshell
Every computer program, whether an application server, web server, browser, IDE, etc., understands only bits, so it needs to know how to interpret those bits to make sense of them, because depending on the encoding used, the same bits can represent different characters. That is where an encoding comes into the picture: it gives each character a unique representation, so that all programs, across diverse operating systems, know exactly how to interpret it.

Internet Explorer doesn't handle html encoding in URL (GWT)

Using GWT, I've got a webapp, and on a certain page it pulls a parameter from the URL that has the pipe character (|) encoded. So, for example, the full URL would be (in dev mode):
http://127.0.0.1:8888/Home.html?gwt.codesvr=127.0.0.1:9997#DynamicPromo:pk=3%257C1000
and when I pull the parameter "pk" I should get "3|1000" (%257C is the encoded pipe character).
Well, this works just fine in Firefox and Chrome.
In IE (I'm using 11), I get "3%7C1000" when I pull the parameter. For whatever reason, IE drops the 25 in the encoded character, meaning it's no longer a pipe char and my app breaks.
I've read around and found that encoding issues are common on IE. In particular, I found this page: http://support.microsoft.com/kb/928847
Its suggested solutions include:
Disable the Auto-Select setting in Internet Explorer.
Provide the character set in the HTTP headers.
Move the META tag to within the first kilobyte of data that is parsed
by MSHTML.
I've tried those 3 and it didn't help. Here is the beginning of my Home.html:
<!doctype html>
<html>
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
The other two suggestions:
Increase the size of the server's initial HTTP response. The initial
size should be at least 1 KB.
Make sure that the System Locale setting matches the character set of
the META tag that is specified in the HTML page.
I don't feel will do anything. My system locale settings are correct. And since my meta tags are at the beginning of the document, they are within the first kilobyte of data, so they would be read first. So I don't see why I'd need to increase the HTTP response size.
So, I need IE to properly read this encoded character for the web application to work properly. Does anyone have any other suggestions I could try?
UPDATE:
How the URL is encoded:
URL.encodePathSegment(place.getValue())
Where URL is from the package com.google.gwt.http.client
getValue() is set from this:
public static String encodePk(PrimaryKey pk)
{
if(pk != null)
{
return String.valueOf(pk.getPk()).concat("|").concat(String.valueOf(pk.getCpk()));
}
else{
return "";
}
}
The final result is the url I posted at the top:
http://127.0.0.1:8888/Home.html?gwt.codesvr=127.0.0.1:9997#DynamicPromo:pk=3%257C1000
Where the part after "pk=" is the encoded string.

In order to make sure IE kept the encoding intact, I had to first decode the URL as soon as I set it:
public void setValue(String value)
{
this.value = decodeURI(value);
}
private static native String decodeURI( String s )
/*-{
return decodeURI(s);
}-*/;
Thanks a lot for the help!
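The value in the URL above is percent-encoded twice, which the following sketch makes visible. It uses java.net.URLEncoder as a stand-in for GWT's URL.encodePathSegment (an assumption; the two are not identical, though as far as I know both turn | into %7C and % into %25), and DoubleEncodingDemo is an illustrative name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class DoubleEncodingDemo {
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String once = encode("3|1000");
        String twice = encode(once);
        System.out.println(once);                  // 3%7C1000
        System.out.println(twice);                 // 3%257C1000
        System.out.println(decode(twice));         // 3%7C1000
        System.out.println(decode(decode(twice))); // 3|1000
    }
}
```

So a value of 3%7C1000 on the receiving side means exactly one of the two decoding steps has happened, and whichever side decodes the remaining layer has to do so consistently across browsers.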

Try JavaScript's encodeURIComponent() function to encode the string. This function makes a string portable, so it can be transmitted across any network to any computer that supports ASCII characters.
This function encodes special characters.
In addition, it encodes the following characters: , / ? : @ & = + $ #
For more info click HERE.
Here is a sample code using JSNI:
public static final native String encodeURIComponent(String uri) /*-{
return encodeURIComponent(uri);
}-*/;

Why does my Unicode String get corrupted when passed from a Java applet to JavaScript?

I'm pretty new, so don't be too harsh :)
Question(tl;dr)
I'm facing a problem passing a Unicode String from a javax.swing.JApplet embedded in a web page to the JavaScript part. I'm not sure whether this is a bug or a misunderstanding of the involved technologies:
Problem
I want to pass a Unicode string from a Java applet to JavaScript, but the String gets messed up. Strangely, the problem doesn't occur in Internet Explorer 10, but it does in Chrome (v26) and Firefox (v20). I haven't tested other browsers though.
The returned String seems to be okay, except for the last Unicode character. The result in the JavaScript debugger and web page would be:
abc → abc
表示 → 表��
ま → ま
ウォッチリスト → ウォッチリス��
アップロード → アップロー��
ホ → ��
ホ → ホ (Not deterministic)
アップロードabc → アップロードabc
The string seems to get corrupted at the last bytes. If it ends with an ASCII character the string is okay. Additionally the problem doesn't occur within every combination and also not every time (not sure on this). Therefore I suspect a bug and I'm afraid I might be posting an invalid question.
Test Set Up
A minimalistic set up includes an applet that returns some unicode (UTF-8) strings:
/* TestApplet.java */
import javax.swing.*;
public class TestApplet extends JApplet {
private String[] testStrings = {
"abc", // OK (because ASCII only)
"表示", // Error on last Character
"表示", // Error on last Character
"ホーム ", // OK (because of *space* after ム)
"アップロード", ... };
public TestApplet() {...}; // Applet specific stuff
...
public int getLength() { return testStrings.length;};
String getTestString(int i) {
return testStrings[i]; // Build-in array functionality because of IE.
}
}
The corresponding web page with java script could look like this:
/* test.html */
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<span id="output"/>
<applet id='output' archive='test.jar' code=testApplet/>
</body>
<script type="text/javascript" charset="utf-8">
var applet = document.getElementById('output');
var node = document.getElementById("1");
for(var i = 0; i < applet.getLength(); i++) {
var text = applet.getTestString(i);
var paragraphNode = document.createElement("p");
paragraphNode.innerHTML = text;
node.appendChild(paragraphNode);
}
</script>
</html>
Environment
I'm working on Windows 7 32-Bit with the current Java Version 1.7.0_21 using the "Next Generation Java Plug-in 10.21.2 for Mozilla browsers". I had some problems with my operating system locale, but I tried several (English, Japanese, Chinese) regional settings.
In case of a corrupt String, Chrome shows invalid characters (e.g. ��). Firefox, on the other hand, drops the string completely if it would end with ��.
Internet Explorer manages to display the strings correctly.
Solutions?
I can imagine several workarounds, including escaping/unescaping and adding a "final char" which then is removed via java script. Actually I'm planning to write against Android's Webkit, and I haven't tested it there.
Since I would like to continue testing in Chrome, (because of Webkit technology and comfort) I hope there is a trivial solution to the problem, which I might have overlooked.

If you are testing in Chrome/Firefox, please replace the first line with this and then test again:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
The doctype matters for how the browser identifies the page.
The Transitional/loose type is the one you can use with Unicode. Please test and reply.

I suggest to set a breakpoint on
paragraphNode.innerHTML = text;
and inspect text in the JavaScript console, e.g. with
console.log(escape(text));
or
console.log(encodeURIComponent(text));
or
for (i=0; i < text.length; i++) {
console.log("i = "+i);
console.log("text.charAt(i) = "+text.charAt(i)
+", text.charCodeAt(i) = "+text.charCodeAt(i));
}
See also
http://www.fileformat.info/info/unicode/char/30a6/index.htm
https://developer.mozilla.org/en-US/docs/DOM/window.escape (which is not part of any standard)
and
https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/encodeURIComponent
or similar resources.
Your source files may not be in the encoding you assume (UTF-8).
JavaScript assumes UTF-16 strings:
http://www.ecma-international.org/ecma-262/5.1/#sec-4.3.16
Java also assumes UTF-16:
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html
The Linux or Cygwin file command can show you the encoding of your files.
See
http://linux.die.net/man/1/file (haven't found a kernel.org man reference)
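The same inspection can be done on the Java side before the string crosses into JavaScript. The sketch below (CodeUnitDump is a made-up name) lists the UTF-16 code units of a string; the literals are written as Unicode escapes so the result does not depend on the encoding of the source file. A trailing U+FFFD in the output would indicate that the string was already damaged inside the applet rather than on the way to the browser.

```java
class CodeUnitDump {
    // Render each UTF-16 code unit of the text as U+XXXX, separated by spaces.
    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A single katakana character, given as an escape.
        System.out.println(dump("\u30DB"));        // U+30DB
        // A healthy character followed by the replacement character.
        System.out.println(dump("\u8868\uFFFD")); // U+8868 U+FFFD
    }
}
```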

You need to make sure to add the following Java Argument to your applet/embed tag:
-Dfile.encoding=utf-8
i.e. java_arguments="-Dfile.encoding=utf-8"
Otherwise it is going to expect and treat the applet as ASCII text.

Okay, I'm a little bit embarrassed, because I thought I had tried enough: I was actually using a non-Latin locale (e.g. Chinese (PRC) or Japanese (Japan)) in Windows' system locale settings. When I changed back to English (USA) or German (Germany), everything worked as expected.
I'm still wondering why it would affect Chrome and Mozilla in such a strange way, because Java and modern browsers should be Unicode-based, so I won't accept this as an answer! The problem reoccurs when switching back to Japanese, and I'm going to test it on different systems.
I want to thank all the posters for the enlightening input... and I will keep putting some effort into solving this question.

Accessing an XML file and updating it in Java

Is it possible to write a little Java program which parses an XML file from my web hosting site and updates this file? Or is there a better alternative? I have to update the file every 10 min with about 10 lines of code each, so I don't want to write it out every time.

You can write a little Java program. BTW, you can write a bigger one too :).
You can write the program in any language you want, including Java.
A program written in any language can parse XML.
Well, now we arrive at the problem. What do you mean when you say that you wish to parse XML from the web site? Does your web site provide a URL that allows downloading the XML? In that case you can download it (e.g. using the HTTP GET method) and parse it.
The next problem is how to update the XML on the site. You have to provide such functionality on the site itself (e.g. implement a service that is able to receive the XML and store it, for example via HTTP POST).
Once you are done, you can write a truly little Java program that downloads the file using HTTP GET, parses it, creates a new one, and then sends it back to the site using HTTP POST.
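The parse-and-update step in the middle could look roughly like this, using only the JDK's DOM and transformer APIs. The element name entry and the class XmlUpdater are invented for illustration, since the question does not show the file's structure; downloading and uploading are left out.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class XmlUpdater {
    // Parse the XML text, append <entry>text</entry> to the root element,
    // and return the updated document as a string.
    static String appendEntry(String xml, String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Element entry = doc.createElement("entry"); // hypothetical element name
            entry.setTextContent(text);
            doc.getDocumentElement().appendChild(entry);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not update XML", e);
        }
    }

    public static void main(String[] args) {
        String updated = appendEntry("<log><entry>a</entry></log>", "b");
        System.out.println(updated);
    }
}
```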

I would investigate running this code on your server. What is the update? Does it use data not on your server? You can do this easily in Java but more detail is needed for a better answer.
OK, your idea is fine. If your web hosting lets you do HTTP PUT, you can get the file using GET, modify it (e.g. using the DOM), and PUT it back. You might prefer to do more server-side scripting and write the update at that end; it lets you use relational databases instead of flat files, for example. In this case, writing a server-side script to accept a POST, as the other answer suggested, is a good idea.
I've put a little example for you here: http://jcable.users.sourceforge.net/scores.php
There are three files on this website:
scores - a text file that gets updated.
scores.php - a script that shows the current score
add.php - a script that updates the current score.
scores.php looks like this:
<html>
<body>
The current score is <?php readfile("scores"); ?>
</body>
</html>
add.php looks like this:
<?php
if(isset($_POST["score"])) {
$fh = fopen("scores", 'w') or die("can't open file");
fwrite($fh, $_POST["score"]);
fclose($fh);
}
else
{
?>
<HTML>
<BODY>
<FORM action="add.php" method="POST">
New Score: <INPUT type="text" name="score"/>
</FORM>
</BODY>
</HTML>
<?php
}
?>
If you GET add.php, it will present you with the form. But if your program calls it with a POST, it won't bother. Hope this gives you some ideas; it's the simplest possible web app I can think of that has persistent server-side data. You can add complexity (XML, JSON, etc.), but the principles are there.
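From the Java side, calling add.php with a POST could be sketched as follows, using java.net.HttpURLConnection. ScorePoster is a hypothetical class, and the endpoint URL is passed in by the caller because the exact location of add.php is not given above (presumably it sits next to scores.php). The form field name score matches the PHP script.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class ScorePoster {
    // Build the form body that add.php reads through $_POST["score"].
    static String formBody(String score) {
        try {
            return "score=" + URLEncoder.encode(score, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // POST the score to the given endpoint and return the HTTP status code.
    static int post(String endpoint, String score) throws IOException {
        byte[] body = formBody(score).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type",
                "application/x-www-form-urlencoded; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ScorePoster <url-of-add.php> <score>
        if (args.length == 2) {
            System.out.println("HTTP status: " + post(args[0], args[1]));
        } else {
            System.out.println(formBody("42")); // score=42
        }
    }
}
```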

HTML + Javascript Renderer that outputs HTML or plaintext?

If I use:
String plain = Html.fromHtml(html).toString();
to render simple 'html' that contains:
<!doctype html>
<html>
<head><meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>Google</title>
</head>
<body>any plain vanilla HTML goes here
</body>
All is nice and dandy.
But what if that page contains tons of Javascript code that is nicely rendered by all web browsers but isn't available to me?
Is there a renderer that takes care of the Javascript as well, to output HTML or plaintext, that isn't necessarily going to a visual display?
(I know about WebView, but my understanding is that I can't really access its output. Or can I?)

Is there a renderer that takes care of the Javascript as well, to output HTML or plaintext, that isn't necessarily going to a visual display?
WebView or bust.
(I know about WebView, but my understanding is that I can't really access its output. Or can I?)
Create a Java object to receive your output
Add that Java object to the WebView via addJavascriptInterface()
Use loadUrl("javascript:...") on the WebView to invoke a hunk of Javascript that gathers your information and calls a method on your Java object
