I am sending an AJAX request with jQuery's post() and serialize(), which use UTF-8.
For example, when 'ś' is the value of the name input, JavaScript shows name=%C5%9B.
I have tried setting form encoding without success.
<form id="dyn_form" action="dyn_ajax.xml" style="display:none;" accept-charset="UTF-8">
The same happens with encodeURI(document.getElementById("name_id").value). I'm using Servlets on Tomcat 5.5.
I had this kind of problem many times.
Verify your pages are saved in UTF-8 encoding.
If it's really UTF-8, try decodeURIComponent.
I always had a hard time convincing the request object to decode the URIEncoded strings correctly.
I finally made the following hack.
String result = null;
try {
    String pvalue = req.getParameter(name);
    if (pvalue != null) {
        // The container decoded the UTF-8 bytes as ISO-8859-1, so reverse
        // that step: recover the raw bytes, then decode them as UTF-8.
        byte[] pbytes = pvalue.getBytes("ISO-8859-1");
        result = new String(pbytes, "UTF-8");
    }
} catch (java.io.UnsupportedEncodingException e) {
    // Should never happen: ISO-8859-1 and UTF-8 are supported by every JVM.
}
I don't really like this, and it's been a while since I stopped developing servlets, but this behavior was already present on Tomcat 5.5, so it's worth trying.
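To see why this hack works, here is a minimal standalone sketch (my own example, not from the original answer) of the same round trip. Re-encoding as ISO-8859-1 is lossless because every byte value maps to some ISO-8859-1 character, so the original UTF-8 bytes can always be recovered:

```java
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    // Reverse a mis-decode: the container read UTF-8 bytes as ISO-8859-1;
    // re-encoding as ISO-8859-1 recovers the original bytes, which we then
    // decode with the charset the browser actually used.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "ś";  // the browser sends this as %C5%9B (UTF-8)
        // Simulate what the container hands back when it decodes
        // those two bytes as ISO-8859-1:
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(repair(garbled).equals(original));  // true
    }
}
```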
I am developing a web application with Java and Tomcat 8. This application has a page for uploading a file with the content that will be shown in a different page. Plain simple.
However, these files might contain not-so-common characters as part of their text. Right now, I am working with a file that contains Vietnamese text, for example.
The file is encoded in UTF-8 and can be opened in any text editor. However, I couldn't find any way to upload it and keep the content in the correct encoding, despite searching a lot and trying many different things.
My page which uploads the file contains the following form:
<form method="POST" action="upload" enctype="multipart/form-data" accept-charset="UTF-8" >
File: <input type="file" name="file" id="file" multiple/><br/>
Param1: <input type="text" name="param1"/> <br/>
Param2: <input type="text" name="param2"/> <br/>
<input type="submit" value="Upload" name="upload" id="upload" />
</form>
It also contains:
<%@page contentType="text/html" pageEncoding="UTF-8"%>
...
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
My servlet looks like this:
protected void processRequest(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {
    response.setContentType("text/html;charset=UTF-8");
    // Must be called before the first getParameter()
    request.setCharacterEncoding("UTF-8");
    String param1 = request.getParameter("param1");
    String param2 = request.getParameter("param2");
    for (Part filePart : request.getParts()) {
        InputStream filecontent = filePart.getInputStream();
        String content = convertStreamToString(filecontent, "UTF-8");
        // Save the content and the parameters in the database
        filecontent.close();
    }
}
static String convertStreamToString(java.io.InputStream is, String encoding) {
    try (java.util.Scanner s = new java.util.Scanner(is, encoding).useDelimiter("\\A")) {
        return s.hasNext() ? s.next() : "";
    }
}
Despite all my efforts, I have never been able to get that "content" string with the correct characters preserved. I get something like "K?n" or "Káº¡n" (which seems to be the ISO-8859-1 interpretation of the bytes), when the correct result should be "Kạn".
To add to the problem, if I write Vietnamese characters in the other form parameters (param1 or param2), which also needs to be possible, I can only read them correctly if I set both the form's accept-charset and the servlet scanner encoding to ISO-8859-1, which I definitely don't understand. In that case, if I print the received parameter I get something like "K & # 7 8 4 1 ; n" (without the spaces), which contains a numeric character reference for the correct character.
So it seems to be possible to read the Vietnamese characters from the form using ISO-8859-1, as long as the form itself uses that charset. However, it never works on the content of the uploaded files. I even tried encoding the file in ISO-8859-1, so that the same charset would be used for everything, but that does not work at all.
I am sure this type of situation is not that rare, so I would like to ask some help from the people who might have been there before. I am probably missing something, so any help is appreciated.
Thank you in advance.
Edit 1: Although this question is yet to receive a reply, I will keep posting my findings, in case someone is interested or following it.
After trying many different things, I seem to have narrowed down the causes of problem. I created a class which reads a file from a specific folder in the disk and prints its content. The code goes:
public static void openFile() {
    System.out.println(String.format("file.encoding: %s", System.getProperty("file.encoding")));
    System.out.println(String.format("defaultCharset: %s", Charset.defaultCharset().name()));
    File file = new File(myFilePath);
    byte[] buffer = new byte[(int) file.length()];
    try (BufferedInputStream f = new BufferedInputStream(new FileInputStream(file))) {
        int read = f.read(buffer);  // assumes the whole file fits in one read
        String content = new String(buffer, 0, read, "UTF-8");
        System.out.println("UTF-8 File: " + content);
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}
Then I added a main function to this class, making it executable. When I run it standalone, I get the following output:
file.encoding: UTF-8
defaultCharset: UTF-8
UTF-8 File: {"...Kạn..."}
However, if I run the project as a webapp, as it is supposed to be run, and call the same function from that class, I get:
file.encoding: Cp1252
defaultCharset: windows-1252
UTF-8 File: {"...K?n..."}
Of course, this was clearly showing that the default encoding used by the webapp to read the file was not UTF-8. So I did some research on the subject and found the classical answer of creating a setenv.bat for Tomcat and having it execute:
set "JAVA_OPTS=%JAVA_OPTS% -Dfile.encoding=UTF-8"
The result, however, is still not right:
file.encoding: UTF-8
defaultCharset: UTF-8
UTF-8 File: {"...Káº¡n..."}
I can see now that the default encoding became UTF-8. The content read from the file, however, is still wrong. The content shown above is the same I would get if I opened the file in Microsoft Word, but chose to read it using ISO-Latin-1 instead of UTF-8. For some odd reason, reading the file is still working with ISO-Latin-1 somewhere, although everything points out to the use of UTF-8.
Again, if anyone might have suggestions or directions for this, it will be highly appreciated.
I don't seem to be able to close the question, so let me contribute with the answer I found.
The problem is that investigating this type of issue is very tricky, since there are many points in the code where the encoding might be changed (the page, the form encoding, the request encoding, file reading, file writing, console output, database writing, database reading...).
In my case, after doing everything that I posted in the question, I lost a lot of time trying to solve an issue that didn't exist any longer, just because the console output in my IDE (NetBeans, for that project) didn't use the desired character encoding. So I was doing everything right to a certain point, but when I tried to print anything I would get it wrong. After I started writing my logs to files, instead of the console, and thus controlling the writing encoding, I started to understand the issue clearly.
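A simple way to take the console's encoding out of the equation is to write diagnostics through a writer with an explicit charset, as described above. A minimal sketch (the file name is my own choice for illustration):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf8Log {
    public static void main(String[] args) throws IOException {
        // Write the log with an explicit UTF-8 writer, so the platform or
        // console encoding cannot distort what we see when inspecting it.
        try (PrintWriter log = new PrintWriter(new OutputStreamWriter(
                new FileOutputStream("debug.log"), StandardCharsets.UTF_8))) {
            log.println("content: Kạn");
        }
    }
}
```

Opening debug.log in an editor set to UTF-8 then shows the true state of the string, independent of the IDE console.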
What was missing in my solution, after everything I had already described in my question (before the edit), was to configure the encoding for the database connection. To my surprise, even though my database and all of my tables were using UTF-8, the communication between the application and MySQL was still in ISO-Latin-1. The last missing piece was adding "useUnicode=true&characterEncoding=utf-8" to the connection, like this:
con = DriverManager.getConnection("jdbc:mysql:///dbname?useUnicode=true&characterEncoding=utf-8", "user", "pass");
Thanks to this answer, amongst many others: https://stackoverflow.com/a/3275661/843668
I am trying to grab the data from a JSON file through Java. If I navigate to the URL in my browser, everything displays fine, but if I try to get the data through Java I get a bunch of characters that cannot be interpreted or parsed. Note that this code works with other JSON files. Could this be a server-side thing with the way the JSON file is created? I tried messing around with different character sets and that did not seem to fix the problem.
public static void main(String[] args) throws Exception {
    URL url = new URL("http://www.minecraftpvp.com/api/ping.json");
    URLConnection connection = url.openConnection();
    BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }
    in.close();
}
The output I get from this is just a ton of strange characters that make no sense at all, whereas if I change the URL to something like google.com, it works fine.
EDIT: JSON URL from StackExchange API returning jibberish? Seemed to have answered my question. I tried searching before I asked to make sure the answer wasn't here and couldn't find anything. Guess I didn't look hard enough.
Yes that URL is returning gzip encoded content by default.
You can do one of three things:
1. Explicitly set the Accept-Encoding header in your request. A web service should not return gzip-compressed content unless gzip is listed as an accepted encoding in the request, so this site is not being very friendly; your browser presumably advertises gzip support, which is why the JSON displays correctly there. Per the spec, setting the header to an empty value should yield an unencoded response, but your mileage may vary.
2. Use the answer in "How to handle non-UTF8 html page in Java?", which shows how to decompress the response. This should be preferred over #1.
3. And/or ask the person hosting the service to implement the recommended scheme: only send compressed responses when the client says it can handle them, or when that can be inferred from the browser fingerprint with high confidence.
Best of luck C.
You need to inspect the Content-Encoding header. The URL in question improperly returns gzip-compressed content even when you don't ask for it, and you'll need to run it through a decoder.
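To illustrate the decompression approach without hitting the network, here is a self-contained sketch (class and method names are my own); a gzip-compressed JSON body is simulated in memory and decoded the same way you would decode an HTTP response whose Content-Encoding header says gzip:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipDemo {
    // Decompress a gzip-encoded body into a UTF-8 string, as one would after
    // seeing Content-Encoding: gzip on the response.
    static String gunzipToString(byte[] body) throws IOException {
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(body)),
                StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = r.readLine()) != null) sb.append(line);
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate the server: gzip-compress a small JSON document.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(new GZIPOutputStream(buf),
                                               StandardCharsets.UTF_8)) {
            w.write("{\"status\":\"ok\"}");
        }
        System.out.println(gunzipToString(buf.toByteArray()));  // {"status":"ok"}
    }
}
```

With a real URLConnection you would pick the raw stream or wrap it in a GZIPInputStream depending on the value of connection.getContentEncoding().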
I have a JSF app with international users, so form inputs can contain non-Western strings such as kanji and Chinese. If I hit my URL with ?q=東日本大, the output on the page is correct and the q input in my form gets populated fine. But if I enter that same string into the form and submit, my app redirects back to itself after constructing the URL with the populated parameters in it (this seems redundant, but it is due to a third-party integration), and the redirect does not encode the string properly. I have
url = new String(url.getBytes("ISO-8859-1"), "UTF-8");
response.sendRedirect(url);
But the URL in the redirect ends up containing q=????. I've played around with various encoding names in the String constructor (switching ISO-8859-1 and UTF-8 around just produced gibberish in the URL), but none of them gets me q=東日本大. Any ideas as to what I need to do to get q=東日本大 into the redirect properly? Thanks.
How are you making your url? URIs can't directly have non-ASCII characters in; they have to be turned into bytes (using a particular encoding) and then %-encoded.
URLEncoder.encode should be given an encoding argument, to ensure this is the right encoding. Otherwise you get the default encoding, which is probably wrong and always to be avoided.
String q= "\u6771\u65e5\u672c\u5927"; // 東日本大
String url= "http://example.com/query?q="+URLEncoder.encode(q, "utf-8");
// http://example.com/query?q=%E6%9D%B1%E6%97%A5%E6%9C%AC%E5%A4%A7
response.sendRedirect(url);
This URI will display as the IRI ‘http://example.com/query?q=東日本大’ in the browser address bar.
Make sure you're serving your pages as UTF-8 (using Content-Type header/meta) and interpreting query string input as UTF-8 (server-specific; see this faq for Tomcat.)
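A runnable version of the encode step above, with the matching decode on the receiving side (the class name is my own; the expected %-encoding matches the comment in the snippet above):

```java
import java.net.URLDecoder;
import java.net.URLEncoder;

public class EncodeDemo {
    public static void main(String[] args) throws Exception {
        String q = "\u6771\u65e5\u672c\u5927"; // 東日本大
        // Percent-encode the UTF-8 bytes so the value is safe inside a URI.
        String encoded = URLEncoder.encode(q, "UTF-8");
        System.out.println(encoded);  // %E6%9D%B1%E6%97%A5%E6%9C%AC%E5%A4%A7
        // The receiver must decode with the same charset to get the text back.
        System.out.println(URLDecoder.decode(encoded, "UTF-8").equals(q));  // true
    }
}
```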
Try
response.setContentType("text/html; charset=UTF-16");
response.setCharacterEncoding("utf-16");
I am trying to convert a UTF-8 string to a Java Unicode string.
String question = request.getParameter("searchWord");
byte[] bytes = question.getBytes();
question = new String(bytes, "UTF-8");
The input is Chinese characters, and when I compare the hex code of each character it is the same Chinese character. So I'm pretty sure the charset is UTF-8.
Where do I go wrong?
There's no such thing as a "UTF-8 string" in Java. Everything is in Unicode.
When you call String.getBytes() without specifying an encoding, that uses the platform default encoding - that's almost always a bad idea.
You shouldn't have to do anything to get the right characters here - the request should be handling it all for you. If it's not doing so, then chances are it's lost data already.
Could you give an example of what's actually going wrong? Specify the Unicode values of the characters in the string you're receiving (e.g. by using toCharArray() and then converting each char to an int) and what you expected to receive.
EDIT: To diagnose this, use something like this:
public static void dumpString(String text) {
for (int i = 0; i < text.length(); i++) {
System.out.println(i + ": " + (int) text.charAt(i));
}
}
Note that that will give the decimal value of each Unicode character. If you have a handy hex library method around, you may want to use that to give you the hex value. The main point is that it will dump the Unicode characters in the string.
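A small variant of the same dumper that prints hex code points directly (the class and method names are my own additions):

```java
public class DumpHex {
    // Dump each UTF-16 code unit of the string as U+XXXX hex.
    static String dumpHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("%d: U+%04X%n", i, (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpHex("東"));  // 0: U+6771
    }
}
```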
First make sure that the data is actually encoded as UTF-8.
There are some inconsistencies between browsers regarding the encoding used when sending HTML form data. The safest way to send UTF-8 encoded data from a web form is to put that form on a page that is served with the Content-Type: text/html; charset=utf-8 header or contains a <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> meta tag.
Now to properly decode the data call request.setCharacterEncoding("UTF-8") in your servlet before the first call to request.getParameter().
The servlet container takes care of the encoding for you. If you use setCharacterEncoding() properly you can expect getParameter() to return normal Java strings.
Also, you may need a special filter that takes care of the encoding of your requests. For example, such a filter exists in the Spring Framework: org.springframework.web.filter.CharacterEncodingFilter.
String question = request.getParameter("searchWord");
is all you have to do in your servlet code. At this point you should not have to deal with encodings, charsets, etc.; that is all handled by the servlet infrastructure. When you notice problems like �, ?, or ü being displayed somewhere, something is probably wrong with the request the client sent. But without knowing the infrastructure or seeing the logged HTTP traffic, it is hard to tell what is wrong.
Possibly:
question = new String(bytes, "UNICODE");
I have a webpage that is encoded (through its header) as WIN-1255.
A Java program creates text strings that are automatically embedded in the page. The problem is that the original strings are encoded in UTF-8, producing a gibberish text field in the page.
Unfortunately, I can not change the page encoding - it's required by a customer propriety system.
Any ideas?
UPDATE:
The page I'm creating is an RSS feed that needs to be set to WIN-1255, showing information taken from another feed that is encoded in UTF-8.
SECOND UPDATE:
Thanks for all the responses. I've managed to convert the string, and yet: gibberish. The problem was that the XML encoding had to be set in addition to the header encoding.
Adam
To the point, you need to set the encoding of the response writer. With only a response header you're basically only instructing the client application which encoding to use to interpret/display the page. This ain't going to work if the response itself is written with a different encoding.
The context where you have this problem is entirely unclear (please elaborate about it as well in future problems like this), so here are several solutions:
If it is JSP, you need to set the following in top of JSP to set the response encoding:
<%@ page pageEncoding="WIN-1255" %>
If it is Servlet, you need to set the following before any first flush to set the response encoding:
response.setCharacterEncoding("WIN-1255");
Both, by the way, implicitly set the Content-Type response header with a charset parameter, which instructs the client to use the same encoding to interpret/display the page. Also see this article for more information.
If it is a homegrown application which relies on the basic java.net and/or java.io API's, then you need to write the characters through an OutputStreamWriter which is constructed using the constructor taking 2 arguments wherein you can specify the encoding:
Writer writer = new OutputStreamWriter(someOutputStream, "WIN-1255");
Assuming you have control of the original (properly represented) strings, and simply need to output them in win-1255:
import java.nio.charset.*;
import java.nio.*;

Charset win1255 = Charset.forName("windows-1255");
ByteBuffer bb = win1255.encode(someString);
byte[] ba = new byte[bb.limit()];
bb.get(ba);  // copy the encoded bytes out of the buffer
Then, simply write the contents of ba at the appropriate place.
EDIT: What you do with ba depends on your environment. For instance, if you're using servlets, you might do:
ServletOutputStream os = ...
os.write(ba);
We also should not overlook the possible approach of calling setContentType("text/html; charset=windows-1255") (setContentType), then using getWriter normally. You did not make completely clear if windows-1255 was being set in a meta tag or in the HTTP response header.
You clarified that you have a UTF-8 file that you need to decode. If you're not already decoding the UTF-8 strings properly, this should be no big deal. Just look at InputStreamReader(someInputStream, Charset.forName("utf-8")).
What's embedding the data in the page? Either it should read it as text (in UTF-8) and then write it out again in the web page's encoding (Win-1255) or you should change the Java program to create the files (or whatever) in Win-1255 to start with.
If you can give more details about how the system works (what's generating the web page? How does it interact with the Java program?) then it will make things a lot clearer.
The page I'm creating is an RSS feed that needs to be set to WIN-1255, showing information taken from another feed that is encoded in UTF-8.
In this case, use a parser to load the UTF-8 XML. This should correctly decode the data to UTF-16 character data (Java Strings are always UTF-16). Your output mechanism should encode from UTF-16 to Windows-1255.
byte[] originalUtf8; // input comes in here
// UTF-8 bytes to a Java String:
String internal = new String(originalUtf8, Charset.forName("utf-8"));
// Java String to windows-1255 bytes:
byte[] win1255 = internal.getBytes(Charset.forName("cp1255"));
// output goes out here
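Putting the two steps together, here is a self-contained sketch of the whole pipeline (the class name and the Hebrew sample text are my own; Hebrew is chosen because windows-1255 covers it, so the round trip is lossless):

```java
import java.nio.charset.Charset;

public class Win1255Demo {
    // Full pipeline: UTF-8 bytes in, windows-1255 bytes out.
    static byte[] utf8ToWin1255(byte[] utf8) {
        String internal = new String(utf8, Charset.forName("UTF-8"));
        return internal.getBytes(Charset.forName("windows-1255"));
    }

    public static void main(String[] args) {
        // "שלום" (Hebrew for "peace"/"hello") survives the conversion intact.
        byte[] utf8 = "שלום".getBytes(Charset.forName("UTF-8"));
        byte[] win = utf8ToWin1255(utf8);
        System.out.println(new String(win, Charset.forName("windows-1255")));  // שלום
    }
}
```

Characters outside the windows-1255 repertoire would be replaced with '?' by getBytes, so this only works when the feed content is representable in the target charset.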