Java string encoding conversion within a webpage - java

I have a webpage that is encoded (through its header) as WIN-1255.
A Java program creates text string that are automatically embedded in the page. The problem is that the original strings are encoded in UTF-8, thus creating a Gibberish text field in the page.
Unfortunately, I can not change the page encoding - it's required by a customer propriety system.
Any ideas?
UPDATE:
The page I'm creating is an RSS feed that needs to be set to WIN-1255, showing information taken from another feed that is encoded in UTF-8.
SECOND UPDATE:
Thanks for all the responses. I've managed to convert th string, and yet, Gibberish. Problem was that XML encoding should be set in addition to the header encoding.
Adam

To the point, you need to set the encoding of the response writer. With only a response header you're basically only instructing the client application which encoding to use to interpret/display the page. This ain't going to work if the response itself is written with a different encoding.
The context where you have this problem is entirely unclear (please elaborate about it as well in future problems like this), so here are several solutions:
If it is JSP, you need to set the following in top of JSP to set the response encoding:
<%# page pageEncoding="WIN-1255" %>
If it is Servlet, you need to set the following before any first flush to set the response encoding:
response.setCharacterEncoding("WIN-1255");
Both by the way automagically implicitly set the Content-Type response header with a charset parameter to instruct the client to use the same encoding to interpret/display the page. Also see this article for more information.
If it is a homegrown application which relies on the basic java.net and/or java.io API's, then you need to write the characters through an OutputStreamWriter which is constructed using the constructor taking 2 arguments wherein you can specify the encoding:
Writer writer = new OutputStreamWriter(someOutputStream, "WIN-1255");

Assuming you have control of the original (properly represented) strings, and simply need to output them in win-1255:
import java.nio.charset.*;
import java.nio.*;
Charset win1255 = Charset.forName("windows-1255");
ByteBuffer bb = win1255.encode(someString);
byte[] ba = new byte[bb.limit()];
Then, simply write the contents of ba at the appropriate place.
EDIT: What you do with ba depends on your environment. For instance, if you're using servlets, you might do:
ServletOutputStream os = ...
os.write(ba);
We also should not overlook the possible approach of calling setContentType("text/html; charset=windows-1255") (setContentType), then using getWriter normally. You did not make completely clear if windows-1255 was being set in a meta tag or in the HTTP response header.
You clarified that you have a UTF-8 file that you need to decode. If you're not already decoding the UTF-8 strings properly, this should no big deal. Just look at InputStreamReader(someInputStream, Charset.forName("utf-8"))

What's embedding the data in the page? Either it should read it as text (in UTF-8) and then write it out again in the web page's encoding (Win-1255) or you should change the Java program to create the files (or whatever) in Win-1255 to start with.
If you can give more details about how the system works (what's generating the web page? How does it interact with the Java program?) then it will make things a lot clearer.

The page I'm creating is an RSS feed that needs to be set to WIN-1255, showing information taken from another feed that is encoded in UTF-8.
In this case, use a parser to load the UTF-8 XML. This should correctly decode the data to UTF-16 character data (Java Strings are always UTF-16). Your output mechanism should encode from UTF-16 to Windows-1255.

byte[] originalUtf8;//Here input
//utf-8 to java String:
String internal = new String(originalUtf8,Charset.forName("utf-8");
//java string to w1255 String
byte[] win1255 = internal.getBytes(Charset.forName("cp1255"));
//Here output

Related

JSP doesn't display texts from DB in a language other than English

I have the following code in my JSP
... <%
out.println(request.getAttribute("textFromDB")); %> ...
When the JSP is called it just prints question marks (????..) instead of the actual text stored in a MySQL database which is not in English. What can I do to make it display the text correctly. I tried to change the charset and pageEncoding to UTF-8 but it didn't help.
Isn't encoding nice? Unfortunately it's hard to tell where it gets wrong: Your database might store in another character set than UTF-8. Or your database driver might be configured to work in another encoding. Or your server defaults to another one. Or your HTTP connection assumes different encoding and your charset change comes too late.
You'll have to go through all those layers - and keep in mind that everything might look fine and it's been the long-past write-operations to your database that already messed up the data beyond repair.
System.out.println(this.request.getHeader("Content-Encoding")); //check the content type
String data = new String(this.request.getParameter("data").getBytes("ISO-8859-1"), "UTF-8"); //this way it is encoded to byte and then to a string
if the above method didnt work you might wanna check with the database
if it encode characters to "UTF-8"
or
you can configure URIEncoding="UTF-8" in your tomcat setup and just (request.getAttribute("textFromDB")); do the rest.

URL decode ä -> ã1⁄4

I have the problem that the decoding from a URL causes some major problems. The request URL contains %C3%BC as the letter 'ü'. The decoding server side should now decode it as an ü, but it does this: ü
decoding is done like this:
decoded = URLDecoder.decode(value, "UTF-8");
while value contains '%C3%BC' and decoded should now conatain 'ü', but that's where the problem is. What's going wrong here? I use this method in more than one application and it works fine in all other cases...
I don't have enough reputation yet to comment, so I'll have to make this as close to an answer as possible.
If you're using a servlet, and "value" is something that you got from calling getParameter() on the servlet, then it has already been decoded (rightly or wrongly) by the servlet container. (Tomcat?)
Likewise if it's part of the path. Your servlet container probably decoded it assuming that the percent-encoded bytes were ISO-8859-1, which is the default setting for Tomcat. See the document for the URIEncoding attribute of the Connector element in Tomcat's server.xml file, if that's what appserver you're using. If you set it to UTF-8, Tomcat will assume that percent-encoded bytes represent UTF-8 text.
You are probably outputting the value wrong. First decoded.length() (assumedly 1) gives a fair indication; you could dump it too, Arrays.toString(decoded.toCharArray()).
In the IDE console under Windows you could see something like that mess for a Windows single byte ANSI encoding.
For the rest take care of:
String s;
byte[] b;
s.getBytes() -> s.getBytes(StandardCharsets.UTF_8)
s.getBytes("Cp1252") // Windows Latin-1
new String(b) -> new String(b, StandardCharsets.UTF_8)

Broken UTF-8 URI Encoding in JSPs

I got a strange issue with wrong URI Encoding and would appreciate any help!
The project uses JSPs, Servlets, Jquery, Tomcat 6.
Charset in the JSPs is set to UTF-8, all Tomcat connectors use URIEncoding=UTF-8 and I also use a character encoding filter as described here.
Also, I set the contentType in the meta Tag and my browser detects it correctly.
In Ajax calls with Jquery I use encodeURIComponent() on the terms I want to use as URL Parameters and then serialize the whole parameter set with $.param(). In the called servlet these parameters are decoded correctly with Java.net.URLDecoder.decode(term, "UTF-8").
In some places I generate URLs for href elements from a parameter map in the JSPs. Each parameter value is encoded with Java.net.URLEncoder.encode(value, "UTF-8") on JSP side but then decoding it the same way as before results in broken special characters. Instead, I have to encode it as "ISO-8859-2" in the JSP which is then decoded correctly as "UTF-8" in the servlet.
An example for clarifying:
The term "überfall" is URIEncoded via Javascript (%C3%BCberfall) and sent to the servlet for decoding and processing, which works. After passing it back to a JSP I would encode it as UTF-8 and build the URL which results for instance in:
Click here
However, clicking this link will send the parameter as "%C3%83%C2%BCberfall" to the servlet which decodes to "überfall". The same occurs when no encoding takes place.
When, using "ISO-8859-2" for encoding I get:
Click here
When clicking this link I can observe in Wireshark that %C3%BCberfall is sent as parameter which decodes again to "überfall"!
Can anyone tell me where I miss something?
EDIT:
While observing the Network Tab in Firebug I realized that by using
$.param({term : encodeURIComponent(term)});
the term is UTF-8 encoded twice, resulting in "%25C3%25BCberfall", i.e. the percent symbols are also percent-encoded. Analogously, it works for me if I call encode(term, "UTF-8") twice on each value from the parameter map.
Encoding once and not decoding the String results in "überfall" again.
What encoding is Java using internally? Did you start your application with
-Dfile.encoding=utf-8
Please clarify where the "parameter map in the JSPs" is defined. Does it come from some persistent datastorage or are the strings given in your code as literals?
Some thoughts on what is going on, which might help:
ü is what comes out when a UTF-8 encoded ü is read expecting ISO-8859-1, when each byte is decoded on its own. %C3%BC is the URI-encoded representationg of both UTF-8 bytes of a UTF-8 ü. I think this is what's happening:
%C3%BC gets wrongly decoded to → ü which gets encoded to → %C3%83%C2%BC which then gets decoded again to → ü so you end up with überfall.
So I guess, you use the wrong encoding for decoding a URI-encoded string. This might have something to do with the internal encoding used by Java/the JVM:
By default, the JRE 7 installer installs a European languages version if it recognizes that the host operating system only supports European languages.
I think I fixed the problem now definitely.
Following Jontro's comment I encoded all URL parameter values once and removed the manual servlet-side decoding.
Sending an ü should look like %C3%BC in Firebug's Network tab which gave me ü in the servlet.
Java was definitely set to "UTF-8" internal encoding with the -Dfile.encoding parameter.
I traced the problem to the request.getParameter() method like this. request.getQueryString was ok, but when extracting the actual parameters it fails:
request.getCharacterEncoding()) => UTF-8
request.getContentType() => null
request.getQueryString() => from=0&resultCount=10&sortAsc=true&searchType=quick&term=%C3%BC
request.getParameter("term") => ü
Charset.defaultCharset() => UTF-8
OutputStreamWriter.getEncoding() => UTF8
new String(request.getParameter("term").getBytes(), UTF-8) => ü
System.getProperty("file.encoding") => UTF-8
By looking into the sources of Tomcat and Coyote which implement request.getParameter() i found the problem: the URIEncoding from the connector was always null and in this case it defaults to org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING which is "ISO-8859-1" like Wolfram said.
Long story short: my fault was editing the server.xml in Tomcat's conf directory which is only loaded ONCE into Eclipse when a new server is created in the servers view! After that, a separate server.xml in the Servers project has to be edited. After doing so, the connector setting is loaded correctly and everything works as it should.
Thanks for the comments! Hope this helps someone...

Get the source code of a URL

I need to get the source code of the particular URL using a java code. I was able to get the source code for UTF-8 encoded web page but was not able to get the code for ISO-8859-1 encoded character set. My question, is it possible to get the source code of website with iso-8859-1 using a java program? Please help
If you are reading by using following method you need to Specify character set explicitly by
URL url = new URL(URL_TO_READ);
BufferedReader in = new BufferedReader(
new InputStreamReader(url.openStream(),"ISO-8859-1" ));
How ever if there is little parsing include with your requirement I would suggest you to use JSOUP and it will read the character-set from the response of server, Also you could explicitly set the charset

How to send parameters with same encoding from javascript?

I have a javascript file that lots of people have embedded to their pages. Since I am hosting the file, I have control over that javascript file; I cannot control the way it is embedded because lots of people is using it already.
This javascript file sends GET requests to my servlets, and the parameters passed with the request are recorded to DB. For example, javascript sends a request to http://myserver.com/servlet?p1=123&p2=aString and then servlet records 123 and aString to DB somehow.
Before sending strings I use encodeURIComponent() to encode it. But what I figured out is every client sends the same string with different encodings depending on either their browser or the site they are visiting. As a result, same strings are represented with different characters when it reaches servlet (so they are different strings).
What I am trying to do is to convert the strings to one kind of encoding from javascript so when they reach the client same words are represented with same characters.
How is this possible?
PS. If there is a way to convert the encoding from Java it is also applicable.
Edit: To be more precise, I select some words from the page and send it to the server. That is where encoding causes problems.
Edit 2: I am NOT sending (and can't send) GET requests via XMLHttpRequest, because domains are different. I am using adding script tag to head method that #streetpc mentioned.
Edit 3: At the moment I am sanitizing the strings by replacing non-ASCII characters at javascript side, but I have a feeling that this is not the way to go:
function sanitize(word) {
/*
ğ : \u011f
ü : \u00fc
ş : \u015f
ö : \u00f6
ç : \u00e7
ı : \u0131
û : \u00fb
*/
return encodeURIComponent(
word.replace(/\u011f/g, '_g')
.replace(/\u00fc/g, '_u')
.replace(/\u00fb/g, '_u')
.replace(/\u015f/g, '_s')
.replace(/\u00f6/g, '_o')
.replace(/\u00e7/g, '_c')
.replace(/\u0131/g, '_i'));
}
what I figured out is every client sends the same string with different encodings
Whilst that would be normal for <form> submissions, it should not happen for XMLHttpRequest work. The encodeURIComponent function explicitly always writes URL-encoded UTF-8 bytes, regardless of the encoding of the page from which it was used. Of course persuading your servlet container to allow you to read those UTF-8 bytes without messing them up is another story, but that shouldn't depend on the client.
What might be a problem is if you are using raw non-ASCII characters inside your script file itself. In that case the interpretation of those characters will vary according to the charset the browser is using to load the script. This may be affected by:
any charset declared in the Content-Type: text/javascript;charset= header.
any charset attribute declared on the <script src="..." charset="..."> element.
the charset of the page that included the script.
(1) and (2) are not supported in all browsers. Normally you can rely on (3), but as a third-party script author that is out of your control. Therefore you should use only ASCII characters in your script. (Use \u1234 escapes to include non-ASCII characters in string literals in your script to get around this limitation.)
Do you specify the encoding of the JavaScript file in the HTTP headers? Like Content-type: text/javascript; charset=utf-8 with the .js file beign saved in UTF-8 of course. With Apache, you can configure
AddCharset utf-8 .js
Or you can make the hosted javascript file create another script tag with a charset='utf-8' parameter and add-it to the head element (like most bookmarklets do).
I think the javascript being interpreted as UTF-8 code should then get/manipulate UTF-8 strings.
Then, in your Java Servlet, you can specify the input encoding to use:
request.setCharacterEncoding("UTF-8");
Edit: check this page about Character Encoding in JavaScript, especially the part named "Setting the Character Encoding".

Categories