Charset trouble (ø as 00F8) - java

I am getting a string from our database (a third-party tool), and I have trouble with one
name: sometimes it comes back correctly as "Tarsøy" and everything runs smoothly, but sometimes it comes back as "Tars00F8y".
This ruins the process. I have tried to write a validator function using URLDecoder.decode(name, "UTF-8") that takes a string and returns a validated one, but without success.
This is how I get a string from our database:
Database.WIKI.get(index); // the index is the ID of the string
// this is a NoSQL DB
Now, about "sometimes": it means that the same code just behaves differently =) I think it is connected with internal DB exceptions or something similar. So I am trying to do something like validate(Database.WIKI.get(index)).
Maybe I should try something like encoding the string to UTF-8?

In Java, JavaScript and (especially interesting) JSON there exists the notation \u00F8 for ø. I think this was sent to the database, maybe from a specific browser on a specific computer locale. The \u disappeared and voilà. Maybe it is still there as an invisible character in the string; that would be nice for repairs.
My guess is JSON data; however, JSON libraries normally parse u-escaped characters, so that is weird.
Check what happens when storing "x\\u00FDx". Is the character length 6, or maybe 7 (lucky)?
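If the escape survived literally in the stored string (a backslash, a 'u' and four hex digits), it can be repaired mechanically. Below is a minimal, hypothetical sketch of such a repair; note that it cannot help with the "Tars00F8y" case where the backslash and the 'u' are already gone, because bare hex digits inside a name cannot be told apart from ordinary text.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical repair helper: turns a literal escape sequence such as
// backslash-u-0-0-F-8 back into the character it stands for (here: ø).
public final class UnicodeEscapeRepair {

    // a literal backslash, the letter 'u', then exactly four hex digits
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    public static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}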
Some sanity checks, assuming you work in UTF-8, especially if the data arrives via HTML or JS:
Content-Type header text/html; charset=UTF-8
(Optional) meta tag with charset=UTF-8
<form action="..." accept-charset="UTF-8">
JSON: contentType: "application/json; charset=UTF-8"

Related

request.getParameter() returns corrupted data - Java

In my project, I pass a string like account<s from the UI to the server using an HTTP POST. The value is fetched in the backend with the request.getParameter() method of HttpServlet. getParameter() returns an encoded string: the account<s value is fetched as account& lt;s.
Now in the UI I need to display account<s. If the value were encoded as account&lt;s, I could use HTML decoding on the UI side. But the encoded string has an additional space: instead of &lt;, I am getting & lt;.
jQuery Code:
var params = {};
params.passVal = "account<s";
// ajax call
$.ajax({
    type: "POST",
    url: url,
    data: params,
    dataType: "json",
    async: false
}).success(function(json) {
    // success notification
});
Java Code:
String receivedVal = request.getParameter("passVal"); //account& lt;s
I am using Apache Tomcat 7 and jQuery v2.1.3.
For all the encoded characters, a space is added between the first and the second character. Why is it behaving like this? And how can I get the original data in Java?
This problem occurred because of a servlet filter class in which the encoding process is defined: instead of <, it was encoded as & lt. Thanks a lot #tak3shi for pointing out the root cause.
An HTML entity (&lt;) is not URL encoding; you need to encode the < as %3C.
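To make the distinction concrete, here is a small sketch (class and variable names are made up): URL encoding turns < into %3C, which the servlet container decodes back to < when you call getParameter(); the HTML entity &lt; is a separate mechanism that only belongs in HTML output.
import java.net.URLDecoder;
import java.net.URLEncoder;

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        String value = "account<s";

        // URL encoding, as used in the request itself
        String urlEncoded = URLEncoder.encode(value, "UTF-8");      // "account%3Cs"
        String urlDecoded = URLDecoder.decode(urlEncoded, "UTF-8"); // "account<s" again

        System.out.println(urlEncoded);
        System.out.println(urlDecoded);
    }
}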

How to save arabic words into oracle database?

I want to save an Arabic word into an Oracle database. The user types an Arabic word on the client side and submits it. On the client side I printed the word using alert and it shows the Arabic text correctly. But on the server side, the Java console (using System.out.println) shows it as شاحÙØ©, and so it is shown in the DB as ????. I saw a related post that suggested changing the 'text file encoding' to UTF-8 in Eclipse; I changed it, but there was no effect and it still shows characters like شاحÙØ©. Then I changed the application's 'text file encoding' to UTF-8 and got the same output. I think the word is being sent to the DB like this, which is why the DB shows ????. Is there any solution?
My code in Java:
vehicleServiceModel.setVehicleType(request.getParameter("vehicleType"));
System.out.println("vehicle Type : "+vehicleServiceModel.getVehicleType());
Client side:
jQuery.ajax({
    type: "GET",
    cache: false,
    url: "addvehicle.htm",
    data: { vehName: VehicleName, make: Make, model: Model, color: Color, plateNumber: PlateNumber, driverName: DriverName, vehicleType: VehicleType, vehTimeZone: vehTimeZone },
    contentType: "application/json; charset=utf-8",
    dataType: "json",
    success: Success,
    error: Error
});
function Success(data, status) {
    // some code
}
I am answering this question myself; hopefully this will help someone else.
My issue was resolved as follows:
I changed this in Java:
vehicleServiceModel.setVehicleType(request.getParameter("vehicleType"));
String str = vehicleServiceModel.getVehicleType();
// re-decode the bytes that the container decoded as ISO-8859-1 into a proper UTF-8 string
str = new String(str.getBytes("8859_1"), "UTF-8");
System.out.println("vehicle Type : " + str);
vehicleServiceModel.setVehicleType(str);
Now it is resolved and Arabic words are saved into the database.
For more details, please have a look at this.
On the command line, you need to set the code page to 1256 (Arabic). But to store Arabic text in a database, you need to set the column data to UTF-8. Also, make sure that your charset is set to UTF-8 (if you're building a web page).
I suggest you use UTF-8 all the way, i.e. in the web page, in Eclipse and your source code, and in the database (NLS_CHARACTERSET or define the column as NVARCHAR2). This way you will not need conversions.
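A minimal sketch of that "no conversions" approach on the JDBC side, assuming a hypothetical VEHICLE table with an NVARCHAR2 column VEHICLE_TYPE (both names made up for illustration); with a national-character column the text is stored as Unicode regardless of the database's NLS_CHARACTERSET:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical DAO sketch; table and column names are not from the question.
public class VehicleDao {

    public void saveVehicleType(Connection connection, String vehicleType) throws SQLException {
        String sql = "INSERT INTO VEHICLE (VEHICLE_TYPE) VALUES (?)";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            // setNString sends the value as national-character (Unicode) data,
            // so no manual getBytes()/new String() round trip is needed
            ps.setNString(1, vehicleType);
            ps.executeUpdate();
        }
    }
}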

How to convert UTF8 to Unicode

I am trying to convert a UTF-8 string to a Java Unicode string.
String question = request.getParameter("searchWord");
byte[] bytes = question.getBytes();
question = new String(bytes, "UTF-8");
The input is Chinese characters, and when I compare the hex code of each character it is the same Chinese character. So I'm pretty sure that the charset is UTF-8.
Where do I go wrong?
There's no such thing as a "UTF-8 string" in Java. Everything is in Unicode.
When you call String.getBytes() without specifying an encoding, that uses the platform default encoding - that's almost always a bad idea.
You shouldn't have to do anything to get the right characters here - the request should be handling it all for you. If it's not doing so, then chances are it's lost data already.
Could you give an example of what's actually going wrong? Specify the Unicode values of the characters in the string you're receiving (e.g. by using toCharArray() and then converting each char to an int) and what you expected to receive.
EDIT: To diagnose this, use something like this:
public static void dumpString(String text) {
    for (int i = 0; i < text.length(); i++) {
        System.out.println(i + ": " + (int) text.charAt(i));
    }
}
Note that that will give the decimal value of each Unicode character. If you have a handy hex library method around, you may want to use that to give you the hex value. The main point is that it will dump the Unicode characters in the string.
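If you don't have such a helper handy, a minimal variant of the same diagnostic using only String.format to print each char as a hex code point could look like this:
// prints each character as a hex code point, e.g. "4: U+00F8"
public static void dumpStringHex(String text) {
    for (int i = 0; i < text.length(); i++) {
        System.out.println(i + ": U+" + String.format("%04X", (int) text.charAt(i)));
    }
}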
First make sure that the data is actually encoded as UTF-8.
There is some inconsistency between browsers regarding the encoding used when sending HTML form data. The safest way to send UTF-8 encoded data from a web form is to put that form on a page that is served with the Content-Type: text/html; charset=utf-8 header or contains a <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> meta tag.
Now to properly decode the data call request.setCharacterEncoding("UTF-8") in your servlet before the first call to request.getParameter().
The servlet container takes care of the encoding for you. If you use setCharacterEncoding() properly you can expect getParameter() to return normal Java strings.
Also, you may need a special filter which will take care of the encoding of your requests. For example, such a filter exists in the Spring Framework: org.springframework.web.filter.CharacterEncodingFilter.
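For illustration, a minimal hand-rolled filter of that kind (a sketch assuming the javax.servlet API; Spring's CharacterEncodingFilter does essentially the same thing) could look like this:
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Forces UTF-8 on every request before any parameter is read.
public class Utf8EncodingFilter implements Filter {

    @Override
    public void init(FilterConfig filterConfig) {
        // nothing to configure in this sketch
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        request.setCharacterEncoding("UTF-8");   // must happen before getParameter()
        response.setCharacterEncoding("UTF-8");
        chain.doFilter(request, response);
    }

    @Override
    public void destroy() {
        // no resources to release
    }
}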
String question = request.getParameter("searchWord");
is all you have to do in your servlet code. At this point you do not have to deal with encodings, charsets, etc.; this is all handled by the servlet infrastructure. When you notice problems like �, ? or ü being displayed somewhere, there may be something wrong with the request the client sent. But without knowing more about the infrastructure or seeing the logged HTTP traffic, it is hard to tell what is wrong.
Possibly:
question = new String(bytes, "UNICODE");

How to send parameters with same encoding from javascript?

I have a JavaScript file that lots of people have embedded in their pages. Since I am hosting the file, I have control over that JavaScript file; I cannot control the way it is embedded because lots of people are using it already.
This JavaScript file sends GET requests to my servlets, and the parameters passed with the request are recorded to the DB. For example, the JavaScript sends a request to http://myserver.com/servlet?p1=123&p2=aString and the servlet then records 123 and aString to the DB somehow.
Before sending strings I use encodeURIComponent() to encode them. But what I figured out is that every client sends the same string with a different encoding, depending on either their browser or the site they are visiting. As a result, the same string is represented with different characters when it reaches the servlet (so they end up as different strings).
What I am trying to do is convert the strings to one kind of encoding in JavaScript, so that when they reach the server the same words are represented with the same characters.
How is this possible?
P.S. If there is a way to convert the encoding in Java, that is also applicable.
Edit: To be more precise, I select some words from the page and send them to the server. That is where the encoding causes problems.
Edit 2: I am NOT sending (and can't send) GET requests via XMLHttpRequest, because the domains are different. I am using the script-tag-in-head method that #streetpc mentioned.
Edit 3: At the moment I am sanitizing the strings by replacing non-ASCII characters on the JavaScript side, but I have a feeling that this is not the way to go:
function sanitize(word) {
    /*
      ğ : \u011f
      ü : \u00fc
      ş : \u015f
      ö : \u00f6
      ç : \u00e7
      ı : \u0131
      û : \u00fb
    */
    return encodeURIComponent(
        word.replace(/\u011f/g, '_g')
            .replace(/\u00fc/g, '_u')
            .replace(/\u00fb/g, '_u')
            .replace(/\u015f/g, '_s')
            .replace(/\u00f6/g, '_o')
            .replace(/\u00e7/g, '_c')
            .replace(/\u0131/g, '_i'));
}
what I figured out is every client sends the same string with different encodings
Whilst that would be normal for <form> submissions, it should not happen for XMLHttpRequest work. The encodeURIComponent function explicitly always writes URL-encoded UTF-8 bytes, regardless of the encoding of the page from which it was used. Of course persuading your servlet container to allow you to read those UTF-8 bytes without messing them up is another story, but that shouldn't depend on the client.
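For the servlet-container part of that story, one option (if you cannot or do not want to change how the container decodes the URI, e.g. Tomcat's URIEncoding connector attribute) is to decode the raw query string yourself as UTF-8. The helper below is a hypothetical sketch, not part of any framework mentioned here:
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: parses the raw query string and URL-decodes every
// name and value as UTF-8, bypassing the container's default decoding.
public final class Utf8QueryParser {

    public static Map<String, String> parse(String rawQuery) throws UnsupportedEncodingException {
        Map<String, String> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            params.put(URLDecoder.decode(name, "UTF-8"),
                       URLDecoder.decode(value, "UTF-8"));
        }
        return params;
    }
}
Then, in the servlet, read values from Utf8QueryParser.parse(request.getQueryString()) instead of request.getParameter().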
What might be a problem is if you are using raw non-ASCII characters inside your script file itself. In that case the interpretation of those characters will vary according to the charset the browser is using to load the script. This may be affected by:
1. any charset declared in the Content-Type: text/javascript;charset= header;
2. any charset attribute declared on the <script src="..." charset="..."> element;
3. the charset of the page that included the script.
(1) and (2) are not supported in all browsers. Normally you can rely on (3), but as a third-party script author that is out of your control. Therefore you should use only ASCII characters in your script. (Use \u1234 escapes to include non-ASCII characters in string literals in your script to get around this limitation.)
Do you specify the encoding of the JavaScript file in the HTTP headers? Like Content-Type: text/javascript; charset=utf-8, with the .js file being saved in UTF-8 of course. With Apache, you can configure
AddCharset utf-8 .js
Or you can make the hosted JavaScript file create another script tag with a charset='utf-8' attribute and add it to the head element (like most bookmarklets do).
The JavaScript, being interpreted as UTF-8 code, should then get/manipulate UTF-8 strings.
Then, in your Java Servlet, you can specify the input encoding to use:
request.setCharacterEncoding("UTF-8");
Edit: check this page about Character Encoding in JavaScript, especially the part named "Setting the Character Encoding".

Java string encoding conversion within a webpage

I have a webpage that is encoded (through its header) as WIN-1255.
A Java program creates text strings that are automatically embedded in the page. The problem is that the original strings are encoded in UTF-8, thus creating a gibberish text field in the page.
Unfortunately, I cannot change the page encoding; it's required by a customer's proprietary system.
Any ideas?
UPDATE:
The page I'm creating is an RSS feed that needs to be set to WIN-1255, showing information taken from another feed that is encoded in UTF-8.
SECOND UPDATE:
Thanks for all the responses. I've managed to convert the string, and yet, gibberish. The problem was that the XML encoding had to be set in addition to the header encoding.
Adam
To the point, you need to set the encoding of the response writer. With only a response header you're basically only instructing the client application which encoding to use to interpret/display the page. This ain't going to work if the response itself is written with a different encoding.
The context where you have this problem is entirely unclear (please elaborate about it as well in future problems like this), so here are several solutions:
If it is JSP, you need to set the following in top of JSP to set the response encoding:
<%@ page pageEncoding="windows-1255" %>
If it is Servlet, you need to set the following before any first flush to set the response encoding:
response.setCharacterEncoding("WIN-1255");
Both by the way automagically implicitly set the Content-Type response header with a charset parameter to instruct the client to use the same encoding to interpret/display the page. Also see this article for more information.
If it is a homegrown application which relies on the basic java.net and/or java.io API's, then you need to write the characters through an OutputStreamWriter which is constructed using the constructor taking 2 arguments wherein you can specify the encoding:
Writer writer = new OutputStreamWriter(someOutputStream, "WIN-1255");
Assuming you have control of the original (properly represented) strings, and simply need to output them in win-1255:
import java.nio.charset.*;
import java.nio.*;
Charset win1255 = Charset.forName("windows-1255");
ByteBuffer bb = win1255.encode(someString);
byte[] ba = new byte[bb.limit()];
Then, simply write the contents of ba at the appropriate place.
EDIT: What you do with ba depends on your environment. For instance, if you're using servlets, you might do:
ServletOutputStream os = ...
os.write(ba);
We also should not overlook the possible approach of calling setContentType("text/html; charset=windows-1255") (setContentType), then using getWriter normally. You did not make completely clear if windows-1255 was being set in a meta tag or in the HTTP response header.
You clarified that you have a UTF-8 file that you need to decode. If you're not already decoding the UTF-8 strings properly, this should be no big deal. Just look at new InputStreamReader(someInputStream, Charset.forName("UTF-8")).
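A minimal sketch along those lines, reading the source feed as UTF-8 and writing it out as windows-1255; the stream variables are placeholders for however you obtain the feed and the response output, and remember (as the asker's second update points out) that the encoding attribute of the XML declaration must be changed to windows-1255 as well:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class FeedTranscoder {

    public static void transcode(InputStream utf8In, OutputStream win1255Out) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(utf8In, "UTF-8"));
             Writer writer = new OutputStreamWriter(win1255Out, "windows-1255")) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                // Java chars are Unicode (UTF-16); the writer encodes them to windows-1255 on output.
                writer.write(buffer, 0, read);
            }
        }
    }
}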
What's embedding the data in the page? Either it should read it as text (in UTF-8) and then write it out again in the web page's encoding (Win-1255) or you should change the Java program to create the files (or whatever) in Win-1255 to start with.
If you can give more details about how the system works (what's generating the web page? How does it interact with the Java program?) then it will make things a lot clearer.
The page I'm creating is an RSS feed that needs to be set to WIN-1255, showing information taken from another feed that is encoded in UTF-8.
In this case, use a parser to load the UTF-8 XML. This should correctly decode the data to UTF-16 character data (Java Strings are always UTF-16). Your output mechanism should encode from UTF-16 to Windows-1255.
byte[] originalUtf8; // here: input
// UTF-8 bytes to Java String:
String internal = new String(originalUtf8, Charset.forName("UTF-8"));
// Java String to windows-1255 bytes:
byte[] win1255 = internal.getBytes(Charset.forName("cp1255"));
// here: output
