Unable to retrieve special characters like ® while using HttpServletResponseWrapper - java

I am unable to retrieve a special character (®) while using HttpServletResponseWrapper, processing the response with RequestDispatcher include.
I tried setting the character encoding to UTF-8 and also did try the configuration changes in WEB.xml. Still I am not able to retrieve the character and moreover, I am not even seeing an encoded string in place of the character.
Any help is appreciated.

Related

How to handle  (object replacement character) in URL

Using Jsoup to scrape URLS and one of the URLS I keep getting has this  symbol in it. I have tried decoding the URL:
url = URLDecoder.decode(url, "UTF-8" );
but it still remains in the code looking like this:
I cant find much online about this other than it is "The object replacement character, sometimes used to represent an embedded object in a document when it is converted to plain text."
But if this is the case I should be able to print the symbol if it is plain text but when I run
System.out.println("");
I get the following complication error:
and it reverts back to the last save.
Sample URL: https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/
NOTE: If you decode the url then compare it to the decoded url it comes back as not the same e.g.:
String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
if(url.contains("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles?/")){
System.out.println("The same");
}else {
System.out.println("Not the same");
}
That's not a compilation error. That's the eclipse code editor telling you it can't save the source code to a file, because you have told it to save the file in a cp1252 encoding, but that encoding can't express a .
Put differently, your development environment is currently configured to store source code in the cp1252 encoding, which doesn't support the character you want, so you either configure your development environment to store source code using a more flexible encoding (such as UTF-8 the error message suggests), or avoid having that character in your source code, for instance by using its unicode escape sequence instead:
System.out.println("\ufffc");
Note that as far as the Java language and runtime are concerned,  is a character like any other, so there may not be a particular need to "handle" it. Also, I am unsure why you'd expect URLDecoder to do anything if the URL hasn't been URL-encoded to begin with.
"ef bf bc" is a 3 bytes UTF-8 character so as the error says, there's no representation for that character in "CP1252" Windows page encoding.
An option could be to replace that percent encoding sequence with an ascii representation to make the filename for saving:
String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/".replace("%ef%bf%bc", "-xEFxBFxBC"), "UTF-8");
url ==> "https://www.breightgroup.com/job/hse-advisor-emb ... contract-roles-xEFxBFxBC/"
Another option using CharsetDecoder
String urlDec = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
CharsetDecoder decoder = Charset.forName("CP1252").newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE);
String urlDec = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
ByteBuffer buffer = ByteBuffer.wrap(urlDec.getBytes(Charset.forName("UTF-8")));
decoder.decode(buffer).toString();
Result
"https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles/"
I found the issue resolved by just replacing URLs with this symbol because there are other URLs with Unicode symbols that were invisible that couldnt be converted ect..
So I just compared the urls to the following regex if it returns false then I just bypass it. Hope this helps someone out:
boolean newURL = url.matches("^[a-zA-Z0-9_:;/.&|%!+=#?-]*$");

Sending special characters via URL Java

Say I have a string with "é" in it, and I want to send it over a URL to the next controller, the characters gets encoded to %C3%A9, and when it's received in the other controller, it gets decoded to "é".
My question is, how to encode the "é" over the URL so when it's received in the other controller it gets decoded to "é"? For now, I'm replacing them manually. I need a way to do it automatically and with any special character (éèà...)
Thank you.
Unfortunately there is no way to declare the encoding of URL data. The common encoding used to be ISO-8859-1 or Latin1, but nowadays, UTF-8 is often used in new developments. The reasons was that the servlet specification says that when the charset is not specified, ISO-8859-1 is implied, but HTML 4.0 recommends UTF-8 for URLs.
The problem is that the URL is composed of bytes and that the servlet container convert it to java characters before passing it to the application, so you must declare the used charset at the servlet container level. For compatibily reasons, the well known Tomcat before version 8.0 assumed by default a Latin1 charset for the URL. Since 8.0.0 the default now depends on the "strict servlet compliance" setting. It is ISO-8859-1 when true and UTF-8 when false
References:
What character set should I assume the encoded characters in a URL to be in?
Tomcat FAQ/CharacterEncoding
For your precise question, you have two ways to correctly process the é character:
leave your servlet container configuration unchanged and encode the URL in ISO-8859-1 (é will be %E9)
stick to UTF-8 in the URL but declare it to the servlet container

Java+HtmlUnit — problem with cyrillic urlencode

I am trying to send some HTTP POST parameters to some web server and one of parameters contains cyrillic characters. So the problem is that if I use this code:
wc.getPage(requestSettings);
requestSettings.setHttpMethod(HttpMethod.POST);
requestSettings.setRequestParameters(new ArrayList());
requestSettings.getRequestParameters().add(new NameValuePair("username", "Друже бобер"));
wc.getPage(requestSettings);
Server will recieve the next urlencoded parameter:
And this is wrong decoded string "Друже бобер".
So I think that HtmlUnit encode url in core with using ASCII not Unicode. How to disable url encoding or how to fix this bug? If I'll encode this string and set to NameValuePair so all percent characters will be encoded by HtmlUnit to.
I think you need to set the charset using the setCharset method.

response.sendredirect with url with foreign chars - how to encode?

I have a jsf app that has international users so form inputs can have non-western strings like kanjii and chinese - if I hit my url with ..?q=東日本大 the output on the page is correct and I see the q input in my form gets populated fine. But if I enter that same string into my form and submit, my app does a redirect back to itself after constructing the url with the populated parameters in the url (seems redundant but this is due to 3rd party integration) but the redirect is not encoding the string properly. I have
url = new String(url.getBytes("ISO-8859-1"), "UTF-8");
response.sendRedirect(url);
But url redirect ends up being q=???? I've played around with various encoding strings (switched around ISO and UTF-8 and just got a bunch of gibberish in the url) in the String constructor but none seem to work to where I get q=東日本大 Any ideas as to what I need to do to get the q=東日本大 populated in the redirect properly? Thanks.
How are you making your url? URIs can't directly have non-ASCII characters in; they have to be turned into bytes (using a particular encoding) and then %-encoded.
URLEncoder.encode should be given an encoding argument, to ensure this is the right encoding. Otherwise you get the default encoding, which is probably wrong and always to be avoided.
String q= "\u6771\u65e5\u672c\u5927"; // 東日本大
String url= "http://example.com/query?q="+URLEncoder.encode(q, "utf-8");
// http://example.com/query?q=%E6%9D%B1%E6%97%A5%E6%9C%AC%E5%A4%A7
response.sendRedirect(url);
This URI will display as the IRI ‘http://example.com/query?q=東日本大’ in the browser address bar.
Make sure you're serving your pages as UTF-8 (using Content-Type header/meta) and interpreting query string input as UTF-8 (server-specific; see this faq for Tomcat.)
Try
response.setContentType("text/html; charset=UTF-16");
response.setCharacterEncoding("utf-16");

How to send parameters with same encoding from javascript?

I have a javascript file that lots of people have embedded to their pages. Since I am hosting the file, I have control over that javascript file; I cannot control the way it is embedded because lots of people is using it already.
This javascript file sends GET requests to my servlets, and the parameters passed with the request are recorded to DB. For example, javascript sends a request to http://myserver.com/servlet?p1=123&p2=aString and then servlet records 123 and aString to DB somehow.
Before sending strings I use encodeURIComponent() to encode it. But what I figured out is every client sends the same string with different encodings depending on either their browser or the site they are visiting. As a result, same strings are represented with different characters when it reaches servlet (so they are different strings).
What I am trying to do is to convert the strings to one kind of encoding from javascript so when they reach the client same words are represented with same characters.
How is this possible?
PS. If there is a way to convert the encoding from Java it is also applicable.
Edit: To be more precise, I select some words from the page and send it to the server. That is where encoding causes problems.
Edit 2: I am NOT sending (and can't send) GET requests via XMLHttpRequest, because domains are different. I am using adding script tag to head method that #streetpc mentioned.
Edit 3: At the moment I am sanitizing the strings by replacing non-ASCII characters at javascript side, but I have a feeling that this is not the way to go:
function sanitize(word) {
/*
ğ : \u011f
ü : \u00fc
ş : \u015f
ö : \u00f6
ç : \u00e7
ı : \u0131
û : \u00fb
*/
return encodeURIComponent(
word.replace(/\u011f/g, '_g')
.replace(/\u00fc/g, '_u')
.replace(/\u00fb/g, '_u')
.replace(/\u015f/g, '_s')
.replace(/\u00f6/g, '_o')
.replace(/\u00e7/g, '_c')
.replace(/\u0131/g, '_i'));
}
what I figured out is every client sends the same string with different encodings
Whilst that would be normal for <form> submissions, it should not happen for XMLHttpRequest work. The encodeURIComponent function explicitly always writes URL-encoded UTF-8 bytes, regardless of the encoding of the page from which it was used. Of course persuading your servlet container to allow you to read those UTF-8 bytes without messing them up is another story, but that shouldn't depend on the client.
What might be a problem is if you are using raw non-ASCII characters inside your script file itself. In that case the interpretation of those characters will vary according to the charset the browser is using to load the script. This may be affected by:
any charset declared in the Content-Type: text/javascript;charset= header.
any charset attribute declared on the <script src="..." charset="..."> element.
the charset of the page that included the script.
(1) and (2) are not supported in all browsers. Normally you can rely on (3), but as a third-party script author that is out of your control. Therefore you should use only ASCII characters in your script. (Use \u1234 escapes to include non-ASCII characters in string literals in your script to get around this limitation.)
Do you specify the encoding of the JavaScript file in the HTTP headers? Like Content-type: text/javascript; charset=utf-8 with the .js file beign saved in UTF-8 of course. With Apache, you can configure
AddCharset utf-8 .js
Or you can make the hosted javascript file create another script tag with a charset='utf-8' parameter and add-it to the head element (like most bookmarklets do).
I think the javascript being interpreted as UTF-8 code should then get/manipulate UTF-8 strings.
Then, in your Java Servlet, you can specify the input encoding to use:
request.setCharacterEncoding("UTF-8");
Edit: check this page about Character Encoding in JavaScript, especially the part named "Setting the Character Encoding".

Categories