I want to decode a string that has been encoded using the java.net.URLEncoder.encode() method.
I tried using the unescape() function in javascript, but a problem occurs for blank spaces because java.net.URLEncoder.encode() converts a blank space
to '+' but unescape() won't convert '+' to a blank space.
Try decodeURI("") or decodeURIComponent("") !-)
Using JavaScript's escape/unescape functions is almost always the wrong thing: they are incompatible with URL encoding and with every other standard encoding on the web. Non-ASCII characters and spaces are handled unexpectedly, and older browsers don't necessarily behave the same way.
As mentioned by roenving, the method you want is decodeURIComponent(). It is a newer addition that you won't find in IE 5.0, so if you need to support that browser (let's hope not, nowadays!) you'd have to implement the function yourself. And for non-ASCII characters that means you need to implement a UTF-8 decoder. Code is available if you need it.
decodeURI[Component] doesn't handle + as space either (at least on FF3, where I tested).
Simple workaround:
alert(decodeURIComponent('http://foo.com/bar+gah.php?r=%22a+b%22&d=o%e2%8c%98o'.replace(/\+/g, '%20')))
Indeed, unescape chokes on this URL: it only knows the non-standard %uXXXX (UTF-16) escapes like %u2318, not UTF-8 percent-encoding (see Percent-encoding).
Try
var decoded = decodeURIComponent(encoded.replace(/\+/g," "));
Related
com.google.gwt.json.client.JSONParser.parseStrict(jsonStr) throws a syntax error when the JSON string contains non-printable/unrecognized Unicode characters. Hence, I am trying to remove non-printable Unicode characters on the client side. Following How can I replace non-printable Unicode characters in Java?, I tried to implement that code on the client side, but Character.getType(codePoint) isn't available in GWT's client-side JRE emulation.
Is there any way to overcome this issue? Any other way to get the character type in client side? Any other suggestion on how to solve the main problem?
Many Thanks!
David.
By the way, I've tried the my_string.replaceAll("\\p{C}", "?") approach, and it worked on the server side but not on the client side.
You can add a native JS method and use a regex inside it to replace every non-printable non-ASCII character, like this:
private native String replaceUnprintableChars(String text, String replacement) /*-{
    return text.replace(/[^\u0020-\u007E]/g, replacement);
}-*/;
// Somewhere else....
String replacedText = replaceUnprintableChars(originalString, "?");
The regex shown will replace every non-printable or non-ASCII character with the replacement string (e.g. "?"). If you want to keep printable non-ASCII characters (Latin accented letters, for example), you can widen the range in the expression.
Of course, you can do this with Java regexes too:
text.replaceAll("[^\\u0020-\\u007E]", "?");
But I came up with the JS solution first, don't know why!
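For reference, here is a small self-contained check of the Java variant (plain JRE, outside GWT), showing what the character class actually does:

```java
public class StripUnprintable {
    public static void main(String[] args) {
        // \u00e9 (é) is printable but non-ASCII; \u0007 (BEL) is a control character
        String text = "caf\u00e9\u0007ok";
        // Replace everything outside the printable ASCII range U+0020..U+007E
        System.out.println(text.replaceAll("[^\\u0020-\\u007E]", "?")); // caf??ok
    }
}
```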
com.google.common.base.CharMatcher in Guava looks promising. I have not tested this, but the class is annotated @GwtCompatible.
There is a guide to using it in the StringsExplained article in the Guava docs.
I've replaced com.google.gwt.json.client.JSONParser.parseStrict(jsonStr) with com.google.gwt.json.client.JSONParser.parseLenient(jsonStr), and the lenient parser was able to handle those non-printable/unrecognized Unicode characters just like on the server side.
I feel comfortable using JSONParser.parseLenient since the jsonStr doesn't come from user input.
When I make web queries, for accented characters, I get special character encodings back as strings such as "\u00f3" , but I need to replace it with the actual character, like "ó" before making another query.
How would I find these cases without actually looking for each one, one by one?
It seems you're handling JSON formatted data.
Use any of the many freely available JSON libraries to handle this (and other parsing issues) for you instead of trying to do it manually.
The one from JSON.org is pretty widely used, but there are surely others that work just as well.
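If you are curious what the library is doing under the hood for plain \uXXXX escapes, here is a minimal stdlib-only sketch. It handles only the \uXXXX case; a real JSON parser also deals with \n, \", \\, surrogate pairs, and malformed input, which is why a library is the right answer:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescape {
    // Decode \uXXXX escape sequences into the characters they represent.
    static String unescape(String s) {
        Pattern p = Pattern.compile("\\\\u([0-9a-fA-F]{4})");
        Matcher m = p.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("Jos\\u00e9 habl\\u00f3")); // José habló
    }
}
```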
I am using the newer version of YUICompressor (2.4.7) to compress my JavaScript and CSS files. For a long time everything was apparently fine... until I realized that the special characters "í" and "Í" are not being converted successfully. Strangely, other special characters are converted as expected. Why are just "í" and "Í" failing? Because only these two characters break, I ruled out a charset conflict between the file system and the language. It looks like a bug. Could anyone help me with this problem?
See what happens when I convert files:
Converting CSS
From:
#import url("/láÍíàyout.css");
To:
#import url("/lá�?íàyout.css");
Converting JS
From:
var x = 'cícÍsúlúm irmãêîôûúàá';
To:
var x="c�c�?súlúm irmãêîôûúàá";
Hmm... when the problem involves only the letter i, the Turkey test comes to mind.
The upper-case form of i in Turkish is not I; it is I with a dot on it (İ). When you do string manipulation with toUpperCase() or the like, you must pay attention, or your program won't run correctly on Turkish operating systems.
Example:
"fail".toUpperCase().equals("FAIL")
This code tries to do a case-insensitive string comparison, but it fails on Turkish systems.
If you're on a Turkish system, try running it on a non-Turkish system and tell us whether the YUICompressor bug still appears.
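You can reproduce the Turkey test directly, without YUICompressor, by forcing the Turkish locale. Passing Locale.ROOT is the usual fix when the string is a program identifier rather than human-language text:

```java
import java.util.Locale;

public class TurkeyTest {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr", "TR");

        // In the Turkish locale, 'i' uppercases to dotted capital İ (U+0130)
        String tr = "fail".toUpperCase(turkish);
        System.out.println(tr);                 // FAİL
        System.out.println(tr.equals("FAIL"));  // false -- the Turkey test failure

        // Locale-insensitive uppercasing avoids the problem
        System.out.println("fail".toUpperCase(Locale.ROOT).equals("FAIL")); // true
    }
}
```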
Is your character set UTF-8? If not, do you specify it (either on the command line, or as an argument to InputStreamReader/OutputStreamWriter)?
If using as servlet, do you set encoding on both request and response?
I've integrated YUI Compressor with my application today (version 2.4.7) and it processes Unicode characters correctly, so you may be missing one of the steps above.
Is there any real way to represent a URL (which more than likely will also have a query string) as a filename in Java without obscuring the original URL completely?
My first approach was to simply escape invalid characters with arbitrary replacements (for example, replacing "/" with "_", etc).
The problem with replacing with underscores, as in that example, is that a URL such as "app/my_app" would become "app_my_app", making it impossible to recover the original URL unambiguously.
I have also tried encoding all the special characters, but again, seeing %3E, %20, etc. everywhere is really not readable.
Thank you for any suggestions.
Well, you should know what you want here, exactly. Keep in mind that the restrictions on file names vary between systems. On a Unix system you probably only need to escape the virgule somehow, whereas on Windows you need to take care of the colon and the question mark as well.
I guess the safest thing would be to encode anything that could potentially clash (everything non-alphanumeric would be a good candidate, although you might adapt this to the platform) with percent-encoding. It's still somewhat readable, and you're guaranteed to get the original URL back.
Why? URL-encoding is already defined in an RFC: there's not much point in reinventing it. Basically you must have an escape character such as %, otherwise you can't tell whether a character represents itself or an escape. E.g. in your example app_my_app could represent app/my/app. You therefore also need a double-escape convention so you can represent the escape character itself. It is not simple.
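Since the encoding is already defined, the JDK can do the work. A minimal round-trip sketch (assuming Java 10+ for the Charset overloads; note that URLEncoder leaves `*` unencoded, which is still illegal in Windows file names, so you might need one extra replacement there):

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UrlToFilename {
    // Percent-encode the URL so the mapping to a file name is reversible.
    static String toFilename(String url) {
        return URLEncoder.encode(url, StandardCharsets.UTF_8);
    }

    static String toUrl(String filename) {
        return URLDecoder.decode(filename, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String url = "app/my_app?q=a b";
        String file = toFilename(url);
        System.out.println(file);                     // app%2Fmy_app%3Fq%3Da+b
        System.out.println(toUrl(file).equals(url));  // true -- lossless round trip
    }
}
```

Underscores survive untouched (URLEncoder keeps `-`, `_`, `.`, and `*` as-is), so "app/my_app" and "app_my/app" map to distinct file names.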
When I read the xml through a URL's InputStream, and then cut out everything except the url, I get "http://cliveg.bu.edu/people/sganguly/player/%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3".
As you can see, there are a lot of "%20"s.
I want the url to be unescaped.
Is there any way to do this in Java, without using a third-party library?
This is not unescaped XML; this is URL-encoded text. It looks to me like you want to use the following on the URL strings.
URLDecoder.decode(url);
This will give you the correct text. The result of decoding the link you provided is this:
http://cliveg.bu.edu/people/sganguly/player/ Rang De Basanti - Tu Bin Bataye.mp3
The %20 is an escaped space character. To get the above I used the URLDecoder object.
Starting from Java 10, use
URLDecoder.decode(url, StandardCharsets.UTF_8).
For earlier versions (Java 5–9), use URLDecoder.decode(url, "UTF-8").
URLDecoder.decode(String s) has been deprecated since Java 5.
Regarding the chosen encoding:
Note: The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.
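Putting it together on the URL from the question (assuming Java 10+ for the Charset overload); be aware that URLDecoder implements application/x-www-form-urlencoded decoding, so it also turns '+' into a space:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeExample {
    public static void main(String[] args) {
        String url = "http://cliveg.bu.edu/people/sganguly/player/"
                + "%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3";
        // Each %20 becomes a literal space in the decoded string
        System.out.println(URLDecoder.decode(url, StandardCharsets.UTF_8));
        // http://cliveg.bu.edu/people/sganguly/player/ Rang De Basanti - Tu Bin Bataye.mp3
    }
}
```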
I'm having problems using this method when I have special characters like á, é, í, etc. My (probably wild) guess is widechars are not being encoded properly... well, at least I was expecting to see sequences like %uC2BF instead of %C2%BF.
Edited: My bad, this post explains the difference between URL encoding and JavaScript's escape sequences: URI encoding in UNICODE for apache httpclient 4
In my case the URL contained escaped HTML entities, so StringEscapeUtils.unescapeHtml4() from Apache Commons Lang did the trick.