How to encode punctuation and delimiter characters in GWT - java

I found that in GWT the method URL.encode() does not encode punctuation and delimiter characters. Is there a way to solve this? Any idea is appreciated.
I am also wondering why GWT's URL.encode() doesn't encode those characters. Thanks.

You can of course just look up the character code in an ASCII table: ' is hex 27
and then replace the character in the string with the corresponding escape sequence, e.g. var='test' becomes var=%27test%27
GWT does not encode it because it is a valid character in a URL, so I wonder why you would want to encode it.

URL.encode() simply defers to JavaScript's encodeURI.
If you want encodeURIComponent, use URL.encodePathSegment
(or use JsInterop)
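To see why the apostrophe survives either call, here is a minimal plain-Java sketch that mimics the documented "leave unescaped" character sets of JavaScript's encodeURI and encodeURIComponent. It is not GWT's implementation; the class PercentEncoding, the method encode, and the constants COMPONENT_SAFE and URI_SAFE are hypothetical names introduced only for illustration.

```java
import java.nio.charset.StandardCharsets;

class PercentEncoding {
    // Besides ASCII letters and digits, encodeURIComponent leaves these untouched.
    static final String COMPONENT_SAFE = "-_.!~*'()";
    // encodeURI additionally leaves the URI delimiters untouched.
    static final String URI_SAFE = COMPONENT_SAFE + ";/?:@&=+$,#";

    // Percent-encodes every UTF-8 byte that is not an ASCII letter, a digit,
    // or listed in 'safe' (assumption: 'safe' contains ASCII characters only).
    static String encode(String s, String safe) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xFF);
            boolean alnum = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alnum || safe.indexOf(c) >= 0) {
                out.append(c);
            } else {
                out.append('%').append(String.format("%02X", (int) c));
            }
        }
        return out.toString();
    }
}
```

With these sets, encode("var='test'", PercentEncoding.URI_SAFE) returns the input unchanged, COMPONENT_SAFE only turns = into %3D, and an empty safe set yields var%3D%27test%27. So if the apostrophe really has to be escaped, it needs an extra replacement step on top of either encoder.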

Related

How to get a character type in GWT client side?

com.google.gwt.json.client.JSONParser.parseStrict(jsonStr) throws a syntax error when the JSON string contains non-printable or unrecognized Unicode characters, so I am trying to remove non-printable Unicode characters on the client side. Following "How can I replace non-printable Unicode characters in Java?", I tried to implement that code on the client side, but Character.getType(codePoint) isn't client-compatible.
Is there any way to overcome this issue? Is there another way to get the character type on the client side? Any other suggestion on how to solve the main problem?
Many Thanks!
David.
By the way, I've tried to use my_string.replaceAll("\\p{C}", "?") code and it worked on server side but not on client side.
You can add a native JS method and use a regex inside it to replace every non-printable or non-ASCII character, like this:
private native String replaceUnprintableChars(String text, String replacement) /*-{
    return text.replace(/[^\u0020-\u007E]/g, replacement);
}-*/;
// Somewhere else....
String replacedText = replaceUnprintableChars(originalString, "?");
The regex shown will replace every non-printable or non-ASCII character with the replacement string (e.g. "?"). If you want to include non-ASCII printable characters (latin, for example) then you can tweak the expression to broaden the range.
Of course, you can do this with Java regexes too:
text.replaceAll("[^\\u0020-\\u007E]", "?");
But I came up with the JS solution first, don't know why!
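As a rough illustration of "broadening the range", the sketch below shows the ASCII-only expression next to a variant that also keeps the printable Latin-1 block (U+00A0 to U+00FF). It is plain Java checked on the JVM only; the class and method names are made up for this example, and the extra range is an assumption about which characters you want to keep.

```java
class PrintableFilter {
    // Keeps printable ASCII only (U+0020 to U+007E); everything else becomes "?".
    static String asciiOnly(String text) {
        return text.replaceAll("[^\\u0020-\\u007E]", "?");
    }

    // Also keeps the printable Latin-1 supplement (U+00A0 to U+00FF).
    static String asciiAndLatin1(String text) {
        return text.replaceAll("[^\\u0020-\\u007E\\u00A0-\\u00FF]", "?");
    }
}
```

For the input "café" followed by a BEL character (U+0007), asciiOnly gives "caf??" while asciiAndLatin1 gives "café?".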
com.google.common.base.CharMatcher in Guava looks promising. I have not tested this, but the class is annotated @GwtCompatible.
There is a guide to using it in the StringsExplained article in the Guava docs.
I replaced com.google.gwt.json.client.JSONParser.parseStrict(jsonStr) with com.google.gwt.json.client.JSONParser.parseLenient(jsonStr), and the parser was able to handle those non-printable or unrecognized Unicode characters, just like on the server side.
I feel comfortable using JSONParser.parseLenient since the jsonStr doesn't come from user input.

What kind of string uses prefix \x and how to read it

I have a string like this
"\x27\x18\xf6,\x03\x12\x8e\xfa\xec\x11\x0dHL"
When I put it in the browser console, it automatically becomes something else:
"\x27\x18\xf6,\x03\x12\x8e\xfa\xec\x11\x0dHL"
"'ö,úìHL"
If I call charAt(x) on this string, I get:
"\x27\x18\xf6,\x03\x12\x8e\xfa\xec\x11\x0dHL".charAt(0)
"'"
"\x27\x18\xf6,\x03\x12\x8e\xfa\xec\x11\x0dHL".charAt(1)
""
"\x27\x18\xf6,\x03\x12\x8e\xfa\xec\x11\x0dHL".charAt(2)
"ö"
which IS what I want.
Now I want to implement a Java program that reads the string the same way as in browser.
The problem is that Java does not recognize the way this string is encoded. Instead, it treats it as a normal string:
"\\x27\\x18\\xf6,\\x03\\x12\\x8e\\xfa\\xec\\x11\\x0dHL".charAt(0) == '\\'
"\\x27\\x18\\xf6,\\x03\\x12\\x8e\\xfa\\xec\\x11\\x0dHL".charAt(1) == 'x'
"\\x27\\x18\\xf6,\\x03\\x12\\x8e\\xfa\\xec\\x11\\x0dHL".charAt(2) == '2'
What kind of encoding is this? Which encoding uses the prefix \x?
Is there a way to read it properly (get the same result as in the browser)?
update: I found a solution -> i guess it is not the best, but it works for me:
StringEscapeUtils.unescapeJava("\\x27\\x18\\xf6,\\x03\\x12\\x8e\\xfa\\xec\\x11\\x0dHL".replace("\\x", "\\u00"))
thank you all for your replies :)
especially Ricardo Cacheira
Thank you
\x03 is a hexadecimal escape: \x followed by two hex digits denotes the character with that code (for values below 0x80, its ASCII code),
so "\x30\x31" is the same as "01".
See http://www.asciitable.com
Another thing: when you copy your string without its quotation marks and paste it into a string literal, your IDE converts each \ to \\.
Java strings use Unicode escapes instead, so "\x30\x31" is written in Java as "\u0030\u0031".
You can't use the escape sequences \u000a and \u000d in a Java string literal, because Unicode escapes are translated before the compiler parses the literal and would become real line breaks; use \n and \r respectively instead.
So "\u0027\u0018\u00f6,\u0003\u0012\u008e\u00fa\u00ec\u0011\rHL" is the Java equivalent of "\x27\x18\xf6,\x03\x12\x8e\xfa\xec\x11\x0dHL".
apache commons provides a helper for this:
StringEscapeUtils.unescapeJava(...)
Unescapes any Java literals found in the String. For example, it will turn a sequence of '\' and 'n' into a newline character, unless the '\' is preceded by another '\'.
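If you would rather not pull in Commons Lang, or the input arrives at runtime with literal backslash-x sequences in it, the same decoding can be sketched with java.util.regex alone. This is a minimal sketch under the assumption that every escape has exactly two hex digits; the class HexEscapes and its decode method are hypothetical names, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HexEscapes {
    // Matches a literal backslash, 'x', and exactly two hex digits.
    private static final Pattern HEX_ESCAPE = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");

    // Replaces each \xHH sequence with the char whose code is HH;
    // all other text is copied through unchanged.
    static String decode(String input) {
        Matcher m = HEX_ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

For example, HexEscapes.decode("\\x30\\x31") returns "01", and decoding the string from the question gives a 13-character string whose first character is the apostrophe.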

Escaped HTML won't unescape (now: unescaped HTML won't escape back)

So I'm currently using the commons lang apache library.
When I tried unescaping this string: &#128512;
This returns the same string: &#128512;
String characters = "&#128512;";
StringEscapeUtils.unescapeHtml(characters);
Output: &#128512;
But when I tried unescaping a string with fewer digits, it works:
String characters = "&#12851;";
StringEscapeUtils.unescapeHtml(characters);
Output: ㈳
Any ideas? When I tried unescaping the string "&#128512;" with an online unescaping utility, it worked, so maybe it's a bug in the Apache Commons Lang library? Or can anyone recommend another library?
Thanks.
UPDATES:
I'm now able to unescape the string successfully. The problem now is that when I try to escape the result of that unescape, I don't get the original string (&#128512;) back.
unescapeHtml() leaves &#128512; untouched because, as the documentation says, it only unescapes HTML 4.0 entities, which are limited to 65,536 characters. Unfortunately, 128,512 is far beyond that limit.
Have you tried using unescapeXml()?
XML supports character references up to 1,114,111 (0x10FFFF).
This is the Unicode character U+1F600 (decimal 128512), GRINNING FACE.
The string you mention is the HTML escape of U+1F600. If you unescape it using Apache Commons Lang, it should give you the smiley.
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
Regarding your update that it's not converting back to &#128512;:
You can also represent a character using a Numeric Character Reference, of the form &#dddd;, where dddd is the decimal value representing the character's Unicode scalar value. You can alternatively use a hexadecimal representation &#xhhhh;, where hhhh is the hexadecimal value equivalent to the decimal value.
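To make the round trip concrete, here is a small sketch that handles decimal numeric character references by code point rather than by char, so supplementary characters such as U+1F600 survive in both directions. It uses only the JDK (Java 8 or later for String.codePoints()); the class NumericCharRefs and both method names are hypothetical, and named entities and hexadecimal references are deliberately left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumericCharRefs {
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d+);");

    // Turns each &#dddd; into the corresponding code point. appendCodePoint
    // emits a surrogate pair when the value is above U+FFFF, and throws
    // IllegalArgumentException if the value is not a valid code point.
    static String unescape(String input) {
        Matcher m = DECIMAL_REF.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.appendCodePoint(Integer.parseInt(m.group(1)));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    // Writes every non-ASCII code point back as a decimal reference.
    // Iterating code points (not chars) keeps a surrogate pair together.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }
}
```

Unescaping "&#128512;" this way yields a String of length 2 (the surrogate pair \uD83D\uDE00), and escapeNonAscii on that result gives "&#128512;" again. An escaper that walks the string char by char would instead see two separate surrogates, which is one plausible reason an escape call does not reproduce the original reference.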
Well - the solution is pretty easy:
use org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 instead! (unless you're using Java <1.5, which you probably won't)
String characters = "&#128512;";
StringEscapeUtils.unescapeHtml4(characters);
I think the problem is that there is no Unicode character "&#128512;",
so the method simply returns this string.
The documentation of the function only says:
Returns: a new unescaped String, null if null string input
If it's an HTML-specific question, you can just use JavaScript for this purpose.
You can do
escape("&#128512;") which gives you %26%23128512%3B
unescape("%26%23128512%3B") which gives you back &#128512;

Java String#contains() using String#matches() with escape character

I need a simple way to implement the contains function using matches. I believe this is my starting point:
xxx.matches("'.*yyy.*'");
But I need to make it a universal method and pre-process whatever I search for so that it is accepted by matches! This must be done using only the escape character '\'.
Imagine a string SEARCH_FOR that can contain some special characters that must be "regex escaped"...
String SEARCH_FOR = "*.\\";
xxx.matches("'.*" + SEARCH_FOR + ".*'");
Are there any catches? Special situations? Any other special characters that should be taken into account?
Are you looking for Pattern.quote(String) ?
This escapes special characters for you.
EDIT:
After reading the comments, I really hope you try Pattern.quote(yourString.toLowerCase()) as it sounds like you've been using Pattern.quote(yourString).toLowerCase(). If DataNucleus is applying the regex then there should be no problems with using the \Q and \E escape sequence.
Since you have really asked for it, ".\\".replaceAll("(\\.|\\$|\\+|\\*|\\\\)", "\\\\$1") outputs \.\\
This will escape .'s, $'s, +'s, *'s and \'s. Note that the security of this is now entirely up to you. If you don't escape something you needed to, or you escape it incorrectly, you will either allow people to use regex inside the search term when you weren't expecting it, or it won't return the results you were expecting.
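For the plain-JVM case, the Pattern.quote approach can be sketched as follows. The class and method names are illustrative only, and the (?s) flag is an addition not mentioned above: without it, "." does not match line terminators, which is one of the "catches" when emulating contains with matches.

```java
import java.util.regex.Pattern;

class ContainsViaMatches {
    // Emulates haystack.contains(needle) with String.matches.
    // Pattern.quote wraps the needle in \Q...\E so it is matched literally;
    // (?s) lets '.' also match line terminators in the surrounding text.
    static boolean containsLiteral(String haystack, String needle) {
        return haystack.matches("(?s).*" + Pattern.quote(needle) + ".*");
    }
}
```

With this helper, containsLiteral("a*.\\b", "*.\\") is true and containsLiteral("abc", "*.\\") is false, with no hand-written escaping of the search term.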

Input Sanitizing to not break JSON syntax

So, in a nutshell I'm trying to create a regex that I can use in a java program that is about to submit a JSON object to my php server.
myString.replaceAll(myRegexString,"");
My problem is that I am absolutely no good with regex, and on top of that I need to escape the characters properly because the regex is stored in a string, and then also escape the characters properly inside the regex itself. Good lordy.
What I came up with was this:
String myRegexString = "[\"',{}[]:;]"
The first backslash was to escape outer quotes to get a " in there. And then it struck me that {} and [] are also regex commands. Would I escape those as well? Like:
String myRegexString = "[\"',\{\}\[\]:;]"
Thanks in advance. In case it wasn't clear from the examples above, the only characters I really care about at the moment are:
" { } [ ] , and also ; : ' for general SQL injection protection.
UPDATE:
This is the final regex:
[\\Q\"',{}[\]:;\\E] for anyone else curious. Thanks Amit!
Why don't you use an actual JSON encoding API/framework? What you're doing is not sanitizing. What you're doing is corrupting the data. If my name is O'Reilly, I want it to be spelled O'Reilly, not OReilly. If I send a message containing [ or {, I want these to be in the messages. Use a framework or API that escapes those characters when needed rather than removing them blindly.
Googling for JSON Java will lead you to many APIs and frameworks.
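To show what "escaping rather than removing" means, here is a hand-rolled sketch of JSON string quoting that follows the escaping rules of the JSON grammar (quote, backslash, and control characters below U+0020). It is meant as an illustration only; a real JSON library remains the better choice, and the class JsonStrings and its quote method are hypothetical names.

```java
class JsonStrings {
    // Wraps s in double quotes and escapes only what JSON requires:
    // '"', '\', and control characters below U+0020. Characters such as
    // ' { } [ ] : ; , are legal inside a JSON string and are kept as they are.
    static String quote(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }
}
```

A name like O'Reilly or a message containing [ or { passes through unchanged inside the quotes, so nothing the user typed is lost. Protection against SQL injection is a separate concern on the server side and is not addressed by this escaping.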
Try something like
String myRegexString = "[\\Q\"',{}[]:;\\E]";
The characters between \Q and \E are now treated as normal characters.
