GWT: Encode `'` characters in string - java

I need to encode a string in such a way that the ASCII punctuation character ' is encoded too. Without GWT, it looks like
URLEncoder.encode(string, "UTF-8");
and it works exactly as I expect.
I have seen this question about a URLEncoder equivalent in GWT. But according to the documentation, the ASCII punctuation characters
- _ . ! ~ * ' ( )
will not be escaped by method com.google.gwt.http.client.URL.encode(string).
What is the right way to encode a string such that all ' will be encoded too?
Thank you in advance!

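If you are using ASCII characters only, this encodes every character as an HTML numeric character reference (HTML escaping rather than URL percent-encoding):
String string = "asc<>&-_()'";
StringBuilder encoded = new StringBuilder();
for (int i = 0; i < string.length(); i++) {
    char c = string.charAt(i);
    // Append the character as a hexadecimal reference, e.g. ' becomes &#x27;
    encoded.append("&#x").append(Integer.toHexString(c)).append(';');
}
The output is:
&#x61;&#x73;&#x63;&#x3c;&#x3e;&#x26;&#x2d;&#x5f;&#x28;&#x29;&#x27;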
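If the goal is URL percent-encoding rather than HTML escaping, one possible sketch is to post-process the output of whatever encoding call you use (the question mentions com.google.gwt.http.client.URL.encode) and escape the punctuation it left alone. The class and method names below are hypothetical, and the character set "!'()~" is an assumption: it is the part of the documented list that java.net.URLEncoder would escape, since URLEncoder keeps - _ . * as they are.

```java
// Hypothetical helper; not part of GWT or the JDK.
final class ExtraUrlEscaper {
    // Punctuation from the documented "not escaped" list that
    // java.net.URLEncoder would percent-encode.
    private static final String EXTRA = "!'()~";

    // Percent-encodes the EXTRA characters in an already URL-encoded string.
    static String escapeExtra(String encoded) {
        StringBuilder sb = new StringBuilder(encoded.length() + 8);
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (EXTRA.indexOf(c) >= 0) {
                // All EXTRA characters are ASCII, so two hex digits are enough.
                sb.append('%').append(Integer.toHexString(c).toUpperCase());
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeExtra("it's")); // prints it%27s
    }
}
```

Because a percent sign is never in the EXTRA set, running this over a string that is already percent-encoded does not double-encode the existing escapes.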

Related

SOLVED - Best way to detect characters not belonging to Windows-1252

I have a service that receives free text, such as name, surname, address, etc. I want to throw an error if any of the characters sent doesn't belong to the Windows-1252 character set, but I don't know how to do so properly. I was thinking about a regex, but I'm not sure that is the best option.
The regex would be the letters from cp1252 plus any other word character (\\w), so something like this:
String test = "ŠŒŽšœžŸÀÁÂà ÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ Þßàáâãäåæçèéêëìíîïð ñòóôõöøùúûüýþÿ asvsdf QWESA 1234 ÜüËëÄäÖö";
System.out.println(test.matches("[ŠŒŽšœžŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ\\w"
+ "\\d\\s\\.]+"));
I don't need to detect the encoding itself, only if it doesn't belong to the charset.
My suggestion as code:
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Windows1252Tester {
    public static void main(String[] args) {
        try {
            // Can we encode the incoming UTF-8 (per OP) as Windows-1252?
            Charset cs = Charset.forName("Windows-1252");
            CharsetEncoder enc = cs.newEncoder();
            System.out.printf("Can charset %s encode sequence %s? %b%n", cs, args[0], enc.canEncode(args[0]));
        }
        catch(Throwable t) {
            t.printStackTrace();
        }
    }
}
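Since the question asks for an error when a character is outside the charset, the same CharsetEncoder idea can be applied one char at a time to report where the problem is. This is only a sketch: the class and method names are made up for illustration, and it assumes the JVM provides the windows-1252 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical validator built on CharsetEncoder.canEncode(char).
final class Windows1252Validator {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Returns the index of the first char that Windows-1252 cannot encode, or -1 if all can.
    static int firstUnencodable(String text) {
        // Encoders are not thread-safe, so a fresh one is created per call.
        CharsetEncoder enc = CP1252.newEncoder();
        for (int i = 0; i < text.length(); i++) {
            if (!enc.canEncode(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    // Throws if any char of the text falls outside Windows-1252.
    static void requireWindows1252(String text) {
        int i = firstUnencodable(text);
        if (i >= 0) {
            throw new IllegalArgumentException(
                "Character at index " + i + " does not belong to Windows-1252: " + text.charAt(i));
        }
    }
}
```

Usage would be a single call such as Windows1252Validator.requireWindows1252(surname) at the service boundary.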
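Alternatively, you can check each character's Unicode code point by hand and reject anything outside the ranges 0x20 to 0x7E and 0xA0 to 0xFF, which cover the printable ASCII characters and the Latin-1 part of the Windows-1252 character set. Note that this is only an approximation: apart from the en dash (U+2013), which the code below allows explicitly, it rejects the other characters that Windows-1252 stores in bytes 0x80 to 0x9F (such as €, Š, Œ and Ž), because their Unicode code points lie above 0xFF. Something like this:
String input = "any text goes here";
for (int i = 0; i < input.length(); i++)
{
    char c = input.charAt(i);
    // Reject control characters (below 0x20 and 0x7F to 0x9F) and everything above 0xFF except the en dash
    if (c < 0x20 || (c > 0x7E && c < 0xA0) || (c > 0xFF && c != '\u2013'))
    {
        throw new IllegalArgumentException("Character at index " + i + " does not belong to the Windows 1252 character set: " + c);
    }
}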

Convert Unicode to UTF-8

My question may already have been answered on Stack Overflow, but I can't find it.
My problem is simple: I request data via an API, and the data returned contains Unicode escape sequences, for example:
"SpecialOffer":[{"title":"Offre Vente Priv\u00e9e 1 jour 2019 2020"}]
I need to convert the "\u00e9" to "é".
I can't do a "replaceAll", because I cannot know in advance all the characters that will appear.
I tried this:
byte[] utf8 = reponse.getBytes("UTF-8");
String string = new String(utf8, "UTF-8");
But the string still contains "\u00e9".
I also tried this:
byte[] utf8 = reponse.getBytes(StandardCharsets.UTF_8);
String string = new String(utf8, StandardCharsets.UTF_8);
I also tried this:
string = string.replace("\\\\", "\\");
byte[] utf8Bytes = null;
String convertedString = null;
utf8Bytes = string.getBytes("UTF8"); // or StandardCharsets.UTF_8, "UTF-8", "UTF_8"
convertedString = new String(utf8Bytes, "UTF8"); // or StandardCharsets.UTF_8, "UTF-8", "UTF_8"
System.out.println(convertedString);
return convertedString;
But it doesn't work either.
I tested other methods too, but I deleted everything that didn't work, so I can't show them here.
I am sure there is a very simple method, but I must not be searching the internet with the right vocabulary. Can you help me, please?
I wish you a very good day, and thank you very much in advance.
The String.getBytes method requires a valid Charset [1].
According to the Javadoc [2], the charsets guaranteed to be available on every Java platform are:
US-ASCII
ISO-8859-1
UTF-8
UTF-16BE
UTF-16LE
UTF-16
So use UTF-8 (or StandardCharsets.UTF_8) in the getBytes method.
[1] https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#getBytes-java.nio.charset.Charset-
[2] https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html
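As a side note, encoding a string to bytes and decoding those bytes with the same charset gives back an equal string, so this round trip cannot turn a textual escape sequence into the character it stands for. A small sketch to illustrate (the class name is hypothetical):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: a UTF-8 round trip leaves the text unchanged.
final class RoundTripDemo {
    // Encodes to UTF-8 bytes and decodes them again.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Here the escape is plain text: backslash, u, 0, 0, e, 9 are six separate chars.
        String escaped = "Priv\\u00e9e";
        // Here the compiler has already turned the escape into a single char.
        String real = "Priv\u00e9e";

        System.out.println(roundTrip(escaped).equals(escaped)); // prints true
        System.out.println(escaped.length() + " vs " + real.length()); // prints 11 vs 6
    }
}
```

The escape therefore has to be parsed explicitly, either by a JSON parser or by hand.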
You can use a small JSON library:
String jsonstring = "{\"SpecialOffer\":[{\"title\":\"Offre Vente Priv\\u00e9e 1 jour 2019 2020\"}]}";
JsonValue json = JsonParser.parse(jsonstring);
String value = json.asObject()
.first("SpecialOffer").asArray().get(0)
.asObject().first("title").asStringLiteral().stringValue();
System.out.println(" result: " + value);
or
String text = "Offre Vente Priv\\u00e9e 1 jour 2019 2020";
System.out.println(" result: " + JsonEscaper.unescape(text));
The problem I had not seen is that the API did not return the character "\u00e9" but the text "\\u00e9": a sequence of six characters (a backslash, a u and four hex digits), not a single Unicode character!
So I have to rebuild each character from its escape sequence, and then everything works fine:
static String unescape(String s) {
    int i = 0, len = s.length();
    StringBuilder sb = new StringBuilder(len);
    while (i < len) {
        char c = s.charAt(i++);
        if (c == '\\' && i < len) {
            c = s.charAt(i++);
            if (c == 'u' && i + 4 <= len) {
                // Assumes the next 4 chars are hex digits; parseInt throws NumberFormatException otherwise
                c = (char) Integer.parseInt(s.substring(i, i + 4), 16);
                i += 4;
            } // add other cases here as desired...
        } // fall through: \ escapes itself, quotes any character but u
        sb.append(c);
    }
    return sb.toString();
}
I found this solution here:
Java: How to create unicode from string "\u00C3" etc

replacing \n in a String

There are several answers to similar questions as mine, but I have tried several of them and they are not working. I must be doing something stupid.
I have
String newline = System.getProperty("line.separator");
String content = "Test\n another line\n";
if(content.contains("\\n")) {
content = content.replaceAll("(\\n)", newline);
System.out.print(content);
}
I also tried "\n" and "\\n" in the regex. The content remains unchanged using replaceAll.
Okay facts:
\r is a CR, U+000D
\n is a LF, U+000A
Those characters you can put in a String
String s = "line 1.\nline 2.\n";
String newline = System.getProperty("line.separator");
newline can be "\n" (1 char) or "\r\n" (2 chars) or still something else.
If instead your text contains a literal backslash followed by an n (for example, because it was read in that way), that is written in code as:
String nl = "\\n"; // Two chars, an escaped backslash and a `n`.
String nl = "\\" + 'n'; // Two chars, an escaped backslash and a `n`.
If you want to replace these two chars with a real newline:
s = s.replace("\\n", "\n");
s = s.replace("\\n", newline); // Platform dependent
Java regex adds one more layer: a regex escapes special characters with a backslash, and that backslash must itself be escaped in a string literal.
You do not need a regex replaceAll/replaceFirst here, but it would look like this:
s = s.replaceAll("\\\\n", "\n");
The pattern contains two backslashes: the regex escape for a single literal backslash.
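To make the difference concrete, here is a minimal sketch that puts the literal and the regex variants side by side. The class and method names are invented for illustration; Pattern.quote and Matcher.quoteReplacement are used so that neither the search text nor the replacement is interpreted as regex syntax.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of replacing the two chars backslash + n with a line break.
final class NewlineReplaceDemo {
    // Literal replacement: no regex involved.
    static String viaReplace(String s, String newline) {
        return s.replace("\\n", newline);
    }

    // Regex replacement: quote the pattern and the replacement so both are taken literally.
    static String viaReplaceAll(String s, String newline) {
        return s.replaceAll(Pattern.quote("\\n"), Matcher.quoteReplacement(newline));
    }

    public static void main(String[] args) {
        // Contains backslash + n twice, not line feeds.
        String content = "Test\\n another line\\n";
        System.out.print(viaReplace(content, System.lineSeparator()));
        System.out.print(viaReplaceAll(content, System.lineSeparator()));
        // The regex \n only matches a real line feed, so nothing changes here.
        System.out.println(content.replaceAll("\\n", "X").equals(content)); // prints true
    }
}
```

Both helper methods give the same result; the regex version only pays off when the search text really needs pattern matching.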
String newline = System.getProperty("line.separator");
String content = "Test\\n another line\\n";
if(content.contains("\\n")) {
content = content.replaceAll("\\\\n", newline);
System.out.print(content);
}
The two extra backslashes are escape characters: "\\\\n" in Java source is the regex \\n, which matches a literal backslash followed by n.
I also tried this and it works
content.replace("\\n", "\\r\\n")
but
content.replaceAll("\\n", "\\r\\n")
does not.
So in the end I used
while(content.contains("\\n")) {
content = content.replace("\\n", newline);
}
And this solves my problem; it's not elegant, but it works. (Since replace already substitutes every occurrence, the loop body runs at most once.)

Encode/decode hex to utf-8 string

I am working on a web application that accepts all UTF-8 characters, including Greek characters. The following are strings that I want to convert to hex.
These strings, in different languages, do not work with my current code:
ЫЙБПАРО Εγκυκλοπαίδεια éaös Größe Größe
These are the hex conversions produced by the JavaScript function below:
42b41941141f41042041e 3953b33ba3c53ba3bb3bf3c03b13af3b43b53b93b1 e961f673 4772c3192c2b6c3192c217865 4772f6df65
JavaScript function to convert the above strings to hex:
function encode(string) {
var str= "";
var length = string.length;
for (var i = 0; i < length; i++){
str+= string.charCodeAt(i).toString(16);
}
return str;
}
The conversion itself gives no error, but on the Java side I am unable to parse such a string. I used the following Java code to convert the hex back:
public String HexToString(String hex){
StringBuilder finalString = new StringBuilder();
StringBuilder tempString = new StringBuilder();
for( int i=0; i<hex.length()-1; i+=2 ){
String output = hex.substring(i, (i + 2));
int decimal = Integer.parseInt(output, 16);
finalString.append((char)decimal);
tempString.append(decimal);
}
return finalString.toString();
}
It fails while parsing the above hex strings, giving a parse exception.
Please suggest a solution.
JavaScript strings are sequences of 16-bit UTF-16 code units, so charCodeAt may return any number between 0 and 65535. When you encode these to hex you get strings of 1 to 4 characters, and if you simply concatenate them, there's no way for the other party to find out where one character ends and the next begins.
You can work around this by adding delimiters to your encoded string:
function encode(string) {
return string.split("").map(function(c) {
return c.charCodeAt(0).toString(16);
}).join('-');
}
alert(encode('größe Εγκυκλοπαίδεια 维'))
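On the Java side, a decoder for this delimited format could look roughly like the sketch below. The class name is hypothetical, and it assumes the '-' delimiter produced by the function above; an encode method is included only to show that the round trip is symmetric.

```java
// Hypothetical codec for the '-' delimited hex format.
final class DelimitedHexCodec {
    // Mirrors the JavaScript encode: one hex number per UTF-16 code unit, joined by '-'.
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append('-');
            }
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    // Splits on '-' and turns each hex number back into one char.
    static String decode(String hex) {
        if (hex.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (String unit : hex.split("-")) {
            // Throws NumberFormatException if a unit is not valid hex.
            sb.append((char) Integer.parseInt(unit, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("47-72-f6-df-65")); // the code units of "Größe"
    }
}
```

An alternative with the same effect would be to pad every code unit to exactly four hex digits, which removes the need for a delimiter.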

How to convert into cyrillic

Good day.
I got a string like this from the server:
\u041a\u0438\u0441\u0435\u043b\u0435\u0432 \u0410\u043d\u0434\u0440\u0435\u0439
I need to convert it into a Cyrillic cp-1251 string.
How do I do it? Thank you.
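If that is a literal sequence of characters that must be decoded, you'll need to start with something like this (assuming your input is in the string input). Note that the pattern matches only \uXXXX escapes and spaces, so any other characters in the input are dropped: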
StringBuffer decodedInput = new StringBuffer();
Matcher match = Pattern.compile("\\\\u([0-9a-fA-F]{4})| ").matcher(input);
while (match.find()) {
String character = match.group(1);
if (character == null)
decodedInput.append(match.group());
else
decodedInput.append((char)Integer.parseInt(character, 16));
}
At this point, you should have a Java string representation of your input in decodedInput.
If your system supports the cp-1251 charset, you can then convert that to cp-1251 with something like this:
Charset cp1251charset = Charset.forName("windows-1251"); // Java's canonical name for cp-1251
ByteBuffer output = cp1251charset.encode(decodedInput.toString());
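Putting both steps together, a self-contained sketch might look like this. The class and method names are hypothetical, it assumes the JVM provides the windows-1251 charset, and unlike the snippet above it copies every character that is not part of an escape through unchanged.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decode backslash-u escapes, then encode as cp-1251 bytes.
final class EscapeToCp1251 {
    // A backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each escape with the char it denotes and keeps all other text as is.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start()); // text between escapes
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length()); // trailing text
        return out.toString();
    }

    // Encodes the text as windows-1251 (cp-1251) bytes.
    static byte[] toCp1251(String text) {
        return text.getBytes(Charset.forName("windows-1251"));
    }

    public static void main(String[] args) {
        String decoded = unescape("\\u041a\\u0438\\u0441\\u0435\\u043b\\u0435\\u0432");
        System.out.println(toCp1251(decoded).length); // one byte per Cyrillic letter
    }
}
```

Characters that windows-1251 cannot represent are replaced by getBytes with the charset's default replacement byte, so validate the input first if that matters.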
