Unicode strings from Java to Javascript via JSON - java

In a Java servlet I'm doing:
protected void handleRequests(HttpServletRequest request, HttpServletResponse response) {
PrintWriter pw = response.getWriter();
/*...*/
Vector<String> buf = new Vector<>();
for(...) {
ret.add(">žd¿ [?²„·ÜðÈ ‘");
}
/*JSONArray*/ responseArray.put(responseArray.length(), buf);
/*...*/
pw.println(responseArray);
pw.close();
}
In a web page client javascript I'm doing a XMLHttpRequest and the reply is incorrect, looks like: >?d¿ [\u001a?²\u201e·ÜðÈ \u2018
(for the above >žd¿ [?²„·ÜðÈ ‘ input)
Then I tried on the servlet:
ret.add(URLEncoder.encode(">žd¿ [?²„·ÜðÈ ‘", "UTF-8"));
and I get:
%3E%C5%BEd%C2%BF%C2%A0%5B%7F%1A%3F%C2%B2%E2%80%9E%C2%B7%C3%9C%C3%B0%C3%88%C2%A0%E2%80%98
in javascript, then I apply:
unescape(reply.replace(/\+/g, ' ')) (the replace is needed because + signs are not converted to spaces)
which nets me:
>žd¿ [?²â??·Ã?ðÃ? â
What am I doing wrong?
(Some other questions tell me the servlet should send the data as UTF-8. But when do I encode in UTF-8: before placing the strings inside a JSON object (I use org.json), or after (calling .toString on the JSON response array and then converting to UTF-8 before PrintWriter.println)?)
P.S. This is not all my code. I've inherited it, and I'm lacking some of the theoretical background.
Edit:
Doing decodeURIComponent(reply).replace(/\+/g,' ') in JavaScript seems to do the trick. But I could not find the difference between URLEncoder.encode and decodeURIComponent. Is the +/space the only mismatch?

decodeURIComponent yields the expected result:
decodeURIComponent("%3E%C5%BEd%C2%BF%C2%A0%5B%7F%1A%3F%C2%B2%E2%80%9E%C2%B7%C3%9C%C3%B0%C3%88%C2%A0%E2%80%98");
">žd¿ [?²„·ÜðÈ ‘"

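On the question of whether +/space is the only mismatch: as far as I know, for decoding purposes it is. URLEncoder.encode produces application/x-www-form-urlencoded output, where a space becomes "+", while decodeURIComponent expects "%20" and leaves "+" alone. A minimal sketch of fixing this on the Java side (the class and method names below are made up for illustration, they are not from the question):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlEncodeDemo {
    // Hypothetical helper: form-encodes the text as UTF-8, then rewrites "+"
    // (the form-encoding of a space) as "%20" so that JavaScript's
    // decodeURIComponent can decode the result without extra replaces.
    // A literal '+' in the input is already encoded as "%2B" at that point,
    // so the replace only touches encoded spaces.
    static String encodeForDecodeURIComponent(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(URLEncoder.encode("\u017E d", "UTF-8".intern().isEmpty() ? null : "UTF-8".length() > 0 ? new String("\u017E d").isEmpty() ? "" : "\u017E d" : "", null) == null ? "" : "");
    }
}
```

That said, setting the response character encoding to UTF-8 before calling getWriter() would likely make the URL-encoding detour unnecessary in the first place.
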
Related

Encoding for unicode and & characters

I am trying to save the below string to my protobuf model:
STOXX®Europe 600 Food&BevNR ETF
But while printing the protomodel value it's displayed like:
STOXXÂ®Europe 600 Food&amp;BevNR ETF
I tried to encode the string to UTF-8 and also tried StringEscapeUtils.unescapeJava(str), but both failed. I'm getting this string by parsing the XML response from a server. Any ideas?
Ref: XML parser Skip invalid xml element with XmlStreamReader
Fixing the XML parsing is better than having to unescape everything afterwards. Here is a test case showing this:
public static void main(String[] args) throws Exception {
XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty("javax.xml.stream.isCoalescing", true);
ReaderInputStream ris = new ReaderInputStream(new StringReader("<tag>STOXX®Europe 600 Food&amp;BevNR ETF</tag>"));
XMLStreamReader reader = factory.createXMLStreamReader(ris, "UTF-8");
StringBuilder sb = new StringBuilder();
while (reader.hasNext()) {
reader.next();
if (reader.hasText())
sb.append(reader.getText());
}
System.out.println(sb);
}
Output:
STOXX®Europe 600 Food&BevNR ETF
Actually, I found a protobuf method that solves this issue:
ByteString.copyFrom(StringEscapeUtils.unescapeHtml3(string), "ISO-8859-1").toStringUtf8();
Documentation of ByteString
As the text comes from XML use:
s = StringEscapeUtils.unescapeXml(s);
This is much better than unescaping HTML, which has hundreds of named entities (&...;).
The two rubbish characters in place of the registered-trademark symbol (®) come from reading UTF-8 encoded text (which uses multiple bytes for special characters) as some single-byte encoding, maybe Latin-1.
This wrong conversion can often be repaired with another conversion, but it is best to read the text using UTF-8 in the first place.
// Hack, just patching. Assumes Latin-1 encoding
s = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
// Or maybe:
s = new String(s.getBytes(), StandardCharsets.UTF_8);
Better to inspect the reading code and look for a place where an optional encoding argument went missing: InputStreamReader, OutputStreamWriter, new String, getBytes.
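To make the mechanism concrete, here is a small sketch that reproduces the garbling and then applies the Latin-1 patch from above. The class and method names are invented for the illustration, and it assumes the wrong charset really was ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Simulates the bug: text is encoded as UTF-8 but decoded as Latin-1.
    static String misread(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses the bug. This only works because Latin-1 maps every byte to
    // exactly one char, so no information was lost by the wrong decode.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "STOXX\u00AEEurope";   // \u00AE is the ® sign
        String garbled = misread(original);
        // The single ® char has become two chars: \u00C2 followed by \u00AE.
        System.out.println(garbled.length() - original.length()); // prints 1
        System.out.println(repair(garbled).equals(original));     // prints true
    }
}
```
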
Your entire problem would be solved by using an XML reader too.

How to unescape html special characters in Java?

I have some text strings that I need to process and inside the strings there are HTML special characters. For example:
10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂
I would like to convert those characters to utf-8.
I used org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 but didn't have any luck. Is there an easy way to deal with this problem?
Apache commons-text library has the StringEscapeUtils class that has the unescapeHtml4() utility method.
String utf8Str = StringEscapeUtils.unescapeHtml4(htmlStr);
You may also need unescapeXml()
@Bohemian's code is correct and works for me; your un-encoded string is 10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂.
Now, I'm adding another answer instead of commenting on Bohemian's answer because there are two things that still need to be mentioned:
I copy-pasted your string into HTML and the browser can't render the characters properly, because the string is incorrectly encoded: for characters that need a surrogate pair, it encodes the high surrogate and the low surrogate separately instead of encoding the whole code point (it seems the original was a UTF-16 encoded string, maybe a Java String?).
You want the string to be re-encoded to UTF-8.
Once you have your String un-encoded by StringEscapeUtils.unescapeHtml(htmlStr) (which un-encodes your string successfully despite it being encoded incorrectly), it doesn't make much sense to talk about "string encodings", as Java strings are "unaware" of encodings (they use UTF-16 internally, though).
If you need a group of bytes containing a UTF-8 encoded "string", you need to get the "raw" bytes from a String encoded as UTF-8:
String javaStr = StringEscapeUtils.unescapeHtml(htmlStr);
byte[] rawUtf8String = javaStr.getBytes("UTF-8");
And do with such byte array whatever you need.
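As a rough illustration of the surrogate point, assume (this is my guess, not something visible in the question) that the entities were numeric references to the two UTF-16 surrogates, e.g. 55357 and 56877 for the crying-face emoji U+1F62D. Once both halves sit next to each other in a Java String they form one code point, and getBytes produces the 4-byte UTF-8 sequence. The class and helper names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

class SurrogateDemo {
    // Builds a String from two separately decoded UTF-16 code units,
    // which is roughly what unescaping the two entities results in.
    static String fromSurrogates(int high, int low) {
        return new String(new char[] {(char) high, (char) low});
    }

    // Formats bytes as upper-case hex for display.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String emoji = fromSurrogates(55357, 56877); // 0xD83D, 0xDE2D
        System.out.println(emoji.length());                         // prints 2 (two chars)
        System.out.println(emoji.codePointCount(0, emoji.length())); // prints 1 (one code point)
        System.out.println(hex(emoji.getBytes(StandardCharsets.UTF_8))); // prints F09F98AD
    }
}
```
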
Now if what you need is to write a UTF-8 encoded string to a File, instead of that byte array you need to specify the encoding when you create the proper java.io.Writer.
Try this code to un-encode your string (change the file path first) and then open the resulting file in any editor that supports UTF-8:
java.io.Writer approach (better):
public static void main(String[] args) throws IOException {
String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
String javaString = StringEscapeUtils.unescapeHtml(str);
try(Writer output = new OutputStreamWriter(
new FileOutputStream("/path/to/testing.txt"), "UTF-8")) {
output.write(javaString);
}
}
java.io.OutputStream approach (if you already have a "raw string"):
public static void main(String[] args) throws IOException {
String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
String javaString = StringEscapeUtils.unescapeHtml(str);
try(OutputStream output = new FileOutputStream("/path/to/testing.txt")) {
for (byte b : javaString.getBytes(Charset.forName("UTF-8"))) {
output.write(b);
}
}
}

Parsing Facebook signed_request using Java returns malformed JSON

I'm trying to parse Facebook signed_request inside Java Servlet's doPost. And I decode the signed request using commons-codec-1.3's Base64.
Here is the code which I used to do it inside servlet's doPost
String signedRequest = (String) req.getParameter("signed_request");
String payload = signedRequest.split("[.]", 2)[1];
payload = payload.replace("-", "+").replace("_", "/").trim();
String jsonString = new String(Base64.decodeBase64(payload.getBytes()));
When I System.out the jsonString it's malformed. Sometimes it is missing the closing } of the JSON, and sometimes it is missing "} at the end of the string.
How can I get the proper JSON response from Facebook?
Facebook is using Base64 for URLs and you are probably trying to decode the text using the standard Base64 algorithm.
Among other things, the URL variant doesn't require padding with "=".
You could add the required characters in code (padding, etc.).
You can also use commons-codec 1.5 (new Base64(true)), where they added support for this encoding.
Facebook is sending you "unpadded" Base64 values (the URL "standard"), and this is problematic for Java decoders that don't expect it. You can tell you have the problem when the Base64 encoded data that you want to decode has a length that is not a multiple of 4.
I used this function to fix the values:
public static String padBase64(String b64) {
String padding = "";
// If you are a Java developer, *this* is the critical bit: FB expects
// the Base64 decoder to add this padding for you (as the PHP one
// apparently does).
switch (b64.length() % 4) {
case 0:
break;
case 1:
padding = "===";
break;
case 2:
padding = "==";
break;
default:
padding = "=";
}
return b64 + padding;
}
I have never done this in Java so I don't have a full answer, but the fact that you are sometimes losing one and sometimes two characters from the end of the string suggests it may be an issue with Base64 padding. You might want to output the value of payload and see if when it ends with '=' then jsonString is missing '}' and when payload ends with '==' then jsonString is missing '"}'. If that seems to be the case then something is going wrong with the interpretation of the equals signs at the end of payload which are supposed to represent empty bits.
Edit: On further reflection I believe this is because Facebook is using Base64 URL encoding (which does not add = as pad chars) instead of regular Base64, whereas your decoding function is expecting regular Base64 with the trailing = chars.
I've upgraded to commons-codec-1.5 using code very similar to this and am not experiencing this issue. Have you confirmed that payload really is malformed by using an online decoder?
Hello in the year 2021.
The other answers are obsolete, because with Java 8 and newer you can decode the base64url scheme by using the new Base64.getUrlDecoder() (instead of getDecoder).
The base64url scheme is a URL- and filename-safe dialect of the main base64 scheme: it uses "-" instead of "+" and "_" instead of "/" (because the plus and slash characters have special meanings in URLs). The "=" padding at the end of the string (0 to 2 characters) is also usually left out, and the Java URL decoder accepts input without it.
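A quick way to convince yourself that the missing padding is not a problem for the Java 8+ URL decoder is a sketch like the following (the class and method names are mine, and the sample payload is just a made-up JSON snippet, not real Facebook data):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64UrlDemo {
    // Decodes base64url text (with or without trailing '=') into a UTF-8 string.
    static String decodeUrl(String base64Url) {
        byte[] raw = Base64.getUrlDecoder().decode(base64Url);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "eyJhIjoxfQ" is {"a":1} in base64url without the "==" padding.
        // Its length (10) is not a multiple of 4, which is exactly the
        // situation that trips up strict Base64 decoders.
        System.out.println(decodeUrl("eyJhIjoxfQ"));   // prints {"a":1}
        System.out.println(decodeUrl("eyJhIjoxfQ==")); // prints {"a":1}
    }
}
```
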
Here is how you can parse the Facebook signed_request parameter in Java into a Map object:
public static Map<String, String> parseSignedRequest(HttpServletRequest httpReq, String facebookSecret) throws ServletException {
String signedRequest = httpReq.getParameter("signed_request");
String splitArray[] = signedRequest.split("\\.", 2);
String sigBase64 = splitArray[0];
String payloadBase64 = splitArray[1];
String payload = new String(Base64.getUrlDecoder().decode(payloadBase64), java.nio.charset.StandardCharsets.UTF_8);
try {
Mac sha256_HMAC = Mac.getInstance("HmacSHA256");
SecretKeySpec secretKey = new SecretKeySpec(facebookSecret.getBytes(), "HmacSHA256");
sha256_HMAC.init(secretKey);
String sigExpected = Base64.getUrlEncoder().withoutPadding().encodeToString(sha256_HMAC.doFinal(payloadBase64.getBytes()));
if (!sigBase64.equals(sigExpected)) {
LOG.warn("sigBase64 = {}", sigBase64);
LOG.warn("sigExpected = {}", sigExpected);
throw new ServletException("Invalid sig = " + sigBase64);
}
} catch (IllegalStateException | InvalidKeyException | NoSuchAlgorithmException ex) {
throw new ServletException("parseSignedRequest", ex);
}
// use Jetty JSON parsing or some other library
return (Map<String, String>) JSON.parse(payload);
}
I have used the Jetty JSON parser:
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-util</artifactId>
<version>9.4.43.v20210629</version>
</dependency>
but there are more libraries available in Java for parsing JSON.

UTF-8 Encoding in java, retrieving data from website

I'm trying to get data from a website which is encoded in UTF-8 and insert it into a database (MySQL). The database is also encoded in UTF-8.
This is the method I use to download data from specific site.
public String download(String url) throws java.io.IOException {
java.io.InputStream s = null;
java.io.InputStreamReader r = null;
StringBuilder content = new StringBuilder();
try {
s = (java.io.InputStream)new URL(url).getContent();
r = new java.io.InputStreamReader(s, "UTF-8");
char[] buffer = new char[4*1024];
int n = 0;
while (n >= 0) {
n = r.read(buffer, 0, buffer.length);
if (n > 0) {
content.append(buffer, 0, n);
}
}
}
finally {
if (r != null) r.close();
if (s != null) s.close();
}
return content.toString();
}
If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
All my websites are encoded in UTF-8.
Please help.
If encoding is set to 'windows-1252' (r = new java.io.InputStreamReader(s, "windows-1252");) everything works fine and I get Côte d'Ivoire on my website, but in Java this title looks like 'C?´te d'Ivoire', which breaks other things, such as links. What does this mean?
I would consider using commons-io; it has a function (IOUtils.toString) that does what you want.
That is, replace your code with this:
public String download(String url) throws java.io.IOException {
java.io.InputStream s = null;
String content = null;
try {
s = (java.io.InputStream)new URL(url).getContent();
// IOUtils is org.apache.commons.io.IOUtils; it reads the whole stream as UTF-8
content = IOUtils.toString(s, "UTF-8");
}
finally {
if (s != null) s.close();
}
return content;
}
If that doesn't do it, check whether you can write the text to a file correctly, to rule out the possibility that your database isn't set up correctly.
Java
The problem seems to lie in the HttpServletResponse, if you have a servlet or JSP page. Make sure to set your HttpServletResponse encoding to UTF-8.
In a JSP page or in the doGet or doPost of a servlet, before any content is sent to the response, just do:
response.setCharacterEncoding("UTF-8");
PHP
In PHP, try using the utf8_encode function after retrieving the data from the database.
Is your database encoding set to UTF-8 for server, client, and connection, and have the tables been created with that encoding? Check 'show variables' and 'show create table <one-of-the-tables>'
If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
Thus, the encoding used during display is wrong. How are you displaying it? As per the comments, it's a PHP page? If so, then you need to take two things into account:
1. Write the text to the HTTP response output using the same encoding, thus UTF-8.
2. Set the content type's charset to UTF-8 so that the web browser knows which encoding to use to display the text.
As per the comments, you have apparently already done 2. That leaves 1: in PHP you need to install mbstring and set mbstring.http_output to UTF-8 as well. I have found this cheatsheet very useful.
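The same mismatch can be reproduced in plain Java, without a servlet or PHP. The sketch below (class and method names are made up for the example) writes the title through a Writer with a given charset and then decodes the bytes as UTF-8, the way a browser would after being told the page is UTF-8:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncodingDemo {
    // Writes text through a Writer that encodes with the given charset and
    // returns the raw bytes, standing in for the HTTP response body.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // What a client sees when it decodes the body as UTF-8.
    static String seenByUtf8Client(byte[] body) {
        return new String(body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String title = "C\u00F4te d'Ivoire";
        // Body written as Latin-1 but announced as UTF-8: the lone 0xF4 byte
        // is not valid UTF-8 and shows up as the replacement character.
        System.out.println(seenByUtf8Client(write(title, StandardCharsets.ISO_8859_1)));
        // Body written as UTF-8 and announced as UTF-8: round-trips cleanly.
        System.out.println(seenByUtf8Client(write(title, StandardCharsets.UTF_8)).equals(title)); // prints true
    }
}
```
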

Parameters charset conversion in struts2

I have a struts2 web application which accepts both POST and GET requests in many different charsets, does conversion of them into utf-8, displays the correct utf-8 characters on the screen and then writes them into utf-8 database.
I have tried at least 5 different methods for doing a simple lossless charset conversion from windows-1250 to utf-8 to start with, and none of them worked. Utf-8 being the "larger set", it should work without a problem (at least this is my understanding).
Can you propose how to do a charset conversion from windows-1250 to utf-8? And is it possible that struts2 is doing something weird with the params charset, which would explain why I can't seem to get it right?
This is my latest attempt:
String inputData = getSimpleParamValue("some_input_param_from_get");
Charset inputCharset = Charset.forName("windows-1250");
Charset utfCharset = Charset.forName("UTF-8");
CharsetDecoder decoder = inputCharset.newDecoder();
CharsetEncoder encoder = utfCharset.newEncoder();
String decodedData = "";
try {
ByteBuffer inputBytes = ByteBuffer.wrap(inputData.getBytes()); // I've tried putting UTF-8 here as well, with no luck
CharBuffer chars = decoder.decode(inputBytes);
ByteBuffer utfBytes = encoder.encode(chars);
decodedData = new String(utfBytes.array());
} catch (CharacterCodingException e) {
logger.error(e);
}
Any ideas on what to try to get this working?
Thank you and best regards,
Bozo
I'm not sure about your scenario. In Java, a String is Unicode; you only deal with charset conversion when you have to convert between a String and a binary representation.
In your example, when getSimpleParamValue("some_input_param_from_get") is called, inputData should already hold the "correct" String. The conversion from the stream of bytes (that travelled from the client browser to the web server) to a string should already have taken place (it is the responsibility of the web server plus the web layer of your application).
For this, I enforce UTF-8 for the web transmission by placing a filter in the web.xml (before Struts), for example:
public class CharsetFilter implements Filter {
public void destroy() {}
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
HttpServletRequest req = (HttpServletRequest) request;
HttpServletResponse res = (HttpServletResponse) response;
req.setCharacterEncoding("UTF-8");
chain.doFilter(req, res);
String contentType = res.getContentType();
if( contentType !=null && contentType.startsWith("text/html"))
res.setCharacterEncoding("UTF-8");
}
public void init(FilterConfig filterConfig) throws ServletException {
}
}
If you cannot do this, and your getSimpleParamValue() "errs" in the charset conversion (e.g. it assumed the byte stream was UTF-8 when it was actually windows-1250), you now have an "incorrect" string. You must then try to recover it by undoing and redoing the byte-to-string conversion, in which case you must know both the wrong and the correct charset and, worse, deal with the possibility of missing characters (if the bytes were interpreted as UTF-8, the decoder might have found illegal byte sequences).
If you have to deal with this in a Struts2 action, I'd say you are in trouble; you should deal with it explicitly before or after it (in the upper web layer, or in the database driver, or the file encoding, or wherever the bytes are converted).
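To illustrate both points, here is a minimal sketch with invented names. transcode shows that a "charset conversion" is really bytes to String to bytes, and recover shows the undo/redo step for a string that was decoded with the wrong charset. The recovery assumes the wrong charset was ISO-8859-1, which maps every byte to a char and therefore loses nothing; with UTF-8 as the wrong charset it may not be reversible:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeDemo {
    // Converts bytes in one charset to bytes in another by going through a String.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    // Undoes a decode done with the wrong charset and redoes it with the right one.
    static String recover(String wronglyDecoded, Charset wrong, Charset correct) {
        return new String(wronglyDecoded.getBytes(wrong), correct);
    }

    public static void main(String[] args) {
        Charset cp1250 = Charset.forName("windows-1250");
        // "\u010Daj" encoded in windows-1250; the first letter is the single byte 0xE8.
        byte[] input = {(byte) 0xE8, 'a', 'j'};

        byte[] utf8 = transcode(input, cp1250, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // prints 4: the first letter takes two bytes in UTF-8

        // The same bytes decoded as Latin-1 by mistake, then recovered.
        String wrong = new String(input, StandardCharsets.ISO_8859_1);
        String fixed = recover(wrong, StandardCharsets.ISO_8859_1, cp1250);
        System.out.println(fixed.equals("\u010Daj")); // prints true
    }
}
```
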
