Reading unicode characters from properties file in Java

Please help me read the Unicode escape sequences as-is from a properties file in Java. For example, if I pass the key "Account.label.register" it should return "\u5BC4\u5B58\u5668", not its character representation "寄存器". Here is my sample properties file:
file_ch.properties
Account.label.register = \u5BC4\u5B58\u5668
Account.label.login = \u767B\u5F55
Account.label.username = \u7528\u6237\u540D
Account.label.password = \u5BC6\u7801
Thank you.
I am reading the properties file using the following Java code:
@Override
public ResourceBundle getTexts(String bundleName) {
    ResourceBundle myResources = null;
    try {
        myResources = ResourceBundle.getBundle(bundleName, getLocale());
    } catch (Exception e) {
        myResources = ResourceBundle.getBundle(getDefaultBundleKey(), getLocale());
    }
    return myResources;
}
This approach works fine and I am getting the Chinese characters. But for some of the Ajax requests in my application I need to pass the Chinese text in an X-JSON header. Sample code is given below:
HashMap<String, List<String>> map = new HashMap<String, List<String>>();
List<String> errors = new ArrayList<String>();
errors.add(str); /*ex: str = "无效的代码" , value taken from properties file through resource bundle*/
map.put("ERROR", errors);
JSONObject json = JSONObject.fromObject(map);
response.setCharacterEncoding("UTF-8");
response.setHeader("X-JSON", json.toString());
response.setStatus(500);
When I pass English text, for example str = "Invalid Code", the X-JSON header carries the information as-is. But when str = "无效的代码" (Chinese or any other non-Latin text), the X-JSON header comes through empty. Here is the response I get for the English case:
response :
connection:close
Content-Encoding:gzip
Content-Type:text/html;charset=UTF-8
Date:Wed, 08 Jun 2016 10:17:43 GMT
Server:Apache-Coyote/1.1
Transfer-Encoding:chunked
Vary:Accept-Encoding
X-JSON:{"ERROR":["Invalid Code"]}
However, if the error contains Chinese text, for example "无效的代码":
response :
connection:close
Content-Encoding:gzip
Content-Type:text/html;charset=UTF-8
Date:Wed, 08 Jun 2016 10:17:43 GMT
Server:Apache-Coyote/1.1
Transfer-Encoding:chunked
Vary:Accept-Encoding
X-JSON:{"ERROR":[" "]} /* expected response: X-JSON:{"ERROR":["无效的代码"]} */
Since the Chinese text comes through empty, I thought of sending the Unicode escapes through the X-JSON header instead, like this:
{"ERROR":["\u65E0\u6548\u7684\u4EE3\u7801"]}
After that I want to parse the Unicode escapes in JavaScript after evaluating the X-JSON header, like below:
var json;
try {
    json = xhr.getResponseHeader('X-Json');
} catch (e) {
    alert(e);
}
if (json) {
    var data = eval('(' + json + ')');
    decodeMsg(data);
}

function decodeMsg(message) {
    var mssg = message;
    var r = /\\u([\d\w]{4})/gi;
    mssg = mssg.replace(r, function (match, grp) {
        return String.fromCharCode(parseInt(grp, 16));
    });
    mssg = unescape(mssg);
    return mssg;
}
Please give suggestions. Thank you.

Update of answer:
The original encoding of .properties files was Latin-1 (ISO-8859-1), which only covers characters like éö.
That required \u-escaping for the full Unicode range of characters.
However, newer Java versions (Java 9 and later) try UTF-8 first, so you can now keep the .properties file in UTF-8, which is a tremendous improvement.
Original answer (.properties in ISO-8859-1, as in early Java versions):
The error is that HTTP header lines are in ISO-8859-1, basic Latin-1.
The usual solution there is %XX encoding of the UTF-8 bytes.
However, in the case of JSON you are better served simply doing what you intended:
send \u-escaped Unicode, using \uXXXX. Since not only Java but also JavaScript/JSON understands this convention, you only need this escaping in Java on the server.
static String uescape(String s) {
    StringBuilder sb = new StringBuilder(s.length() * 6);
    for (int i = 0; i < s.length(); ++i) {
        char ch = s.charAt(i);
        if (ch < 128) {
            sb.append(ch);
        } else {
            sb.append(String.format("\\u%04X", (int) ch));
        }
    }
    return sb.toString();
}
errors.add(uescape(str));
This writes every non-ASCII character (code point >= 128) as a zero-padded 4-digit hex escape, which is exactly the format you need.
Or use Apache Commons' StringEscapeUtils.escapeJava, which also escapes quotes, \n and the like; that is much safer.
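For illustration, a minimal sketch assuming Commons Lang 3 is on the classpath (in Commons Lang 2 the class is org.apache.commons.lang.StringEscapeUtils; newer code would use Commons Text instead):
import org.apache.commons.lang3.StringEscapeUtils;

// escapeJava replaces every character above 0x7F with a \\uXXXX escape and
// also escapes quotes, backslashes and control characters.
String escaped = StringEscapeUtils.escapeJava(str);
errors.add(escaped);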

Escape the backslashes in your properties file by doubling them:
Account.label.register = \\u5BC4\\u5B58\\u5668
Account.label.login = \\u767B\\u5F55
Account.label.username = \\u7528\\u6237\\u540D
Account.label.password = \\u5BC6\\u7801
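A quick check of what the bundle then returns (a sketch, assuming the base bundle name is "file_ch"):
import java.util.ResourceBundle;

// With the backslashes doubled, the properties loader no longer interprets the
// escapes, so the literal escape text is returned instead of the Chinese characters.
ResourceBundle bundle = ResourceBundle.getBundle("file_ch");
System.out.println(bundle.getString("Account.label.register")); // prints \u5BC4\u5B58\u5668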

Related

Why iterate over the parts in a multipart email in javamail?

I was looking at the JavaMail FAQs, specifically at this snippet which is supposed to extract the body of the email:
private boolean textIsHtml = false;

/**
 * Return the primary text content of the message.
 */
private String getText(Part p) throws MessagingException, IOException {
    if (p.isMimeType("text/*")) {
        String s = (String) p.getContent();
        textIsHtml = p.isMimeType("text/html");
        return s;
    }

    if (p.isMimeType("multipart/alternative")) {
        // prefer html text over plain text
        Multipart mp = (Multipart) p.getContent();
        String text = null;
        for (int i = 0; i < mp.getCount(); i++) {
            Part bp = mp.getBodyPart(i);
            if (bp.isMimeType("text/plain")) {
                if (text == null)
                    text = getText(bp);
                continue;
            } else if (bp.isMimeType("text/html")) {
                String s = getText(bp);
                if (s != null)
                    return s;
            } else {
                return getText(bp);
            }
        }
        return text;
    } else if (p.isMimeType("multipart/*")) {
        Multipart mp = (Multipart) p.getContent();
        for (int i = 0; i < mp.getCount(); i++) {
            String s = getText(mp.getBodyPart(i));
            if (s != null)
                return s;
        }
    }
    return null;
}
Now the code can be refactored to the following version, which is basically fewer lines of code:
private static String getText(Part message) {
    String text = null;
    try {
        if (message.isMimeType("text/*")) {
            text = (String) message.getContent();
        }
        if (message.isMimeType("multipart/alternative") || message.isMimeType("multipart/*")) {
            Multipart multiPart = (Multipart) message.getContent();
            Part bodyPart = multiPart.getBodyPart(multiPart.getCount() - 1);
            text = getText(bodyPart);
        }
    } catch (Exception e) {
        logger.error(e.getMessage());
    }
    return text;
}
My question is: why does the old code loop through the parts for both multipart/alternative and multipart/* messages? Am I missing something here?
Update:
I just saw Jon's comment, and I have a further question: is there any scenario where my version of the code will break?
Basically, there are many multipart subtypes and they all need to be handled differently:
Mixed Subtype
The "mixed" subtype of "multipart" is intended for use when the body
parts are independent and need to be bundled in a particular order.
Any "multipart" subtypes that an implementation does not recognize
must be treated as being of subtype "mixed".
Alternative Subtype
The "multipart/alternative" type is syntactically identical to
"multipart/mixed", but the semantics are different. In particular,
each of the body parts is an "alternative" version of the same
information.
Systems should recognize that the content of the various parts are interchangeable. Systems should choose the "best" type based on the local environment and references, in some cases even through user interaction. As with "multipart/mixed", the order of body parts is significant. In this case, the alternatives appear in an order of increasing faithfulness to the original content.
In general, the best choice is the LAST part of a type supported by the recipient system's local environment.
"Multipart/alternative" may be used, for example, to send a message
in a fancy text format in such a way that it can easily be displayed
anywhere:
From: Nathaniel Borenstein <nsb@bellcore.com>
To: Ned Freed <ned@innosoft.com>
Date: Mon, 22 Mar 1993 09:41:09 -0800 (PST)
Subject: Formatted text mail
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=boundary42
--boundary42
Content-Type: text/plain; charset=us-ascii
... plain text version of message goes here ...
--boundary42
Content-Type: text/enriched
... RFC 1896 text/enriched version of same message
goes here ...
--boundary42
Content-Type: application/x-whatever
... fanciest version of same message goes here ...
--boundary42--
In this example, users whose mail systems understood the
"application/x-whatever" format would see only the fancy version,
while other users would see only the enriched or plain text version,
depending on the capabilities of their system.
Your code won't "work" (whatever that means to you) with a multipart/mixed message where the last attachment is of type text/*. Yes, attachments can be of type text/*.
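A minimal sketch of that failure mode (illustrative only, not from the original posts): with a multipart/mixed message whose last body part is a text/* attachment, the simplified method returns the attachment's content instead of the message body. Checking the disposition is one way to detect this case:
import javax.mail.MessagingException;
import javax.mail.Multipart;
import javax.mail.Part;

// Hypothetical helper: true when the last body part is explicitly marked as an
// attachment, i.e. the case where the simplified getText() would return
// attachment content instead of the primary text of the message.
static boolean lastPartIsAttachment(Multipart multiPart) throws MessagingException {
    Part last = multiPart.getBodyPart(multiPart.getCount() - 1);
    return Part.ATTACHMENT.equalsIgnoreCase(last.getDisposition());
}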

convert charset X to unicode in Java

How do you convert a specific charset to unicode in Java?
Charsets have been discussed quite a lot here, but I think this one hasn't been covered yet.
I have a hex string that meets the criterion length % 4 == 0 (e.g. \ud3faef8e). Usually I just display this in an HTML container, adding &#x to the front and ; to the back of each hex quadruple.
But in this case the following (non-Java) procedure led to the correct output:
paste hex string into Hex-Editor and save the file to test.txt (utf-8)
open the file with Notepad++
change the encoding to Simplified Chinese (GB2312)
Now I'm trying to do the same in Java.
// having hex, convert to ascii
String ascii = "";
for (int cnt = 0; cnt <= unicode.length() - 2; cnt += 2) {
    String tmp = unicode.substring(cnt, cnt + 2);
    int decimal = Integer.parseInt(tmp, 16);
    ascii += (char) decimal;
}
// writing ascii to file at this point leads to the same result as in step 2 before
try {
    // get the bytes
    byte[] utf8 = ascii.getBytes("UTF-8"); // == UTF8
    // convert to gb2312
    String converted = new String(utf8, "GB2312"); // == EUC_CN
    // write to file (writer with declared UTF-8)
    writeToFile(converted, 20 + cntu);
    cntu++;
} catch (Exception e) {
    System.err.println(e.getMessage());
}
The output matches the expected output, except that the following character randomly shows up: �. Why does it appear, and how can I get rid of it?
In the end, what I'd like to get is the converted Unicode again, so that I can display it with my original approach (폴), but I haven't figured out a way to get the hex values back (they don't match the criterion length % 4 == 0). How do I get the hex values of the characters?
update1
To be more precise regarding the input: I assumed it was Unicode because the string starts with \u, which would be sufficient for my usual approach, but not in the case I am describing above.
update2
the writeToFile method
FileOutputStream fos = new FileOutputStream("test" + id + ".txt");
Writer out = new OutputStreamWriter(fos, "UTF8");
out.write(str);
out.close();
I tried with GB2312 as well, but there is no change; I still get the ? in between the correct characters.
update3
The expected output for \ud3f6ef8e is 遇飵; you get to it by following steps 1 to 3 above (HxD is an example of a hex editor).
There was no indication that I should delete my question, so I'm writing my final comment as the answer.
I was misinterpreting the incoming hex digits. They were in a specific charset and not Unicode, so they represented the hex values of characters in that charset. What I'm doing now is new String(byteArray, "CharsetName"); and then (int) s.charAt(i) to get the Unicode value and write it to HTML. Thanks for your ideas and hints.
For more details see this answer: https://stackoverflow.com/a/4049781/1338732, and this question: How to convert UTF-8 to unicode in Java?
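A minimal sketch of that approach (the class and method names are illustrative, and the bytes are assumed to be GB2312):
import java.nio.charset.Charset;

public class HexToEntities {

    // Decode a hex string of bytes in the given charset and emit one
    // numeric HTML entity (&#x....;) per resulting character.
    public static String toHtmlEntities(String hex, String charsetName) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        String decoded = new String(bytes, Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            sb.append("&#x").append(Integer.toHexString(decoded.charAt(i))).append(';');
        }
        return sb.toString();
    }
}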

URL encoding arbitrary characters

I need to submit application/x-www-form-urlencoded data to a web server.
The server expects the data to be encoded using ISO-8859-1.
Unfortunately URLEncoder.encode(string, "ISO-8859-1") does not always work.
Any character that is not part of ISO-8859-1 gets encoded as %3F (which is '?').
Firefox handles those chars in some other way that works on the server side.
\uFEFF (Zero Width No-Break Space) gets encoded to %26%2365279%3B which is exactly what I need.
Could anyone please tell me how to mimic this behaviour/what FF does?
To answer my own question:
FF converts the unmappable chars to decimal HTML entities and encodes those using the charset.
\uFEFF -> & #65279; (ignore the space in between) -> %26%2365279%3B
( %26 = & | %23 = # | %3B = ; )
Here is a method that does the first step in Java:
public static String htmlEscapeUnmappableCharaters(String source, String charset) {
    CharsetEncoder cse = Charset.forName(charset).newEncoder();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < source.length(); i++) {
        if (cse.canEncode(source.charAt(i))) {
            sb.append(source.charAt(i));
        } else {
            sb.append('&');
            sb.append('#');
            sb.append(source.codePointAt(i));
            sb.append(';');
        }
    }
    return sb.toString();
}

Parsing Facebook signed_request using Java returns malformed JSON

I'm trying to parse the Facebook signed_request inside a Java servlet's doPost, and I decode the signed request using commons-codec 1.3's Base64.
Here is the code I use inside the servlet's doPost:
String signedRequest = (String) req.getParameter("signed_request");
String payload = signedRequest.split("[.]", 2)[1];
payload = payload.replace("-", "+").replace("_", "/").trim();
String jsonString = new String(Base64.decodeBase64(payload.getBytes()));
When I print the jsonString it's malformed. Sometimes it is missing the ending } of the JSON, and sometimes it is missing "} at the end of the string.
How can I get the proper JSON response from Facebook?
Facebook is using the URL-safe Base64 variant and you are probably trying to decode the text using the standard Base64 algorithm.
Among other things, the URL variant doesn't require padding with "=".
You could add the required characters in code (padding, etc.), or you can use commons-codec 1.5 (new Base64(true)), where they added support for this encoding.
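A minimal sketch of that second option (assuming commons-codec 1.4 or newer on the classpath, and reusing the payload variable from the question):
import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.binary.Base64;

// Base64(true) selects the URL-safe alphabet ('-' and '_') and tolerates the
// missing '=' padding, so no manual character replacement or padding is needed.
byte[] decoded = new Base64(true).decode(payload);
String jsonString = new String(decoded, StandardCharsets.UTF_8);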
Facebook is sending you "unpadded" Base64 values (the URL variant), and this is problematic for Java decoders that don't expect it. You can tell you have the problem when the Base64-encoded data you want to decode has a length that is not a multiple of 4.
I used this function to fix the values:
public static String padBase64(String b64) {
    String padding = "";
    // If you are a Java developer, *this* is the critical bit: FB expects the
    // Base64 decode to do this padding for you (as the PHP one apparently does).
    switch (b64.length() % 4) {
        case 0:
            break;
        case 1:
            padding = "===";
            break;
        case 2:
            padding = "==";
            break;
        default:
            padding = "=";
    }
    return b64 + padding;
}
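A usage sketch combining this with the question's code (which already maps '-' and '_' back to '+' and '/'):
// Pad the value first, then decode it with the standard commons-codec decoder.
String jsonString = new String(Base64.decodeBase64(padBase64(payload).getBytes()));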
I have never done this in Java so I don't have a full answer, but the fact that you are sometimes losing one and sometimes two characters from the end of the string suggests it may be an issue with Base64 padding. You might want to output the value of payload and check whether, when it ends with '=', jsonString is missing '}', and when it ends with '==', jsonString is missing '"}'. If that seems to be the case, then something is going wrong with the interpretation of the equals signs at the end of payload, which are supposed to represent empty bits.
Edit: On further reflection I believe this is because Facebook is using Base64 URL encoding (which does not add = as pad chars) instead of regular Base64, whereas your decoding function is expecting regular Base64 with the trailing = chars.
I've upgraded to commons-codec 1.5 using code very similar to this and am not experiencing this issue. Have you confirmed that payload really is malformed by using an online decoder?
Hello in the year 2021.
The other answers are obsolete, because with Java 8 and newer you can decode the base64url scheme by using the new Base64.getUrlDecoder() (instead of getDecoder).
The base64url scheme is a URL and filename safe dialect of the main base64 scheme and uses "-" instead of "+" and "_" instead of "/" (because the plus and slash chars have special meanings in URLs). Also it does not use "=" chars for the padding (0 to 4 chars) at the end of string.
Here is how you can parse the Facebook signed_request parameter in Java into a Map object:
public static Map<String, String> parseSignedRequest(HttpServletRequest httpReq, String facebookSecret) throws ServletException {
    String signedRequest = httpReq.getParameter("signed_request");
    String[] splitArray = signedRequest.split("\\.", 2);
    String sigBase64 = splitArray[0];
    String payloadBase64 = splitArray[1];
    String payload = new String(Base64.getUrlDecoder().decode(payloadBase64));
    try {
        Mac sha256_HMAC = Mac.getInstance("HmacSHA256");
        SecretKeySpec secretKey = new SecretKeySpec(facebookSecret.getBytes(), "HmacSHA256");
        sha256_HMAC.init(secretKey);
        String sigExpected = Base64.getUrlEncoder().withoutPadding().encodeToString(sha256_HMAC.doFinal(payloadBase64.getBytes()));
        if (!sigBase64.equals(sigExpected)) {
            LOG.warn("sigBase64 = {}", sigBase64);
            LOG.warn("sigExpected = {}", sigExpected);
            throw new ServletException("Invalid sig = " + sigBase64);
        }
    } catch (IllegalStateException | InvalidKeyException | NoSuchAlgorithmException ex) {
        throw new ServletException("parseSignedRequest", ex);
    }
    // use Jetty JSON parsing or some other library
    return (Map<String, String>) JSON.parse(payload);
}
I have used the Jetty JSON parser:
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-util</artifactId>
    <version>9.4.43.v20210629</version>
</dependency>
but there are more libraries available in Java for parsing JSON.

UTF-8 Encoding in java, retrieving data from website

I'm trying to get data from a website that is encoded in UTF-8 and insert it into the database (MySQL). The database is also encoded in UTF-8.
This is the method I use to download data from a specific site:
public String download(String url) throws java.io.IOException {
    java.io.InputStream s = null;
    java.io.InputStreamReader r = null;
    StringBuilder content = new StringBuilder();
    try {
        s = (java.io.InputStream) new URL(url).getContent();
        r = new java.io.InputStreamReader(s, "UTF-8");
        char[] buffer = new char[4 * 1024];
        int n = 0;
        while (n >= 0) {
            n = r.read(buffer, 0, buffer.length);
            if (n > 0) {
                content.append(buffer, 0, n);
            }
        }
    } finally {
        if (r != null) r.close();
        if (s != null) s.close();
    }
    return content.toString();
}
If the encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8");), the data inserted into the database seems to look OK, but when I try to display it I get something like C�te d'Ivoire instead of Côte d'Ivoire.
All my websites are encoded in UTF-8.
Please help.
If the encoding is set to 'windows-1252' (r = new java.io.InputStreamReader(s, "windows-1252");), everything works fine and I get Côte d'Ivoire on my website, but in Java this title looks like 'C?´te d'Ivoire', which breaks other things, such as links. What does that mean?
I would consider using commons-io; its IOUtils.toString does what you want to do.
That is, replace your code with this:
public String download(String url) throws java.io.IOException {
    java.io.InputStream s = null;
    String content = null;
    try {
        s = (java.io.InputStream) new URL(url).getContent();
        content = IOUtils.toString(s, "UTF-8");
    } finally {
        if (s != null) s.close();
    }
    return content;
}
If that doesn't help, start looking into whether you can store the data to a file correctly, to rule out the possibility that your DB isn't set up correctly.
Java
The problem seems to lie in the HttpServletResponse, if you have a servlet or JSP page. Make sure to set your HttpServletResponse encoding to UTF-8.
In a JSP page, or in the doGet or doPost of a servlet, before any content is sent to the response, just do:
response.setCharacterEncoding("UTF-8");
PHP
In PHP, try using the utf8_encode function after retrieving the data from the database.
Is your database encoding set to UTF-8 for the server, client and connection, and have the tables been created with that encoding? Check 'show variables' and 'show create table <one-of-the-tables>'.
If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
Thus, the encoding during the display is wrong. How are you displaying it? As per the comments, it's a PHP page? If so, then you need to take two things into account:
1. Write it to the HTTP response output using the same encoding, thus UTF-8.
2. Set the content type to UTF-8 so that the web browser knows which encoding to use to display the text.
As per the comments, you have apparently already done 2. That leaves 1: in PHP you need to install mbstring and set mbstring.http_output to UTF-8 as well. I have found this cheat sheet very useful.
