Parameters charset conversion in struts2 - java

I have a struts2 web application which accepts both POST and GET requests in many different charsets, does conversion of them into utf-8, displays the correct utf-8 characters on the screen and then writes them into utf-8 database.
I have tried at least five different methods of doing a simple lossless charset conversion from windows-1250 to UTF-8 to start with, and none of them worked. UTF-8 being the "larger set", it should work without a problem (at least this is my understanding).
Can you propose how to do a charset conversion from windows-1250 to UTF-8? And is it possible that struts2 is doing something weird with the params charset, which would explain why I can't seem to get it right?
This is my latest attempt:
String inputData = getSimpleParamValue("some_input_param_from_get");
Charset inputCharset = Charset.forName("windows-1250");
Charset utfCharset = Charset.forName("UTF-8");
CharsetDecoder decoder = inputCharset.newDecoder();
CharsetEncoder encoder = utfCharset.newEncoder();
String decodedData = "";
try {
ByteBuffer inputBytes = ByteBuffer.wrap(inputData.getBytes()); // I've tried putting UTF-8 here as well, with no luck
CharBuffer chars = decoder.decode(inputBytes);
ByteBuffer utfBytes = encoder.encode(chars);
decodedData = new String(utfBytes.array());
} catch (CharacterCodingException e) {
logger.error(e);
}
Any ideas on what to try to get this working?
Thank you and best regards,
Bozo

I'm not sure about your scenario. In Java, a String is Unicode; one only deals with charset conversion when converting between a String and a binary representation.
In your example, when getSimpleParamValue("some_input_param_from_get") is called, inputData should already hold the "correct" String: the conversion from the stream of bytes (which travelled from the client browser to the web server) to a string should have already taken place (that is the responsibility of the web server plus the web layer of your application).
For this, I enforce UTF-8 for the web transmission by placing a filter in web.xml (before Struts), for example:
public class CharsetFilter implements Filter {
public void destroy() {}
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
HttpServletRequest req = (HttpServletRequest) request;
HttpServletResponse res = (HttpServletResponse) response;
req.setCharacterEncoding("UTF-8");
chain.doFilter(req, res);
String contentType = res.getContentType();
if( contentType !=null && contentType.startsWith("text/html"))
res.setCharacterEncoding("UTF-8");
}
public void init(FilterConfig filterConfig) throws ServletException {
}
}
If you cannot do this, and if your getSimpleParamValue() "errs" in the charset conversion (e.g. it assumed the byte stream was UTF-8 when it was windows-1250), you now have an "incorrect" string, and you must try to recover it by undoing and redoing the byte-to-string conversion. For that you must know both the wrong and the correct charset and, worse, deal with the possibility of missing characters (if the data was interpreted as UTF-8, the decoder might have found illegal byte sequences).
If you have to deal with this in a Struts2 action, I'd say you are in trouble; you should deal with it explicitly before or after it (in the upper web layer, in the database driver, in the file encoding, or wherever the bytes are converted).
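The "undo and redo" step described above can be sketched as follows. This is a minimal illustration, not the answerer's code: the class and method names are hypothetical, and it assumes the wrong charset was ISO-8859-1 (which maps every byte to exactly one character, so the original bytes can be recovered without loss). If the wrong charset was UTF-8, as in the example above, some bytes may already have been replaced and the recovery is not guaranteed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MisdecodedParamDemo {

    /** Undo the wrong decoding, then redo it with the charset the client really used. */
    static String redecode(String misdecoded, Charset wrongCharset, Charset actualCharset) {
        // Encoding with the wrong charset gives back the bytes that arrived on the wire
        // (only lossless when the wrong charset maps every byte, e.g. ISO-8859-1).
        byte[] originalBytes = misdecoded.getBytes(wrongCharset);
        return new String(originalBytes, actualCharset);
    }

    public static void main(String[] args) {
        Charset windows1250 = Charset.forName("windows-1250");
        // Czech sample text written with Unicode escapes so the source file encoding does not matter.
        String original = "\u017Dlu\u0165ou\u010Dk\u00FD k\u016F\u0148";

        // Simulate a web layer that decoded windows-1250 bytes as ISO-8859-1.
        String misdecoded = new String(original.getBytes(windows1250), StandardCharsets.ISO_8859_1);

        String recovered = redecode(misdecoded, StandardCharsets.ISO_8859_1, windows1250);
        System.out.println(recovered.equals(original));
    }
}
```

Once the String is correct, no further "conversion to UTF-8" is needed inside the application; UTF-8 only matters again when the String is written out as bytes.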

Related

Turkish characters could not be displayed in website. Tomcat Server, Mysql Database

Turkish characters are displayed as '?' and are inserted into the database incorrectly.
I can change the text in phpMyAdmin and make it Turkish again; after that everything works perfectly.
My encoding filter:
private String encoding;
public void init(FilterConfig config) throws ServletException {
encoding = config.getInitParameter("requestEncoding");
if (encoding == null)
encoding = "UTF-8";
}
public void doFilter(ServletRequest request, ServletResponse response,
FilterChain next) throws IOException, ServletException {
// Respect the client-specified character encoding
// (see HTTP specification section 3.4.1)
if (null == request.getCharacterEncoding())
request.setCharacterEncoding(encoding);
/**
* Set the default response content type and encoding
*/
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
next.doFilter(request, response);
}
public void destroy() {
}
I don't have the Turkish character problem on localhost. I see the problem on www.gurkanilleez.com only.
Question marks for each non-English character? What probably happened:
you had utf8-encoded data (good)
SET NAMES latin1 was in effect (default, but wrong)
the column was declared CHARACTER SET latin1 (default, but wrong)
Use utf8 instead of latin1.
For JDBC, include ?useUnicode=yes&characterEncoding=UTF-8 in the getConnection() call.
There are several threads about Turkish character set problems in stackoverflow; read them.
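As a rough sketch of the JDBC advice above, the two parameters are simply appended to the connection URL. The class name, host, port and database name below are placeholders, not values from the question; the resulting string would be passed to DriverManager.getConnection(url, user, password).

```java
class JdbcUrlDemo {

    /** Builds a MySQL JDBC URL that asks the driver to exchange text as UTF-8. */
    static String utf8JdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?useUnicode=yes&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        // "localhost", 3306 and "mydb" are hypothetical values used only for illustration.
        System.out.println(utf8JdbcUrl("localhost", 3306, "mydb"));
    }
}
```

The URL only covers the connection; the table and column character sets still have to be changed on the database side as described above.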
I cannot comment because of my reputation. I visited your website and I see the Turkish characters as they should be. So is there no longer a problem?

Unicode strings from Java to Javascript via JSON

In a Java servlet I'm doing:
protected void handleRequests(HttpServletRequest request, HttpServletResponse response) throws IOException {
PrintWriter pw = response.getWriter();
/*...*/
Vector<String> buf = new Vector<>();
for(...) {
ret.add(">žd¿ [?²„·ÜðÈ ‘");
}
/*JSONArray*/ responseArray.put(responseArray.length(), buf);
/*...*/
pw.println(responseArray);
pw.close();
}
In a web page client javascript I'm doing a XMLHttpRequest and the reply is incorrect, looks like: >?d¿ [\u001a?²\u201e·ÜðÈ \u2018
(for the above >žd¿ [?²„·ÜðÈ ‘ input)
Then I tried on the servlet:
ret.add(URLEncoder.encode(">žd¿ [?²„·ÜðÈ ‘", "UTF-8"));
and I get:
%3E%C5%BEd%C2%BF%C2%A0%5B%7F%1A%3F%C2%B2%E2%80%9E%C2%B7%C3%9C%C3%B0%C3%88%C2%A0%E2%80%98
in javascript, then I apply:
unescape(reply.replace(/\+/g,' ')) (the replace is needed because + signs are not converted to spaces)
which nets me:
>žd¿ [?²â??·Ã?ðÃ? â
What do I do wrong?
(Some other questions tell me the servlet should send UTF-8. But when do I encode as UTF-8: before placing the string inside a JSON object (I use org.json), or after, by calling .toString on the JSON response array and converting that to UTF-8 before PrintWriter.println?)
P.S. This is not all my code, I've inherited it and some of the theoretical background I'm lacking.
Edit:
doing a decodeURIComponent(reply).replace(/\+/g,' ') in javascript seems to do the trick. But I could not find the difference between URLEncoder.encode and decodeURIComponent. Is the +/space the only mismatch?
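On the decoding side, the space handling is the mismatch that matters: URLEncoder.encode produces application/x-www-form-urlencoded output, where a space becomes "+", while decodeURIComponent leaves "+" untouched. One hedged way to avoid the extra replace in JavaScript is to fix this on the Java side, as in the sketch below (the class and method names are made up for illustration). Note that replacing "+" after decoding would also turn a literal plus sign, sent as %2B, into a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UriComponentEncodingDemo {

    /** Percent-encodes a value so that JavaScript's decodeURIComponent restores it as is. */
    static String encodeForDecodeURIComponent(String value) {
        try {
            // URLEncoder writes spaces as "+"; a literal "+" is already escaped as %2B,
            // so every remaining "+" stands for a space and can safely become %20.
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is required to be supported by every JVM", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeForDecodeURIComponent("a b+c \u017E"));
    }
}
```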
decodeURIComponent nets the expected result
decodeURIComponent("%3E%C5%BEd%C2%BF%C2%A0%5B%7F%1A%3F%C2%B2%E2%80%9E%C2%B7%C3%9C%C3%B0%C3%88%C2%A0%E2%80%98");
">žd¿ [?²„·ÜðÈ ‘"

How to unescape html special characters in Java?

I have some text strings that I need to process and inside the strings there are HTML special characters. For example:
10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂
I would like to convert those characters to utf-8.
I used org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 but didn't have any luck. Is there an easy way to deal with this problem?
Apache commons-text library has the StringEscapeUtils class that has the unescapeHtml4() utility method.
String utf8Str = StringEscapeUtils.unescapeHtml4(htmlStr);
You may also need unescapeXml()
@Bohemian's code is correct; it works for me. Your un-encoded string is 10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂.
Now, I'm adding another answer instead of commenting on Bohemian's answer because there are two things that still need to be mentioned:
I copy-pasted your string into HTML code and the browser can't render your characters properly, because your string is incorrectly encoded, i.e. it encodes the high surrogate and the low surrogate of each supplementary character separately, instead of encoding the whole code point (it seems the original string was UTF-16 encoded, maybe a Java String?).
You want the string to be re-encoded to UTF-8.
Once you have un-encoded your string with StringEscapeUtils.unescapeHtml(htmlStr) (which succeeds despite the incorrect encoding), it doesn't make much sense to talk about "string encodings", because Java strings are "unaware" of encodings (although they use UTF-16 internally).
If you need a group of bytes containing a UTF-8 encoded "string", you need to get the "raw" bytes from a String encoded as UTF-8:
String javaStr = StringEscapeUtils.unescapeHtml(htmlStr);
byte[] rawUtf8String = javaStr.getBytes("UTF-8");
And do with such byte array whatever you need.
Now if what you need is to write a UTF-8 encoded string to a File, instead of that byte array you need to specify the encoding when you create the proper java.io.Writer.
Try this code to un-encode your string (change the file path first) and then open the resulting file in any editor that supports UTF-8:
java.io.Writer approach (better):
public static void main(String[] args) throws IOException {
String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
String javaString = StringEscapeUtils.unescapeHtml(str);
try(Writer output = new OutputStreamWriter(
new FileOutputStream("/path/to/testing.txt"), "UTF-8")) {
output.write(javaString);
}
}
java.io.OutputStream approach (if you already have a "raw string"):
public static void main(String[] args) throws IOException {
String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
String javaString = StringEscapeUtils.unescapeHtml(str);
try(OutputStream output = new FileOutputStream("/path/to/testing.txt")) {
for (byte b : javaString.getBytes(Charset.forName("UTF-8"))) {
output.write(b);
}
}
}
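To illustrate why separately encoded surrogates still un-escape correctly, here is a small standard-library sketch. It is not the Commons implementation; the helper name is hypothetical, and it assumes the input used decimal numeric character references such as &#55357;&#56877; (one reference per UTF-16 code unit). Appending the two halves one after the other rebuilds a valid surrogate pair, which Java then treats as a single code point.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumericEntityDemo {

    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    /** Replaces decimal numeric character references with the characters they denote. */
    static String unescapeDecimalEntities(String input) {
        Matcher matcher = DECIMAL_ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            out.append(input, last, matcher.start());
            int value = Integer.parseInt(matcher.group(1));
            if (Character.isValidCodePoint(value)) {
                // Surrogate halves are appended as single chars; two in a row form one pair.
                out.appendCodePoint(value);
            } else {
                out.append(matcher.group()); // leave out-of-range references untouched
            }
            last = matcher.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = unescapeDecimalEntities("10&#55357;&#56877;");
        System.out.println(text.codePointCount(0, text.length()));
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);
    }
}
```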

How to check the charset of string in Java?

In my application I'm getting the user info from LDAP and sometimes the full username comes in a wrong charset. For example:
ТеÑÑ61 ТеÑÑовиÑ61
It can also be in English or in Russian and displayed correctly. If the username changes, it's updated in the database. Even if I change the value in the db it won't solve the problem.
I can fix it before saving by doing this
new String(incorrect.getBytes("ISO-8859-1"), "UTF-8");
However, if I apply it to a string that already contains correct Russian characters (e.g. "Тест61 Тестович61"), I get something like "????61 ????????61".
Can you please suggest something that can determine the charset of string?
Strings in java, AFAIK, do not retain their original encoding - they are always stored internally in some Unicode form.
You want to detect the charset of the original stream/bytes; this is why I think your String.getBytes() call comes too late.
Ideally if you could get the input stream you are reading from, you can run it through something like this: http://code.google.com/p/juniversalchardet/
There are plenty of other charset detectors out there as well
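When the raw bytes are not available, one hedged workaround for the specific case in the question (UTF-8 bytes that were decoded as ISO-8859-1) is to apply the repair only when it is provably safe. The sketch below uses hypothetical names and only the JDK: it skips strings that contain characters outside ISO-8859-1, and it keeps the original when the recovered bytes are not valid UTF-8. It remains a guess, since a short Latin-1 string can happen to look like valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class MojibakeRepairDemo {

    /** Re-decodes the value as UTF-8 only if it looks like UTF-8 that was read as ISO-8859-1. */
    static String repairIfMisdecoded(String value) {
        // Correctly decoded Cyrillic text cannot be encoded as ISO-8859-1, so leave it alone.
        if (!StandardCharsets.ISO_8859_1.newEncoder().canEncode(value)) {
            return value;
        }
        byte[] bytes = value.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // A strict decoder throws instead of silently inserting replacement characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return value; // not valid UTF-8, so the string was probably fine already
        }
    }

    public static void main(String[] args) {
        String correct = "\u0422\u0435\u0441\u044261"; // the Russian sample name from the question
        String broken = new String(correct.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(repairIfMisdecoded(broken).equals(correct));
        System.out.println(repairIfMisdecoded(correct).equals(correct));
    }
}
```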
I had the same problem. Tika is too large and juniversalchardet does not detect ISO-8859-1. So I wrote my own, and it is now working well in production:
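// Re-decodes the string: encodes it with fromEncoding, then decodes those bytes with toEncoding.
public String convert(String value, String fromEncoding, String toEncoding) throws UnsupportedEncodingException {
return new String(value.getBytes(fromEncoding), toEncoding);
}
// Returns the first candidate charset through which the value survives a round trip, or UTF-8 if none does.
public String charset(String value, String charsets[]) throws UnsupportedEncodingException {
String probe = StandardCharsets.UTF_8.name();
for(String c : charsets) {
Charset charset = Charset.forName(c); // throws if the name is unknown; it never returns null
if(value.equals(convert(convert(value, charset.name(), probe), probe, charset.name()))) {
return c;
}
}
return StandardCharsets.UTF_8.name();
}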
Full description here: Detect the charset in Java strings.
I recommend Apache Tika's CharsetDetector; it is easy to use and robust.
CharsetDetector detector = new CharsetDetector();
detector.setText(yourStr.getBytes());
detector.detect(); // <- return the result, you can check by .getName() method
Further, you can convert any encoded string to your desired one, take utf-8 as example:
detector.getString(yourStr.getBytes(), "utf-8");
I highly appreciate Lluís Turró Cutiller's answer (+1), but want to add a variant based on that.
private String convert(String value, Charset fromEncoding, Charset toEncoding) throws UnsupportedEncodingException {
return new String(value.getBytes(fromEncoding), toEncoding);
}
private boolean probe(String value, Charset charset) throws UnsupportedEncodingException {
Charset probe = StandardCharsets.UTF_8;
return value.equals(convert(convert(value, charset, probe), probe, charset));
}
public String convert(String value, Charset charsetWanted, List<Charset> charsetsOther) throws UnsupportedEncodingException {
if (probe(value, charsetWanted)) {
return value;
}
for (Charset other: charsetsOther) {
if (probe(value, other)) {
return convert(value, other, charsetWanted);
}
}
System.err.println("WARNING: Could not convert string: " + value);
return value;
}
Your LDAP database is set up incorrectly. The application putting data into it should convert to a known character set encoding, in your case, likely UTF_16. Pick a standard. All methods of detecting encoding are guesses.
The application writing the value is the only one that knows definitively which encoding it is using and can properly convert to another encoding such as UTF_16.
In your web-application, you may declare an encoding-filter that makes sure you receive data in the right encoding.
<filter>
<description>Explicitly set the encoding of the page to UTF-8</description>
<filter-name>encodingFilter</filter-name>
<filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
<init-param>
<param-name>encoding</param-name>
<param-value>UTF-8</param-value>
</init-param>
<init-param>
<param-name>forceEncoding</param-name>
<param-value>true</param-value>
</init-param>
</filter>
This Spring-provided filter makes sure that the controllers/servlets receive parameters in UTF-8.

Convert byte[] to Base64 string for data URI

I know this has probably been asked 10000 times, however, I can't seem to find a straight answer to the question.
I have a LOB stored in my db that represents an image; I am getting that image from the DB and I would like to show it on a web page via the HTML IMG tag. This isn't my preferred solution, but it's a stop-gap implementation until I can find a better solution.
I'm trying to convert the byte[] to Base64 using the Apache Commons Codec in the following way:
String base64String = Base64.encodeBase64String({my byte[]});
Then, I am trying to show my image on my page like this:
<img src="data:image/jpg;base64,{base64String from above}"/>
It's displaying the browser's default "I cannot find this image", image.
Does anyone have any ideas?
Thanks.
I used this and it worked fine (contrary to the accepted answer, which uses a format not recommended for this scenario):
StringBuilder sb = new StringBuilder();
sb.append("data:image/png;base64,");
sb.append(StringUtils.newStringUtf8(Base64.encodeBase64(imageByteArray, false)));
contourChart = sb.toString();
According to the official documentation Base64.encodeBase64URLSafeString(byte[] binaryData) should be what you're looking for.
Also mime type for JPG is image/jpeg.
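For comparison, here is a minimal sketch that builds the data URI with the JDK's own java.util.Base64 (Java 8+) instead of Commons Codec; that swap, the class name and the method name are my assumptions, not part of the answers here. It uses the standard Base64 alphabet. The URL-safe variant replaces '+' and '/' with '-' and '_', which is not the alphabet a data URI is expected to carry, so it may be worth checking which encoder is in use if the image does not render.

```java
import java.util.Base64;

class DataUriDemo {

    /** Builds a data URI for the given MIME type using the standard Base64 alphabet. */
    static String toDataUri(String mimeType, byte[] content) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) {
        // The first three bytes of a JPEG file, standing in for the image loaded from the database.
        byte[] jpegStart = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
        System.out.println("<img src=\"" + toDataUri("image/jpeg", jpegStart) + "\"/>");
    }
}
```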
That's the correct syntax. It might be that your web browser does not support the data URI scheme. See Which browsers support data URIs and since which version?
Also, the JPEG MIME type is image/jpeg.
You may also want to consider streaming the images out to the browser rather than encoding them on the page itself.
Here's an example of streaming an image contained in a file out to the browser via a servlet, which could easily be adopted to stream the contents of your BLOB, rather than a file:
public void doGet(HttpServletRequest req, HttpServletResponse resp)
throws ServletException, IOException
{
ServletOutputStream sos = resp.getOutputStream();
try {
final String someImageName = req.getParameter(someKey);
// encode the image path and write the resulting path to the response
File imgFile = new File(someImageName);
writeResponse(resp, sos, imgFile);
}
catch (URISyntaxException e) {
throw new ServletException(e);
}
finally {
sos.close();
}
}
private void writeResponse(HttpServletResponse resp, OutputStream out, File file)
throws URISyntaxException, FileNotFoundException, IOException
{
// Get the MIME type of the file
String mimeType = getServletContext().getMimeType(file.getAbsolutePath());
if (mimeType == null) {
log.warn("Could not get MIME type of file: " + file.getAbsolutePath());
resp.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
return;
}
resp.setContentType(mimeType);
resp.setContentLength((int)file.length());
writeToFile(out, file);
}
private void writeToFile(OutputStream out, File file)
throws FileNotFoundException, IOException
{
final int BUF_SIZE = 8192;
// write the contents of the file to the output stream
FileInputStream in = new FileInputStream(file);
try {
byte[] buf = new byte[BUF_SIZE];
for (int count = 0; (count = in.read(buf)) >= 0;) {
out.write(buf, 0, count);
}
}
finally {
in.close();
}
}
If you don't want to stream from a servlet, then save the file to a directory in the webroot and then create the src pointing to that location. That way the web server does the work of serving the file. If you are feeling particularly clever, you can check for an existing file by timestamp/inode/crc32 and only write it out if it has changed in the DB which can give you a performance boost. This file method also will automatically support ETag and if-modified-since headers so that the browser can cache the file properly.
