Java servlet JSON object containing XML, encoding problems

I have a servlet that should reply to requests with JSON of the form {obj:XML} (a JSON object with an XML document inside).
The XML is encoded in UTF-8 and contains characters such as पोलैंड.
The XML lives in an org.w3c.dom.Document and I am using the JSON.org library to build the JSON. When I try to print it to the ServletOutputStream, the characters are not encoded correctly. I have also tested by writing the response to a file, and the encoding is not UTF-8.
Parser.printTheDom(documentFromInputStream, byteArrayOutputStream);
OutputStreamWriter oS = new OutputStreamWriter(servletOutputStream, "UTF-8");
oS.write(jsonCallBack + "(");
oS.write(byteArrayOutputStream.toString());
oS.write(");");
I have also tried, even locally (without deploying the servlet), both the code above and the following:
oS.write("पोलैंड");
and the result is the same.
Instead, when I print the document directly, the file is well-formed XML:
oS.write(jsonCallBack + "(");
Parser.printTheDom(documentFromInputStream, oS);
oS.write(");");
Any help?

Typically, if binary data needs to be part of an XML document, it is base64 encoded. See this question for more details. I suggest you base64 encode the fields that can contain exotic UTF-8 characters and base64 decode them on the client side.
See this question for two good options for base64 encoding/decoding in Java.
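For example, a minimal sketch with the JDK's built-in java.util.Base64 (Java 8+; the linked question lists other options), round-tripping one field with exotic characters:
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64FieldRoundTrip {
    public static void main(String[] args) {
        String field = "पोलैंड"; // a field with exotic UTF-8 characters

        // Encode the field's UTF-8 bytes so the JSON payload stays plain ASCII.
        String encoded = Base64.getEncoder()
                .encodeToString(field.getBytes(StandardCharsets.UTF_8));

        // The client decodes the base64 text back into the original string.
        String decoded = new String(Base64.getDecoder().decode(encoded),
                StandardCharsets.UTF_8);

        System.out.println(encoded);
        System.out.println(decoded); // पोलैंड
    }
}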

Related

How to convert MultipartFile to UTF-8 always while using CSVFormat?

I am using a Spring Boot REST API to upload a CSV file as a MultipartFile. The CSVFormat class of org.apache.commons.csv is used to read the MultipartFile, CSVParser parses it, and the iterated records are stored in a MySQL database.
csvParser = CSVFormat.DEFAULT
.withDelimiter(separator)
.withIgnoreSurroundingSpaces()
.withQuote('"')
.withHeader(CsvHeaders.class)
.parse(new InputStreamReader(csvFile.getInputStream()));
The observation is that when the CSV files are uploaded with a charset of UTF-8, everything works fine. But if the CSV file uses a different encoding (ANSI etc.), German and other non-ASCII characters end up as random symbols.
For example, äößü come out as ����
I tried the below to specify the encoding explicitly, but it did not work either.
csvParser = CSVFormat.DEFAULT
.withDelimiter(separator)
.withIgnoreSurroundingSpaces()
.withQuote('"')
.withHeader(CsvHeaders.class)
.parse(new InputStreamReader(csvFile.getInputStream(), StandardCharsets.UTF_8));
Can you please advise? Thank you so much in advance.
What you did, new InputStreamReader(csvFile.getInputStream(), StandardCharsets.UTF_8), tells the CSV parser that the content of the input stream is UTF-8 encoded.
Since the platform default encoding is usually UTF-8, this is effectively the same as using new InputStreamReader(csvFile.getInputStream()).
If I understand your question correctly, this is not what you intended. Instead, you want to automatically choose the right encoding based on the imported file, right?
Unfortunately, the CSV format does not store which encoding was used.
There are some libraries you could use to guess the most probable encoding based on the characters contained in the file (see the sketch below). While they are pretty accurate, they are still guessing, and there is no guarantee that you will get the right encoding in the end.
Depending on your use case it might be easier to just agree with the consumer on a fixed encoding (i.e. they can upload UTF-8 or ANSI, but not both).
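If a best-effort guess is acceptable, here is a minimal sketch using ICU4J's CharsetDetector as one such library; it reuses csvFile, csvParser, separator, and CsvHeaders from the question and assumes ICU4J is on the classpath. Treat the detected charset as a guess, not a guarantee:
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Guess the charset of the uploaded bytes, then build the reader with it.
byte[] bytes = csvFile.getBytes();                 // MultipartFile content
CharsetDetector detector = new CharsetDetector();
detector.setText(bytes);
CharsetMatch match = detector.detect();            // most probable match, may be null

Charset charset = (match != null)
        ? Charset.forName(match.getName())
        : StandardCharsets.UTF_8;                  // fall back to UTF-8

Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset);
csvParser = CSVFormat.DEFAULT
        .withDelimiter(separator)
        .withIgnoreSurroundingSpaces()
        .withQuote('"')
        .withHeader(CsvHeaders.class)
        .parse(reader);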
Try as shown below, which worked for me for the same issue:
new InputStreamReader(csvFile.getInputStream(), "UTF-8")

com.fasterxml.jackson.databind.ObjectMapper encoding for localised characters

A little background before I mention my main issue:
We have a module that converts POJOs to JSON via FasterXML (Jackson). Multiple XMLs are first converted into POJOs and then into JSON.
These individual JSONs are then merged into a single JSON and processed by a third party.
The issue: up until the point the single JSON is formed, everything looks fine.
Once all the JSONs are merged and written to a file, the localised characters are all escaped, whereas we want them to look like they do in the individual JSONs.
E.g. single JSON snippet:
{"title":"Web サーバに関するお知らせ"}
E.g. merged JSON snippet:
{"title":"Web \u30b5\u30fc\u30d0\u306b\u95a2\u3059\u308b\u304a\u77e5\u3089\u305b"}
byte[] jsonBytes = objectMapper.writeValueAsBytes(object);
String jsonString = new String(jsonBytes, "UTF-8");
This JSON string is then written to a file:
BufferedWriter writer = new BufferedWriter(new FileWriter(finalJsonPath));
writer.write(jsonString);
I also tried the following, as I thought we need UTF-8 encoding here for the localised characters:
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(finalJsonPath),"UTF-8"));
writer.write(jsonString);
The same ObjectMapper code is used to write each individual JSON as well, and the escaping does not appear at that point.
Can anyone please point out what is causing the encoding issue at the merged JSON level?
PS: the code is part of a WAR deployed on Tomcat. Initially we saw ??? (question marks) in the JSON, after which we added the following to catalina.sh:
JAVA_OPTS="$JAVA_OPTS -Dfile.encoding=UTF-8"
Later on, I also added the servlet request encoding, but that did not help:
JAVA_OPTS="$JAVA_OPTS -Dfile.encoding=UTF-8 -Djavax.servlet.request.encoding=UTF-8"
Thanks!
I just observed that the code was post-processing the merged JSON: it runs the native2ascii command on the merged JSON, which is why the localised content was being converted into ASCII escapes.
I ran native2ascii on the JSON with the -reverse option and my finding was confirmed; -reverse reverted the ASCII escaping.
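If removing the native2ascii step is not an option, a round trip through Jackson also undoes the \uXXXX escapes, because the parser decodes them on read and the writer does not escape non-ASCII by default. A minimal sketch, with hypothetical file paths:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class UnescapeMergedJson {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Reading decodes \u30b5\u30fc... escapes back into real characters.
        JsonNode merged = mapper.readTree(new File("merged.json"));

        // Writing through an explicit UTF-8 writer keeps them unescaped.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("merged-utf8.json"), StandardCharsets.UTF_8)) {
            mapper.writeValue(out, merged);
        }
    }
}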

Base64 error: The image contents is not valid base64 data

I am streaming an image to Magento, encoding it with android.util.Base64 using either of:
Base64.encodeToString(content, Base64.CRLF)
Base64.encodeToString(content, Base64.DEFAULT)
But I always receive the fault:
The image contents is not valid base64 data
Working: I found that the data had to be encoded twice, once using Base64 and again using a custom library.
Try removing the data-URI prefix from your base64 image string.
E.g. if you have data like data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAVQAAABXCAYAA...
then remove the prefix. It should look like below; pass that to Magento.
iVBORw0KGgoAAAANSUhEUgAAAVQAAABXCAYAA...
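For reference, a minimal sketch of stripping the prefix before sending the string to Magento (the imageData variable is hypothetical):
String imageData = "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAVQAAABXCAYAA...";

// Keep only the raw base64 payload after the comma when a data-URI prefix is present.
String base64Payload = imageData.startsWith("data:")
        ? imageData.substring(imageData.indexOf(',') + 1)
        : imageData;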

Java RDF validation

I have an RDF file containing some errors (probably unrecognized characters).
Is there any way to find these errors in Java?
Any XML document can declare its encoding in the header, and UTF-8 is the default. If your XML contains bytes that the SAX parser can't decode, you don't have "well-formed" XML. Another option is to tell the InputStreamReader you use the correct charset/encoding.
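As a first step, here is a minimal well-formedness check with the JDK's built-in SAX parser; it reports the line and column of the first unrecognized character or markup error, but it does not validate RDF semantics (the file name is hypothetical):
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class RdfWellFormednessCheck {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            // Read with an explicit charset instead of letting the parser guess.
            InputSource source = new InputSource(new InputStreamReader(
                    new FileInputStream("data.rdf"), StandardCharsets.UTF_8));
            parser.parse(source, new DefaultHandler());
            System.out.println("Well-formed XML");
        } catch (SAXParseException e) {
            System.out.printf("Error at line %d, column %d: %s%n",
                    e.getLineNumber(), e.getColumnNumber(), e.getMessage());
        }
    }
}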

Java string encoding conversion within a webpage

I have a webpage that is encoded (through its header) as WIN-1255.
A Java program creates text strings that are automatically embedded in the page. The problem is that the original strings are encoded in UTF-8, which produces gibberish text in the page.
Unfortunately, I cannot change the page encoding; it is required by a customer's proprietary system.
Any ideas?
UPDATE:
The page I'm creating is an RSS feed that needs to be set to WIN-1255, showing information taken from another feed that is encoded in UTF-8.
SECOND UPDATE:
Thanks for all the responses. I had managed to convert the string, and yet, gibberish. The problem was that the XML encoding declaration has to be set in addition to the header encoding.
Adam
To the point, you need to set the encoding of the response writer. With only a response header you're basically only instructing the client application which encoding to use to interpret/display the page. This ain't going to work if the response itself is written with a different encoding.
The context where you have this problem is entirely unclear (please elaborate about it as well in future problems like this), so here are several solutions:
If it is JSP, you need to set the following in top of JSP to set the response encoding:
<%@ page pageEncoding="windows-1255" %>
If it is Servlet, you need to set the following before any first flush to set the response encoding:
response.setCharacterEncoding("windows-1255");
Both, by the way, implicitly set the Content-Type response header with a charset parameter, instructing the client to use the same encoding to interpret/display the page. Also see this article for more information.
If it is a homegrown application which relies on the basic java.net and/or java.io API's, then you need to write the characters through an OutputStreamWriter which is constructed using the constructor taking 2 arguments wherein you can specify the encoding:
Writer writer = new OutputStreamWriter(someOutputStream, "windows-1255");
Assuming you have control of the original (properly represented) strings, and simply need to output them in win-1255:
import java.nio.charset.*;
import java.nio.*;
Charset win1255 = Charset.forName("windows-1255");
ByteBuffer bb = win1255.encode(someString);
byte[] ba = new byte[bb.limit()];
Then, simply write the contents of ba at the appropriate place.
EDIT: What you do with ba depends on your environment. For instance, if you're using servlets, you might do:
ServletOutputStream os = ...
os.write(ba);
We also should not overlook the possible approach of calling setContentType("text/html; charset=windows-1255") and then using getWriter() normally. You did not make completely clear whether windows-1255 was being set in a meta tag or in the HTTP response header.
You clarified that you have a UTF-8 file that you need to decode. If you're not already decoding the UTF-8 strings properly, this should be no big deal. Just look at new InputStreamReader(someInputStream, Charset.forName("utf-8")).
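Putting those two pieces together, a minimal sketch that reads a UTF-8 file and re-writes it as windows-1255 (the file names are hypothetical; characters with no windows-1255 mapping will be replaced):
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8ToWindows1255 {
    public static void main(String[] args) throws Exception {
        try (Reader in = new InputStreamReader(
                     new FileInputStream("feed-utf8.xml"), StandardCharsets.UTF_8);
             Writer out = new OutputStreamWriter(
                     new FileOutputStream("feed-1255.xml"), Charset.forName("windows-1255"))) {
            char[] buffer = new char[4096];
            int read;
            // Decode UTF-8 on the way in, encode windows-1255 on the way out.
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}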
What's embedding the data in the page? Either it should read it as text (in UTF-8) and then write it out again in the web page's encoding (Win-1255) or you should change the Java program to create the files (or whatever) in Win-1255 to start with.
If you can give more details about how the system works (what's generating the web page? How does it interact with the Java program?) then it will make things a lot clearer.
The page I'm creating is an RSS feed that needs to be set to WIN-1255, showing information taken from another feed that is encoded in UTF-8.
In this case, use a parser to load the UTF-8 XML. This should correctly decode the data to UTF-16 character data (Java Strings are always UTF-16). Your output mechanism should encode from UTF-16 to Windows-1255.
byte[] originalUtf8; // here: input (UTF-8 encoded bytes)
// UTF-8 bytes to a Java String:
String internal = new String(originalUtf8, Charset.forName("utf-8"));
// Java String to windows-1255 bytes:
byte[] win1255 = internal.getBytes(Charset.forName("cp1255"));
// here: output win1255
