com.fasterxml.jackson.databind.ObjectMapper encoding for localised characters - java

A little background before I get to my main issue.
We have a module that converts POJOs to JSON via FasterXML Jackson. Multiple XML files are first converted into POJOs and then into JSON.
These individual JSONs are then merged into a single JSON, which is processed by a third party.
Until the single JSON is formed, everything looks fine.
Once all the JSONs are merged and written to a file, the localised characters are all escaped, whereas we want them to look the way they do in the individual JSONs.
e.g. individual JSON snippet:
{"title":"Web サーバに関するお知らせ"}
e.g. merged JSON snippet:
{"title":"Web \u30b5\u30fc\u30d0\u306b\u95a2\u3059\u308b\u304a\u77e5\u3089\u305b"}
byte[] jsonBytes = objectMapper.writeValueAsBytes(object);
String jsonString = new String(jsonBytes, "UTF-8");
This JSON string is then written to file
BufferedWriter writer = new BufferedWriter(new FileWriter(finalJsonPath));
writer.write(jsonString);
I also tried the following, as I thought we need UTF-8 encoding here for localised characters:
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(finalJsonPath),"UTF-8"));
writer.write(jsonString);
The same ObjectMapper code is used to write the individual JSONs as well, and the escaping does not appear at that point.
Can anyone point out what is causing the encoding issue at the merged JSON level?
PS: The code is part of a WAR which is deployed on Tomcat. Initially we saw ??? (question marks) in the JSON, after which we added the following to catalina.sh:
JAVA_OPTS="$JAVA_OPTS -Dfile.encoding=UTF-8"
Later on, I also added the servlet request encoding, but that did not help:
JAVA_OPTS="$JAVA_OPTS -Dfile.encoding=UTF-8 -Djavax.servlet.request.encoding=UTF-8"
Thanks!

I just noticed that the code was post-processing the merged JSON: it runs the native2ascii command on it, which converted the localised content into ASCII \uXXXX escapes.
I ran native2ascii on the JSON with the -reverse option and that confirmed my finding: -reverse reverted the ASCII escaping.
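To illustrate what the reverse step does, here is a minimal sketch that turns backslash-u escapes back into the characters they encode. The class and method names are made up for this example, and it is a simplification: it does not treat a doubled backslash specially, so it is not a full JSON-aware unescaper. Note that a JSON parser reads the escaped and unescaped forms as the same string, so the escaping only matters if the file has to be human-readable.

```java
final class UnicodeEscapes {

    // Replaces each backslash-u escape (backslash, 'u', four hex digits) with
    // the character it encodes. All other text is copied through unchanged.
    static String unescape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\' && i + 6 <= input.length()
                    && input.charAt(i + 1) == 'u' && isHex(input, i + 2)) {
                out.append((char) Integer.parseInt(input.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // True if the four characters starting at 'start' are all hex digits.
    private static boolean isHex(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the escapes as literal text in the Java string.
        System.out.println(unescape("{\"title\":\"Web \\u30b5\\u30fc\\u30d0\"}"));
    }
}
```

If the escaped file is not actually needed, removing the native2ascii step from the processing is simpler than reversing it afterwards.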

Related

How to convert MultipartFile to UTF-8 always while using CSVFormat?

I am using a Spring Boot REST API to upload a CSV file as a MultipartFile. CSVFormat from org.apache.commons.csv is used to configure the format, CSVParser parses the file, and the iterated records are stored in a MySQL database.
csvParser = CSVFormat.DEFAULT
.withDelimiter(separator)
.withIgnoreSurroundingSpaces()
.withQuote('"')
.withHeader(CsvHeaders.class)
.parse(new InputStreamReader(csvFile.getInputStream()));
When the uploaded CSV file is encoded in UTF-8, it works well. But if the file uses a different encoding (ANSI etc.), German and other non-ASCII characters are turned into random symbols.
Example: äößü becomes ����
I tried the following to specify the encoding, but it did not work either:
csvParser = CSVFormat.DEFAULT
.withDelimiter(separator)
.withIgnoreSurroundingSpaces()
.withQuote('"')
.withHeader(CsvHeaders.class)
.parse(new InputStreamReader(csvFile.getInputStream(), StandardCharsets.UTF_8));
Can you please advise. Thank you so much in advance.
What you did, new InputStreamReader(csvFile.getInputStream(), StandardCharsets.UTF_8), tells the reader that the content of the input stream is UTF-8 encoded, so the CSV parser receives characters decoded as UTF-8.
Without a charset argument, InputStreamReader uses the platform default charset. Since that is (usually) UTF-8, this is often the same as using new InputStreamReader(csvFile.getInputStream()).
If I understand your question correctly, this is not what you intended. Instead, you want to choose the right encoding automatically based on the imported file, right?
Unfortunately, the CSV format does not store which encoding was used.
There are some libraries you could use to guess the most probable encoding based on the bytes contained in the file. While they are fairly accurate, they are still guessing, and there is no guarantee that you will get the right encoding in the end.
Depending on your use case, it might be easier to just agree with the consumer on a fixed encoding (i.e. they can upload UTF-8 or ANSI, but not both).
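If only two encodings are realistically possible, one simple heuristic (not what the detection libraries do, just a sketch) is to try a strict UTF-8 decode first and fall back to a single-byte charset when the bytes are not valid UTF-8. The class name and the choice of windows-1252 as the fallback are assumptions for this example; "ANSI" can mean a different code page on other systems.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CsvCharsetGuess {

    // Assumed fallback: "ANSI" on a Western European Windows machine is
    // usually windows-1252. Adjust this to whatever your uploaders produce.
    static final Charset FALLBACK = Charset.forName("windows-1252");

    // Decodes as UTF-8 if the bytes are valid UTF-8, otherwise as FALLBACK.
    static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of
                    .onUnmappableCharacter(CodingErrorAction.REPORT) // inserting U+FFFD
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, FALLBACK);
        }
    }
}
```

With the upload in the question, the decoded text could then be handed to the parser through a reader, for example new StringReader(CsvCharsetGuess.decode(csvFile.getBytes())). This still is a guess: a windows-1252 file whose bytes happen to form valid UTF-8 would be decoded as UTF-8.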
Try the following, which worked for me for the same issue:
new InputStreamReader(csvFile.getInputStream(), "UTF-8")

Write a text file encoded in UTF-8 with a BOM through java.nio

I have to write the output of a database query to a csv file.
Unfortunately, many people in my company are not able to use a nice editor like Notepad++ and keep opening csv files with Excel.
When I write a text/csv file using java.nio like this
public static void main(String[] args) {
Path path = Paths.get("U:\\temp\\TestOutput\\csv_file.csv");
List<String> lines = Arrays.asList("Übernahme", "Außendarstellung", "€", "#", "UTF-8?");
try {
Files.write(path, lines, StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW);
} catch (IOException e) {
e.printStackTrace();
}
}
the file gets created successfully and is encoded in UTF-8.
Now the problem is the missing BOM in that file.
There is no BOM (Notepad++ bottom-right encoding label shows UTF-8) which is no problem for Notepad++
but obviously it is for Excel
and when I use Notepad++'s option Encoding > Convert to UTF-8-BOM, save & close it and open the file in Excel afterwards, it correctly displays all the values, no encoding issues are left.
That leads to the following question:
Can I force java.nio.file.Files.write(...) to add a BOM when using StandardCharsets.UTF_8, or is there any other way in java.nio to achieve the desired encoding?
As far as I know, there's no direct way in the standard Java NIO library to write text files in UTF-8 with BOM format.
But that's not a problem, since the BOM is nothing but a special character at the start of a text stream, represented as \uFEFF. Just add it manually to the CSV file, for example:
List<String> lines =
Arrays.asList("\uFEFF" + "Übernahme", "Außendarstellung", "€", "#", "UTF-8?");
...
I would suggest using "\uFEFF", "Übernahme" (the BOM as a separate first element) instead of "\uFEFF" + "Übernahme".
The benefit is that the actual data in the file is not changed, because the first value does not get the BOM as a prefix.
If you use the opencsv API, with the headers on the first line and the data from the second line on, adding a "," after the BOM character keeps the first header intact, without any prefix. If the header itself were changed, you would also have to update the code that maps headers to data.
If you use a properties file for the header and data mapping, you just have to add one extra mapping there for the BOM column: "\uFEFF"=TEMP.
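Another option that stays within java.nio is to open a writer and emit the BOM before the first line, so the list of values does not have to be touched at all. This is a minimal sketch; the class and method names are made up for the example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class BomCsvWriter {

    // Writes the lines as UTF-8, preceded by the byte order mark (U+FEFF),
    // which the UTF-8 encoder turns into the bytes EF BB BF.
    static void writeWithBom(Path path, List<String> lines) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            writer.write('\uFEFF');
            for (String line : lines) {
                writer.write(line);
                writer.newLine();
            }
        }
    }
}
```

With no open options given, Files.newBufferedWriter creates the file or truncates an existing one; pass StandardOpenOption.CREATE_NEW as in the question if overwriting should fail instead.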

Pretty print for JSON in Java works fine for the console, but in browser it does not work

I have a JSON file and I want to retrieve its content from an API call within a REST controller created in Java Spring Boot.
I read the content of the .json file into a String and use the method below (one of several) to pretty print it.
If I System.out.println() the output, it is pretty printed, but in the browser it is displayed as raw text with no indentation. I tried several approaches:
String content = new String(Files.readAllBytes(resource.toPath()));
Gson gson = new GsonBuilder().setPrettyPrinting().create();
JsonParser jp = new JsonParser();
JsonElement je = jp.parse(content);
String prettyJsonString = gson.toJson(je);
System.out.println(prettyJsonString);
return prettyJsonString;
The other approach returns the same ugly output in the browser, but it also adds "\r\n":
ObjectMapper mapper = new ObjectMapper();
mapper.enable(SerializationFeature.INDENT_OUTPUT);
String prettyJsonString = mapper.writeValueAsString(content);
return prettyJsonString;
Can anyone help me get the pretty output in browser as well?
Formatting a String for console output and for HTML output are two VERY different tasks. The setPrettyPrinting() method is meant for console printing. A browser rendering HTML ignores "\n" characters and does not preserve runs of spaces; it collapses them into a single space. In general, formatting the output is usually a client-side task. But I dealt with this problem once and wrote a method that takes a console-formatted String and converts it to an HTML-formatted String. For instance, it replaces all "\n" characters with br HTML tags, and it does some other things as well. I had some success with it, but sometimes unexpected problems occurred. You are welcome to use it. The method is available in the MgntUtils open source library, which is published as a Maven artifact and on GitHub (including source code and JavaDoc). Your code would look like this:
String htmlString = TextUtils.formatStringToPreserveIndentationForHtml(jsonPrettyString);
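If the response really is rendered as HTML, a library-free alternative to replacing line breaks is to wrap the pretty-printed text in a pre element, which tells the browser to keep whitespace and newlines. This is a different technique from the MgntUtils method above, shown only as a sketch; the class and method names are made up, and it assumes the controller returns the result as text/html.

```java
final class JsonHtml {

    // Escapes the characters that are special in HTML text content and wraps
    // the result in <pre> so the browser preserves indentation and newlines.
    static String toPreformattedHtml(String prettyJson) {
        String escaped = prettyJson
                .replace("&", "&amp;")   // must come first, before new entities are added
                .replace("<", "&lt;")
                .replace(">", "&gt;");
        return "<pre>" + escaped + "</pre>";
    }
}
```

The controller would then return JsonHtml.toPreformattedHtml(prettyJsonString). If the client is a program rather than a person, returning the JSON itself with an application/json content type is usually the better choice.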
I had this same problem and stumbled upon how to get it to pretty print in the browser.
In your application.properties file, add these two lines:
# Preferred JSON mapper to use for HTTP message conversion.
spring.mvc.converters.preferred-json-mapper=gson
# Whether to output serialized JSON that fits in a page for pretty printing.
spring.gson.pretty-printing=true
Reference: https://www.callicoder.com/configuring-spring-boot-to-use-gson-instead-of-jackson/
Maybe related: https://stackoverflow.com/a/62044963

UTF8 characters showing weirdly or random basis in Android TextView

In at least 5 applications I have attempted to display UTF8 encoded characters, and every time, quite sporadically and rarely, I see random characters being replaced by diamond question marks (see image for better details).
I enclose a page layout to demonstrate my issue. The layout is very basic; it is a very simple poll I am creating. The "Съгласен съм" text is taken from a database, where it has just been inserted by a script, using a copy-pasted constant. The text is displayed in TextViews.
Has anyone ever encountered such an issue? Please advise!
EDIT: Something I forgot to mention is that the number and position of weird characters vary across different Android phone models.
Finally I got it all sorted out in all my applications. The issues actually boiled down to 3 different causes, and I will list all of them below so that my findings can help people in the future.
Reason 1: Incorrect encoding of user created file.
This was actually the problem with the application I posted about in the question. The encoding of the insert script I used to put the values into the database was "UTF8 without BOM". I converted this encoding to "UTF8" using Notepad++, reinserted the values in the database, and the issue was resolved. Thanks to #user3249477 for pointing me in this direction. By the way, "UTF8 without BOM" seems to be the default encoding Eclipse uses when creating UTF8 files, so take care!
Reason 2: Incorrect encoding of generated file.
The problem in reason 1 pointed me to what to look for in some of the other cases I was facing. In one of my applications I am provided with raw data that I insert into my backend database using a simple Java application. The problem there turned out to be that I was passing through an intermediate format: files stored on the file system that I used to verify that I had interpreted the raw data correctly. I noticed that these files were also created as "UTF8 without BOM". I used this code to write to these files:
BufferedOutputStream outputStream = new BufferedOutputStream(new FileOutputStream(outputFilePath));
writer = new BufferedWriter(new OutputStreamWriter(outputStream, STRING_ENCODING));
writer.append(string);
Which I changed to:
BufferedOutputStream outputStream = new BufferedOutputStream(new FileOutputStream(outputFilePath));
writer = new BufferedWriter(new OutputStreamWriter(outputStream, STRING_ENCODING));
// prepending a bom
writer.write('\ufeff');
writer.append(string);
This follows the advice from another answer. The line I added made all the intermediate files be encoded as "UTF8" with BOM and resolved my encoding issues.
Reason 3: Incorrect parsing of HTTP responses
The last issue I encountered in a few of my applications was that I was not interpreting the UTF8 HTTP responses correctly. I used to have the following code:
HttpResponse response = httpClient.execute(host, request, (HttpContext) null);
String responseBody = null;
responseBody = IOHelper.getInputStreamContents(responseStream);
Where IOHelper is a utility I wrote myself that reads stream contents into a String. I replaced this code with the method already provided in the Android API:
HttpResponse response = httpClient.execute(host, request, (HttpContext) null);
String responseBody = null;
if (response.getEntity() != null) {
responseBody = EntityUtils.toString(response.getEntity(), HTTP.UTF_8);
}
And this fixed the encoding issues I was having with HTTP responses.
In conclusion, I can say that one needs to take special care with BOM / without BOM strings when using UTF8 encoding in Android. I am very happy I learnt so many new things during this investigation.
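The source of IOHelper is not shown, so the following is only a sketch of how such a stream-to-String helper can be written safely, with made-up names. It collects all bytes first and decodes them once with an explicit charset. Decoding each chunk separately (for example new String(chunk, 0, n) inside the loop) can split a multi-byte UTF-8 sequence at a chunk boundary, and relying on the platform default charset can vary between devices; either would be consistent with sporadic replacement characters, though that is a guess about the original helper.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

final class StreamText {

    // Reads the whole stream into memory, then decodes the bytes in one step
    // so that no multi-byte character is cut in half by a buffer boundary.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }
}
```

A call such as StreamText.readAll(responseStream, StandardCharsets.UTF_8) assumes the server really sends UTF-8; if the response declares a charset in its Content-Type header, that one should be used instead.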

Java servlet json object containing XML, encoding problems

I have a servlet which should reply to requests with JSON of the form {obj:XML} (meaning a JSON object containing an XML document).
The XML is encoded in UTF-8 and contains characters such as पोलैंड.
The XML is held in an org.w3c.dom.Document and I am using the JSON.org library to handle the JSON. When I try to print it to the ServletOutputStream, the characters are not encoded correctly. I have tested it by writing the response to a file, but the encoding is not UTF-8.
Parser.printTheDom(documentFromInputStream,byteArrayOutputStream);
OutputStreamWriter oS=new OutputStreamWriter(servletOutputStream, "UTF-8");
oS.write((jsonCallBack+"("));
oS.write(byteArrayOutputStream);
oS.write(");");
I have even tried the previous code and the following code locally (without deploying the servlet):
oS.write("पोलैंड");
and the result is the same.
In contrast, when I print the document directly, the file is well-formed XML:
oS.write((jsonCallBack+"("));
Parser.printTheDom(documentFromInputStream,oS);
oS.write(");");
Any help?
Typically, if binary data needs to be part of an XML doc, it's base64 encoded. See this question for more details. I suggest you base64 encode the fields that can have exotic UTF-8 chars and base64 decode them on the client side.
See this question for 2 good options for base64 encoding/decoding in java.
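As a rough illustration of that suggestion, the sketch below uses java.util.Base64, which assumes Java 8 or later; the class and method names are made up for the example. The important detail is that the text is converted to bytes with an explicit UTF-8 charset before encoding, and decoded with the same charset afterwards.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class XmlBase64 {

    // Encodes the text as UTF-8 bytes and then as an ASCII-only base64 string.
    static String encode(String xml) {
        return Base64.getEncoder().encodeToString(xml.getBytes(StandardCharsets.UTF_8));
    }

    // Reverses encode(): base64 to bytes, then bytes to text using UTF-8.
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }
}
```

Because the encoded string contains only ASCII characters, it passes through writers with any ASCII-compatible encoding unchanged. The cost is that the payload grows by about a third and is no longer readable without decoding.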
