Encoding PDF in UTF-8 with wkhtmltopdf in Java

I have a GWT/Ext GWT web project in the UTF-8 charset. I've also added the UTF-8 charset to the HTML file.
Additionally I need to create a PDF report, so I wrote a special servlet that takes a template HTML file, adds some information (in the local language), and converts the newly generated HTML file to a PDF file with wkhtmltopdf.
Now, when I try to convert that generated HTML file (which also has <meta charset="UTF-8">) to a PDF file, the local-language information (some Strings) that I send from the client code to the servlet is replaced in the resulting PDF file with a "?" symbol in a black rhombus.
To solve this problem I added the "--encoding utf-8" parameters to the ProcessBuilder:
private void ConvertHTMLtoPDF(String sConvertationProgramm, String sHTML, String sPDF)
{
    try {
        // --encoding tells wkhtmltopdf to interpret the input HTML as UTF-8
        ProcessBuilder pb = new ProcessBuilder(sConvertationProgramm, "--encoding", "utf-8", sHTML, sPDF);
        Process process = pb.start();
        process.waitFor();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
but all the same, there is no effect at all. Interestingly, I get those symbols only on the Tomcat server; there is no such trouble on Jetty.
So, where is the problem: in sending the localized information to the server (since other local-language information from the template HTML file is shown correctly), or in the writing/converting on the server side?
Does anyone have any suggestions? Thanks.
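(Side note: independent of the encoding question, this way of launching wkhtmltopdf has a known pitfall: if the tool writes a lot to stdout/stderr, waitFor() can block once the pipe buffer fills, and the exit code is silently discarded. A minimal sketch of a safer launch, assuming Java 7+ for inheritIO(), inside the same try/catch as above:)
ProcessBuilder pb = new ProcessBuilder(sConvertationProgramm, "--encoding", "utf-8", sHTML, sPDF);
pb.inheritIO(); // forward the tool's output to this JVM's console so the pipe never fills
int exitCode = pb.start().waitFor();
if (exitCode != 0) {
    System.err.println("wkhtmltopdf exited with code " + exitCode);
}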

I found my problem, and I must apologize to everyone: the problem was in writing the newly generated HTML. Instead of using this:
BufferedReader br = new BufferedReader(new InputStreamReader(fis));
BufferedWriter bw = new BufferedWriter(new FileWriter(fHTML));
I must use this:
BufferedReader br = new BufferedReader(new InputStreamReader(fis,"UTF8"));
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fHTML), "UTF-8"));
So the problem was in reading/writing the HTML files.
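For completeness, a minimal sketch of the corrected copy loop with both ends given an explicit charset (assuming Java 7+ and the same fis/fHTML variables as above; StandardCharsets.UTF_8 needs an import of java.nio.charset.StandardCharsets and avoids the unchecked charset-name strings):
try (BufferedReader br = new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8));
     BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fHTML), StandardCharsets.UTF_8))) {
    String line;
    while ((line = br.readLine()) != null) {
        bw.write(line);
        bw.newLine();
    }
}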

Related

Java - Why does BufferedReader(Writer) create a corrupted Excel (.xls) file, but BufferedInput(Output)Stream creates a good one?

At the company where I work, we have a job that retrieves emails, gets their attachments, and saves them. Until now it only had to work with .xml and .txt files, and it worked well.
We use the JavaMail 1.4.4 package. Existing code (modified to be simpler; don't mind the type checks):
Message message = ...;
Multipart mp = (Multipart) message.getContent();
File file = new File(newFileName);
Part part = mp.getBodyPart(indexWhereIsAttachement);
InputStream inputStream = part.getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
BufferedWriter writer = new BufferedWriter(new FileWriter(file));
// method that reads everything from reader and writes it to writer
When I use an .xls file, it doesn't work: it creates a corrupted .xls file. I can't open it with LibreOffice, nor can I open it as an Apache POI Workbook in code. But it works for .xml and .txt.
But if I do this:
...
File file = new File(newFileName);
Part part = mp.getBodyPart(indexWhereIsAttachement);
((MimeBodyPart)part).saveFile(file);
It works fine. Looking at the saveFile() method, it uses a BufferedInput(Output)Stream, so while reading the file it doesn't convert the data to characters. Is this what's causing the issues? What exactly happens that breaks everything?
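What most likely happens (the usual explanation for this symptom): .xls is a binary format, and Reader/Writer decode the bytes into characters using the platform default charset; byte sequences that are invalid in that charset are replaced, so the bytes written back no longer match the originals. Copying raw bytes, as saveFile() does, never decodes anything. A minimal sketch using the same part/file variables as above:
try (InputStream in = new BufferedInputStream(part.getInputStream());
     OutputStream out = new BufferedOutputStream(new FileOutputStream(file))) {
    byte[] buffer = new byte[8192];
    int n;
    while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n); // byte-for-byte copy, no charset involved
    }
}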

Reliance on default encoding, what should I use and why?

FindBugs reports a bug:
Reliance on default encoding
Found a call to a method which will perform a byte to String (or String to byte) conversion, and will assume that the default platform encoding is suitable. This will cause the application behaviour to vary between platforms. Use an alternative API and specify a charset name or Charset object explicitly.
I used FileReader like this (just a piece of code):
public ArrayList<String> getValuesFromFile(File file) {
    String line;
    StringTokenizer token;
    ArrayList<String> list = null;
    BufferedReader br = null;
    try {
        br = new BufferedReader(new FileReader(file));
        list = new ArrayList<String>();
        while ((line = br.readLine()) != null) {
            token = new StringTokenizer(line);
            token.nextToken();
            list.add(token.nextToken());
            ...
To correct the bug I need to change
br = new BufferedReader(new FileReader(file));
to
br = new BufferedReader(new InputStreamReader(new FileInputStream(file), Charset.defaultCharset()));
The same error occurs when I use PrintWriter. So now I have a question: when can (or should) I use FileReader and PrintWriter, if relying on the default encoding is not good practice?
And the second question: is it correct to use Charset.defaultCharset()? I chose that method to automatically pick up the charset of the user's OS.
Ideally, it should be:
try (InputStream in = new FileInputStream(file);
     Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
     BufferedReader br = new BufferedReader(reader)) {
...or:
try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
...assuming the file is encoded as UTF-8.
Pretty much every encoding that isn't a Unicode Transformation Format is obsolete for natural language data. There are languages you cannot support without Unicode.
If the file is under the control of your application, and if you want the file to be encoded in the platform's default encoding, then you can use the default platform encoding. Specifying it explicitly makes it clearer, for you and future maintainers, that this is your intention. This would be a reasonable default for a text editor, for example, which would then write files that any other editor on this platform would be able to read.
If, on the other hand, you want to make sure that any possible character can be written to your file, you should use a universal encoding like UTF-8.
And if the file comes from an external application, or is supposed to be compatible with an external application, then you should use the encoding that this external application expects.
What you must realize is that if you write a file this way on one machine, and read it this way on another machine that doesn't have the same default encoding, you won't necessarily be able to read back what you have written. Using a specific encoding such as UTF-8 for both writing and reading makes sure the file is interpreted the same way, whatever platform it was written on.
You should use the default encoding whenever you read a file that originates outside your application and can be assumed to be in the user's local encoding, for example user-written text files. You might also want to use the default encoding when writing such files, depending on what the user is going to do with the file later.
You should not use the default encoding for any other file, especially application-relevant files.
If your application, for example, writes configuration files in text format, you should always specify the encoding. In general UTF-8 is always a good choice, as it is compatible with almost everything. Not doing so might cause surprise crashes for users in other countries.
This is not limited to character encoding; the same applies to date/time, numeric, and other locale-specific formats. If you, for example, use the default encoding and default date/time strings on a US machine, then try to read that file on a German server, you might be surprised when one half is gibberish and the other half has month and day confused or is off by one hour because of daylight saving time.
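As a minimal sketch of that advice (hypothetical file name and contents; Files.newBufferedWriter is Java 7+), writing a config file with an explicit charset so it reads identically on every machine:
Path config = Paths.get("app.properties");             // hypothetical config file
try (BufferedWriter w = Files.newBufferedWriter(config, StandardCharsets.UTF_8)) {
    w.write("greeting=Grüß Gott");                     // non-ASCII text survives on any platform
    w.newLine();
}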
When you are using a PrintWriter, specify the charset explicitly:
File file = new File(file_path);
Writer w = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_16.name());
PrintWriter pw = new PrintWriter(w);
pw.println(content_to_write);
pw.close();
This will work (Java 11+, where FileReader gained a constructor that takes a Charset):
FileReader file = new FileReader(csvFile, Charset.forName("UTF-8"));
BufferedReader csvReader = new BufferedReader(file);

Windows-1250 in Eclipse Console

I have a file in Windows-1250.
I would like to print this file line by line, but in the Eclipse console I cannot see the diacritic marks.
I tried making changes in the Common tab of the run configuration, but it had no effect.
I use
BufferedReader reader = new BufferedReader(new FileReader(fileName));
Thank you in advance
Use InputStreamReader or anything that allows specifying the charset:
BufferedReader reader = new BufferedReader(new InputStreamReader(
new FileInputStream(fileName), "Windows-1250"));
Maybe try to set the encoding of the output stream like this:
PrintStream out = new PrintStream(System.out, true, "Windows-1250");
out.println(message);
Maybe this helps.
I haven't programmed in Java for a while, but maybe this class does what you need? It allows you to set the charset.
The documentation of the class you use tells you how to use it.

how to read text file on any machine in java

I am trying to read a file, but it only reads on my machine; it is not working on another machine. Here is my code:
FileInputStream fstream = new FileInputStream("/path of myfile/User.txt");
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String str;
while ((str = br.readLine()) != null) {
    System.out.println(str);
}
Please help me: how can I read the file on another machine as well? What changes should I make?
I'm guessing that you have already found a way to share the file, with HTTP, FTP, SMB, or NFS, but you're having some problems, perhaps funny characters appearing in the text. If you don't name the encoding that you want to use, the machine's default is used, and if the two machines have different defaults, you'll run into problems.
Choose an encoding for both writing and reading, for example the universal UTF-8 encoding; your source should be modified to:
BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF8"));
When you write your file you have to use the same encoding, of course, for instance:
FileOutputStream fos = new FileOutputStream("/path of myfile/User.txt");
OutputStreamWriter out = new OutputStreamWriter(fos, "UTF-8");
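Since Java 7 the same read can also be done in one call; a sketch assuming the same path and a UTF-8 file (Files.readAllLines loads the whole file into memory, so it suits small files):
List<String> lines = Files.readAllLines(Paths.get("/path of myfile/User.txt"), StandardCharsets.UTF_8);
for (String s : lines) {
    System.out.println(s);
}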
If you want to read a file that resides on another machine, you have to serve that file using some kind of network server, like an http-server or an smb-server.

Unicode in Jar resources

I have a Unicode (UTF-8 without BOM) text file within a jar, loaded as a resource.
URL resource = MyClass.class.getResource("datafile.csv");
InputStream stream = resource.openStream();
BufferedReader reader = new BufferedReader(
new InputStreamReader(stream, Charset.forName("UTF-8")));
This works fine on Windows, but on Linux it appears not to be reading the file correctly: accented characters come out broken. I'm aware that different machines can have different default charsets, but I'm giving it the correct charset. Why would it not be using it?
The reading part looks correct; I use that all the time on Linux.
I suspect you used the default encoding somewhere when you exported the text to the web page. Due to the different default encodings on Linux and Windows, you saw different results.
For example, you use the default encoding if you do anything like this in a servlet:
PrintWriter out = response.getWriter();
out.println(text);
You need to explicitly write in UTF-8, like this:
response.setContentType("text/html; charset=UTF-8");
out = new PrintWriter(
new OutputStreamWriter(response.getOutputStream(), "UTF-8"), true);
out.println(text);
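(A simpler variant, assuming the standard Servlet API: if you call setContentType before getWriter(), the container creates the writer with the declared charset for you:)
response.setContentType("text/html; charset=UTF-8"); // must be set before getWriter()
PrintWriter out = response.getWriter();              // now encodes as UTF-8
out.println(text);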
I wonder if reviewing UTF-8 on Linux would help. Could be a setup issue.
