Unicode in Jar resources - java

I have a Unicode (UTF-8 without BOM) text file within a jar, that's loaded as a resource.
URL resource = MyClass.class.getResource("datafile.csv");
InputStream stream = resource.openStream();
BufferedReader reader = new BufferedReader(
new InputStreamReader(stream, Charset.forName("UTF-8")));
This works fine on Windows, but on Linux it appears not to be reading the file correctly: accented characters come out broken. I'm aware that different machines can have different default charsets, but I'm giving it the correct charset. Why would it not be using it?

The reading part looks correct, I use that all the time on Linux.
I suspect you used the default encoding somewhere when you output the text to the web page. Because the default encoding differs between Linux and Windows, you saw different results.
For example, you rely on a default encoding if you do something like this in a servlet:
PrintWriter out = response.getWriter();
out.println(text);
You need to write explicitly in UTF-8, like this:
response.setContentType("text/html; charset=UTF-8");
out = new PrintWriter(
new OutputStreamWriter(response.getOutputStream(), "UTF-8"), true);
out.println(text);
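To see that the writer's charset decides the bytes sent, independently of how the file was read, here is a minimal sketch. It uses a ByteArrayOutputStream as a stand-in for response.getOutputStream() (an assumption, so the example runs without a servlet container); the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetWriterDemo {

    // Writes text through a PrintWriter bound to the given charset and
    // returns the raw bytes that would be sent to the client.
    static byte[] encode(String text, Charset charset) {
        // Stand-in for response.getOutputStream() (assumption for this sketch).
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(sink, charset))) {
            out.print(text);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", written with an escape so the source file encoding does not matter
        System.out.println("UTF-8 length:      " + encode(text, StandardCharsets.UTF_8).length);
        System.out.println("ISO-8859-1 length: " + encode(text, StandardCharsets.ISO_8859_1).length);
    }
}
```

The same four characters become five bytes in UTF-8 and four in ISO-8859-1, so a browser told to expect one encoding will misread the other.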

I wonder if reviewing UTF-8 on Linux would help. Could be a setup issue.

Related

Different character after export application on Eclipse

When I run my application on Eclipse, I can see the correct latin character, like this:
But when I export to runnable jar file and execute it, the special character is wrong, like this:
I have no idea why this happens. On Mac it's OK both in Eclipse and as a .jar file, but on Windows it's not.
I get the data from webserver and I show in a JavaFX ListView.
The text is a String encoded as UTF-8 bytes that are then decoded with some Windows encoding.
My guess is that you did this:
URL url = ...
BufferedReader in = new BufferedReader(
new InputStreamReader(url.openStream()));
Whereas you should have done this:
URL url = ...
BufferedReader in = new BufferedReader(
new InputStreamReader(url.openStream(),
StandardCharsets.UTF_8));
The InputStreamReader constructor without a Charset uses the current platform default encoding, which is wrong here.
For any URL you could first call openConnection and try to work out the delivered encoding. The strategy is a bit roundabout:
Read the charset parameter from connection.getContentType(). (getContentEncoding() reports the Content-Encoding header, such as gzip, not the charset.)
If no charset is given, the default is ISO-8859-1.
When the result is ISO-8859-1, use Windows-1252 instead, as browsers do that too.
Java keeps text as Unicode in String and char, so all scripts can be handled at the same time.
Binary data (byte[], InputStream, OutputStream) needs the charset/encoding specified whenever it is converted from or to text.
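The fallback strategy above could be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical, and the header parsing is deliberately simple (it does not handle every quoting form a server might send).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=UTF-8". Falls back to ISO-8859-1 when the
    // parameter is missing or unknown, then substitutes windows-1252
    // for ISO-8859-1 if the JVM supports it.
    static Charset fromContentType(String contentType) {
        Charset result = StandardCharsets.ISO_8859_1;
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        result = Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: keep the fallback.
                    }
                }
            }
        }
        if (result.equals(StandardCharsets.ISO_8859_1) && Charset.isSupported("windows-1252")) {
            result = Charset.forName("windows-1252");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=UTF-8"));
        System.out.println(fromContentType("text/html"));
    }
}
```

With a real connection this would be called as fromContentType(connection.getContentType()), and the result passed to the InputStreamReader constructor.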

How to convert strange character from web page?

In the web page, it is "Why don't we" as follows:
But when I parse the webpage and save it to a text file, it becomes this under eclipse:
Why don鈥檛 we
More information about my implementation:
The webpage is: utf-8
I use jSoup to parse, the file is saved as a txt.
I use FileWriter f = new FileWriter() to write to file.
UPDATE:
I actually solved the display problem in Eclipse by changing Eclipse's encoding to UTF-8.
FileWriter is a utility class that uses the current platform default encoding. That is non-portable, and probably incorrect here.
BufferedWriter f = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(file), StandardCharsets.UTF_8));
f.write("\uFEFF"); // Redundant BOM character, which may be written to make sure
// the text is read as UTF-8
...
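As a small self-contained check of the same idea, the sketch below writes a string containing the curly apostrophe to a temporary file as UTF-8 and reads it back as UTF-8. It uses java.nio.file.Files rather than the stream classes above, and the class name and temp-file prefix are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8FileRoundTrip {

    // Writes text to a temporary file as UTF-8, reads it back as UTF-8,
    // and returns what was read. The platform default encoding is never used.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("utf8-demo", ".txt");
            try {
                try (BufferedWriter w = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                    w.write(text);
                }
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "Why don\u2019t we"; // U+2019 is the curly apostrophe
        System.out.println(original.equals(roundTrip(original)));
    }
}
```

If the text survives this round trip but still looks wrong in an editor, the file is fine and the viewer is decoding it with a different encoding, which matches the Eclipse setting mentioned in the update.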

Reliance on default encoding, what should I use and why?

FindBugs reports a bug:
Reliance on default encoding
Found a call to a method which will perform a byte to String (or String to byte) conversion, and will assume that the default platform encoding is suitable. This will cause the application behaviour to vary between platforms. Use an alternative API and specify a charset name or Charset object explicitly.
I used FileReader like this (just a piece of code):
public ArrayList<String> getValuesFromFile(File file){
String line;
StringTokenizer token;
ArrayList<String> list = null;
BufferedReader br = null;
try {
br = new BufferedReader(new FileReader(file));
list = new ArrayList<String>();
while ((line = br.readLine())!=null){
token = new StringTokenizer(line);
token.nextToken();
list.add(token.nextToken());
...
To correct the bug I need to change
br = new BufferedReader(new FileReader(file));
to
br = new BufferedReader(new InputStreamReader(new FileInputStream(file), Charset.defaultCharset()));
The same error occurred when I used PrintWriter. So now I have a question: when can (or should) I use FileReader and PrintWriter, if it's not good practice to rely on the default encoding?
And the second question: is it proper to use Charset.defaultCharset()? I decided to use this method to automatically determine the charset of the user's OS.
Ideally, it should be:
try (InputStream in = new FileInputStream(file);
Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
BufferedReader br = new BufferedReader(reader)) {
...or:
try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
...assuming the file is encoded as UTF-8.
Pretty much every encoding that isn't a Unicode Transformation Format is obsolete for natural language data. There are languages you cannot support without Unicode.
If the file is under the control of your application, and if you want the file to be encoded in the platform's default encoding, then you can use the default platform encoding. Specifying it explicitly makes it clearer, for you and future maintainers, that this is your intention. This would be a reasonable default for a text editor, for example, which would then write files that any other editor on this platform would be able to read.
If, on the other hand, you want to make sure that any possible character can be written to your file, you should use a universal encoding like UTF-8.
And if the file comes from an external application, or is supposed to be compatible with an external application, then you should use the encoding that this external application expects.
What you must realize is that if you write a file the way you're doing on one machine, and read it the same way on another machine that doesn't have the same default encoding, you won't necessarily be able to read what you have written. Using a specific encoding such as UTF-8 for both writing and reading makes sure the file is always interpreted the same way, whatever platform is used.
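A minimal sketch of that failure mode, with the two machines' default encodings replaced by two explicit charsets (UTF-8 and ISO-8859-1 are chosen only for illustration; the class and method names are hypothetical):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {

    // Encodes text with one charset and decodes the bytes with another,
    // mimicking a file written on one machine and read on a different one.
    static String writeThenRead(String text, Charset writeCharset, Charset readCharset) {
        byte[] onDisk = text.getBytes(writeCharset);
        return new String(onDisk, readCharset);
    }

    public static void main(String[] args) {
        String text = "\u00e9t\u00e9"; // "été"
        // Same charset on both sides: the text survives.
        System.out.println(writeThenRead(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        // Different charsets: each accented letter turns into two wrong characters.
        System.out.println(writeThenRead(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```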
You should use default encoding whenever you read a file that is outside your application and can be assumed to be in the user's local encoding, for example user written text files. You might want to use the default encoding when writing such files, depending on what the user is going to do with that file later.
You should not use default encoding for any other file, especially application relevant files.
If your application, for example, writes configuration files in text format, you should always specify the encoding. In general UTF-8 is a good choice, as it is compatible with almost everything. Not doing so might cause surprise crashes for users in other countries.
This is not limited to character encoding; it applies as well to date/time, numeric, and other language-specific formats. If you, for example, use the default encoding and default date/time strings on a US machine, then try to read that file on a German server, you might be surprised that one half is gibberish and the other half has months and days confused or is off by one hour because of daylight saving time.
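The numeric side of this can be illustrated with a short sketch. The locales here are passed explicitly to stand in for the defaults of a US machine and a German server; the class and method names are hypothetical.

```java
import java.util.Locale;

final class LocaleFormatDemo {

    // Formats a number with two decimals using the given locale's conventions.
    static String formatAmount(double value, Locale locale) {
        return String.format(locale, "%.2f", value);
    }

    public static void main(String[] args) {
        System.out.println(formatAmount(1234.5, Locale.US));      // decimal point
        System.out.println(formatAmount(1234.5, Locale.GERMANY)); // decimal comma
        // For files that another machine must parse, a fixed locale such as
        // Locale.ROOT keeps the output independent of the machine's settings.
        System.out.println(formatAmount(1234.5, Locale.ROOT));
    }
}
```

A value written with the German convention cannot be read back with Double.parseDouble, which is one way such a file ends up half unreadable on another machine.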
When you are using a PrintWriter, wrap an OutputStreamWriter that names the charset explicitly:
File file = new File(file_path);
Writer w = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_16.name());
PrintWriter pw = new PrintWriter(w);
pw.println(content_to_write);
pw.close();
This will also work (FileReader accepts a Charset since Java 11):
FileReader file = new FileReader(csvFile, Charset.forName("UTF-8"));
BufferedReader csvReader = new BufferedReader(file);

Character encoding via JDBC/ODBC/Microsoft Access

I'm connecting via JDBC/ODBC to Microsoft Access successfully. After that, I run a query to select rows from Microsoft Access, and I write the results to a TXT file. Everything is OK, except that some strings include accents, and these appear as '?' in the TXT file. I have already tried various ways of writing files in Java, such as PrintWriter, FileWriter, OutputStream, and others, including adding a character encoding parameter (UTF-8 or ISO-8859-1) to some of these methods. I need help finding a way to write these characters correctly. Thanks.
Try the lines below:
String OUTPUTFILE = "PATH/TO/FILE/";
BufferedWriter bf = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(OUTPUTFILE),"UTF8"));
Once you add that to your code you should be fine using bf.write("VALUE") to write UTF-8 characters to your file. Also make sure your text editor's encoding is set to Unicode or UTF-8; if you don't, it might seem like the whole process didn't work, which would lead to even more confusion.
Edited:
To read UTF-8 text files:
String INPUTFILE = "PATH/TO/File";
BufferedReader in = new BufferedReader(
new InputStreamReader(
new FileInputStream(INPUTFILE), "UTF8"));
Then, to read a line: String str = in.readLine();
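One common way a literal '?' ends up in output is encoding a string with a charset that has no mapping for some of its characters. Whether that is what happens in this JDBC/ODBC setup is an assumption; the sketch below only shows the mechanism, using hypothetical class and method names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class UnmappableCharDemo {

    // Encodes the text with the given charset and decodes it again with the
    // same charset. Characters the charset cannot represent are replaced
    // during encoding, so they come back as '?'.
    static String encodeAndDecode(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"
        System.out.println(encodeAndDecode(text, StandardCharsets.US_ASCII)); // accent is lost
        System.out.println(encodeAndDecode(text, StandardCharsets.UTF_8));    // accent survives
    }
}
```

If the strings already contain '?' when they come out of the ResultSet, the replacement happened before the file was written, and no writer setting will restore the accents.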

How can i read a Russian file in Java?

I tried specifying UTF-8 for this, but it didn't work. What should I do to read a Russian file in Java?
FileInputStream fstream1 = new FileInputStream("russian.txt");
DataInputStream in = new DataInputStream(fstream1);
BufferedReader br = new BufferedReader(new InputStreamReader(in,"UTF-8"));
If the file is from a Windows PC, try either "windows-1251" or "Cp1251" for the charset name.
If the file is somehow in the MS-DOS encoding, try using "Cp866".
Both of these are single-byte encodings, and declaring the file to be UTF-8 (which is multibyte) does nothing useful.
If all else fails, use a hex editor and add a few lines of hex dump from the file to your question. Then we can detect the encoding.
As others mentioned you need to know how the file is encoded. A simple check is to (ab)use Firefox as an encoding detector: answer to similar question
If this is a display problem, it depends what you mean by "reads": in the console, in some window? See also How can I make a String with cyrillic characters display correctly?
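A rough programmatic check is to decode the bytes strictly and see whether the decoder reports an error. The sketch below is only a heuristic with hypothetical names; the sample bytes are the word "Привет" as it would be stored in windows-1251.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecodeCheck {

    // Returns true if the bytes decode under the given charset without any
    // malformed or unmappable sequences being reported.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Six bytes spelling a Russian word in windows-1251 (sample data for this sketch).
        byte[] cp1251 = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8, (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};
        System.out.println("valid UTF-8? " + decodesCleanly(cp1251, StandardCharsets.UTF_8));
        System.out.println(new String(cp1251, Charset.forName("windows-1251")));
    }
}
```

A failed strict UTF-8 decode is strong evidence that the file is not UTF-8. The reverse does not hold for single-byte encodings such as windows-1251 or Cp866, which accept almost any byte sequence, so telling those apart still requires looking at the decoded text.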
