Java problems with UTF-8 on different operating systems

I'm programming an application for college homework together with other people, and sometimes we use non-English characters in comments or in Strings displayed in the views. The problem is that every one of us uses a different OS and sometimes a different IDE.
Concretely, one of us is on MacOS, another on Windows 7, and another and I are on Ubuntu Linux. Furthermore, all of them use Eclipse and I use gedit. We have no idea whether Eclipse or gedit can be configured to work properly with UTF-8; at least I haven't found anything for mine.
The fact is that what I write with non-English characters appears in the Windows and MacOS virtual machines as strange symbols, and vice versa, and sometimes what my non-Linux friends write provokes compilation warnings like this: warning: unmappable character for encoding UTF8.
Do you have any idea how to solve this? It is not really urgent, but it would be a help.
Thank you.

Not sure about gedit, but you can certainly configure Eclipse to use whatever encoding you like for source code. It's part of the project properties (and saved in the .settings directory within the project).

Eclipse works fine with UTF-8; see Michael's answer about configuring it. Maybe for Windows and/or MacOS it really is necessary. Ubuntu uses UTF-8 as the default encoding, so I don't think it's necessary to configure Eclipse there.
As for gedit, it lets you pick the character encoding in its Save As dialog, so you can save files as UTF-8 there as well.
Anyway, you need to make sure that all of you use UTF-8 for your sources. This is the only reasonable way to achieve cross-platform portability of your sources.
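If you want to see why the behaviour differs between your machines, a quick check of the JVM's default charset helps; this is just a minimal sketch, nothing project-specific assumed:

import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        // The default charset depends on the OS and locale: usually UTF-8 on Linux and MacOS,
        // but often a legacy code page such as windows-1252 on Windows.
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());
    }
}

If the output differs between the three machines, that explains the strange symbols; once everyone's editor saves UTF-8 and javac is invoked with -encoding UTF-8, the platform default no longer matters.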

You could avoid the issue in Strings by using character escape sequences and keeping the files themselves plain ASCII.
For example, an en dash can be expressed as "\u2013".
You can quickly look up the Java escape for an individual character in any Unicode character table.
As Sergey notes below, this works best for small numbers of non-ASCII characters. An alternative is to put all the UTF-8 strings in resource files; Eclipse provides a handy wizard for this.
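For illustration, a small sketch of the escape-sequence approach (the strings are made up):

public class Escapes {
    public static void main(String[] args) {
        // "\u2013" denotes the en dash; the source file itself stays pure ASCII.
        String range = "pages 10\u201315";
        System.out.println(range); // prints "pages 10-15" with a real en dash between 10 and 15
    }
}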

If your UTF-8 file contains a BOM (byte order mark) then you will have a problem: javac does not skip it and reports an illegal character error. It is a known bug.
The BOM is optional with UTF-8, and most of the time it is not there because it breaks many tools (Javadoc, XML parsers, ...).
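If you cannot control whether incoming data files carry a BOM, you can strip it yourself when reading them; a rough sketch (the class name is my own):

import java.io.*;

public class BomStrippingReader {
    // Opens a UTF-8 file and silently skips the optional BOM bytes EF BB BF, if present.
    public static Reader open(File file) throws IOException {
        PushbackInputStream in = new PushbackInputStream(new FileInputStream(file), 3);
        byte[] head = new byte[3];
        int n = in.read(head, 0, 3);
        boolean hasBom = n == 3 && (head[0] & 0xFF) == 0xEF
                                && (head[1] & 0xFF) == 0xBB
                                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            in.unread(head, 0, n); // not a BOM, push the bytes back
        }
        return new InputStreamReader(in, "UTF-8");
    }
}

For source files themselves, the simplest fix is to save them without a BOM in the first place.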

Related

IntelliJ IDEA Non-English Character Problem on Files

I am working on a Spring project; the language used in this project is Turkish, but my system's layout and my keyboard are English. When I try to open the project with IntelliJ IDEA, some characters are not displayed correctly. Here is an example:
Kullan\u0131m Detaylar\u0131
This is what I see in IntelliJ IDEA, but instead it should be like this:
Kullanım Detayları
This is only one example; some JSP files can be even worse.
One of my JSP files looks like this:
<span>Arýza Ýþlemleri</span>
But it should look like this:
<span>Arıza İşlemleri</span>
This also affects my web output. I tried to change some of the IDE properties, but nothing really changed.
It is a matter of choice, but it is best to do everything in UTF-8: Java sources, JSP sources, properties files and so on.
Ensure that the compilers (javac, the JSP compiler) also use UTF-8. This will allow special Unicode characters like bullets, emoji and many more.
You could still deliver a Turkish page in a Turkish encoding like ISO-8859-3, but UTF-8 would do too.
\u0131 is the ASCII escape for the dotless i (ı). You can use it to test that the encodings are correct.
In IntelliJ, set all the encodings in the preferences and in the project settings. If you have a build infrastructure like Maven or Gradle, set the source encoding there too, for the compiler.
Possibly check with a programmer's editor like Notepad++ that the file encoding really is what you think it is.
And last but not least, check the database: tables/columns should be stored as UTF-8.
User form entry should also accept UTF-8, so that the data does not reach the server encoded as numeric entities (which is also more efficient).
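For the form-entry part, the usual approach is a servlet filter that forces UTF-8 on every request and response; a rough sketch (the class name is my own, and the filter still has to be registered in web.xml or via @WebFilter):

import java.io.IOException;
import javax.servlet.*;

public class Utf8Filter implements Filter {
    public void init(FilterConfig config) {}

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // Decode incoming form parameters and encode the outgoing page as UTF-8.
        request.setCharacterEncoding("UTF-8");
        response.setCharacterEncoding("UTF-8");
        chain.doFilter(request, response);
    }

    public void destroy() {}
}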
Do not use Turkish characters directly; use their Unicode escapes instead. For example, for "ı" use "\u0131".

Java string encoding(getBytes not working)

I have a return message in Chinese from an API, and it is encoded in Big5.
Unfortunately, my web page uses UTF-8, so it can't be displayed properly.
I have googled this question many times and tried various getBytes calls, and I would rather not go through the file system to handle it. Can anyone suggest an effective solution?
My JDK version is 1.7, and sorry, this version can't be changed in this project.
You should be using the encoding/decoding methods of java.nio.charset.Charset: https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
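Assuming you have the raw response bytes from the API, the conversion is two steps: decode with the charset the bytes were actually written in, then re-encode as UTF-8. A minimal sketch that works on JDK 1.7 (names are illustrative):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Big5ToUtf8 {
    public static byte[] convert(byte[] big5Bytes) {
        // Decode using the charset the API actually used ...
        String text = new String(big5Bytes, Charset.forName("Big5"));
        // ... then encode the resulting String as UTF-8 for the web page.
        return text.getBytes(StandardCharsets.UTF_8);
    }
}

The key point is that calling getBytes() on a String that was already decoded with the wrong charset cannot repair it; the decoding step has to use Big5.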

using java for Arabic NLP

I'm working on Arabic natural language processing such as word stemming, tokenization etc.
In order to deal with words/characters, I need to write Arabic letters in Java. So my question is: is it good practice to write Arabic letters in Java directly, without escaping them?
example:
which one is better:
if(word.startsWith("ت"){...}
or
if(word.startsWith("\u1578"){...}
You have to write Arabic letters directly for the sake of readability. As far as the machine is concerned, there is no big difference. Also set your character encoding to UTF-8, since Arabic characters cannot be represented in ASCII.
If you are familiar with Python, the NLTK module will be of great help to you.
I would go with the real characters in your master copy, ensuring your compiler is configured for the correct encoding. You can always run it through native2ascii if you need the escaped version for any reason. Once you get going you may well find you don't actually have that many hard-coded strings in the source code, as things like gazetteer lists of potential named entities etc. are better represented as external text files.
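Reading such external lists with an explicit charset is straightforward on Java 7; a minimal sketch (the one-entry-per-line file layout is just an assumption):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class GazetteerLoader {
    // Loads one entry per line from a UTF-8 text file kept outside the source tree.
    public static List<String> load(String fileName) throws IOException {
        return Files.readAllLines(Paths.get(fileName), StandardCharsets.UTF_8);
    }
}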
GATE has a basic named entity annotation plugin for Arabic which may be a good starting point for your work (full disclosure: I'm one of the GATE core development team).

Is there a way to build UTF-8 source files using ant

I have UTF-8 string literals hardcoded in my Java files.
Eclipse builds this application correctly, so the resulting class files contain those strings in UTF-8.
But if I use my ant build.xml, the resulting class files contain strings with incorrect encoding.
I already tried adding encoding="UTF-8" to the javac task, but with no success.
How can it be fixed?
P.S. I know it is quite bad practice to have string literals hardcoded in source files, but this is a situation where I need them there, so please don't suggest extracting them to a resource bundle.
Any help is greatly appreciated
The proper way is:
<javac ... encoding="UTF-8" ... />
If the strings in the resulting class files still have the wrong encoding, then probably your sources are not actually saved as UTF-8, or these files are being compiled by some other javac task, not the one you modified.
The encoding attribute must match the actual encoding of the source files. Your guess of UTF-8 may not be correct; check other plausible encoding names like ISO-8859-1.
Please also check the upvoted answers in "How do I set -Dfile.encoding within ant's build.xml?"
This is not actually an answer to the original problem, but I don't want this question to be left unanswered. Maybe the moderators will decide to delete it. Anyway:
It looks like this is some kind of bug that only appears with a specific combination of Ant, JDK and Windows.
I dug really deep and wasn't able to fix it in, let's say, a normal way.
So I decided to externalize the strings to a properties file, which is the better practice anyway...
Normally the solution suggested by Mikhail Vladimirov should work fine, but not this time.

How to handle multiple languages in Java apps?

I am writing a program using JSP and Java. How can I use property files to support multiple languages?
And by the way, those files always contain things like \u4345.
What is this? Where does it come from?
For the multiple languages, check out the ResourceBundle class.
About the \u4345: this is one of the dark and very annoying legacy corners of Java. Property files are read as ISO-8859-1 (Latin-1), so all characters outside that range need to be encoded as \uXXXX escapes (their Unicode value). You can convert a file to this form with the native2ascii command line tool.
If you are using an IDE or a build tool, there should be an option to invoke this automatically.
If the property file is something you have full control over yourself, you can, starting from Java 6, also use UTF-8 (or any other character set) directly in the property file, and specify that encoding when you load it:
// new in Java6
props.load(new InputStreamReader(new FileInputStream(file), "UTF-8"));
Again, this only works if you load the Properties yourself, not if someone else does it, such as a ResourceBundle (used for internationalization).
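To answer the first part of the question, a minimal ResourceBundle sketch looks roughly like this (the bundle base name and key are made up for the example):

import java.util.Locale;
import java.util.ResourceBundle;

public class Messages {
    public static void main(String[] args) {
        // Picks messages.properties, messages_tr.properties, messages_zh.properties, ...
        // depending on the locale; those files use the escaped form described above.
        ResourceBundle bundle = ResourceBundle.getBundle("messages", new Locale("tr"));
        System.out.println(bundle.getString("welcome.title"));
    }
}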
There is an entire tutorial at http://java.sun.com/docs/books/tutorial/i18n/index.html
It explains everything you need to know.
The Java tutorial on i18n has been mentioned already by Peter. If you are building JSPs you probably want to look at the JSTL which basically allows you to use the functionality of ResourceBundle through JSP tags.
