IntelliJ IDEA Non-English Character Problem on Files - java

I am working on a Spring project. The language used in this project is Turkish, but my system layout and keyboard are English. When I open the project in IntelliJ IDEA, some characters are not displayed correctly. Here is an example:
Kullan\u0131m Detaylar\u0131
This is what I see in IntelliJ IDEA, but it should look like this:
Kullanım Detayları
This is only one example; some JSP files are even worse.
One of my JSP files looks like this:
<span>Arýza Ýþlemleri</span>
But it should look like this:
<span>Arıza İşlemleri</span>
This also affects my web output. I tried changing some of the IDE properties, but nothing really changed.

It is a decision, but it is best to keep everything in UTF-8: sources, JSP sources, properties files and so on.
Ensure that the compilers (javac, jspc) also use UTF-8. This allows special Unicode characters such as bullets, emoji and many more.
You could still deliver a Turkish page in a Turkish encoding like ISO-8859-9. But UTF-8 would do too.
\u0131 is the ASCII-only Unicode escape for the dotless i (ı). You can use it to test that the encodings are correct.
In IntelliJ, set the encoding in both the preferences and the project settings. If you use a build tool like Maven or Gradle, set the compiler encoding there as well.
Possibly check with a programmer's editor like Notepad++ that the encoding is correct.
And last but not least, check that the database tables/columns are stored as UTF-8.
User form entry should also accept UTF-8, so that the data is not sent to the server encoded as numeric entities (which is also more efficient).
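The garbled JSP text from the question can be reproduced in a few lines, which may help confirm what went wrong. This is only a sketch under an assumption: that the file was saved as ISO-8859-9 and then read as ISO-8859-1. The class and method names are made up for illustration, and the strings use escapes so the example itself does not depend on the source encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: reproduces the garbled text by mixing up two charsets. */
final class MojibakeDemo {

    /** Encodes the text with one charset and decodes the bytes with another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        // Assumption: the JSP was saved as ISO-8859-9 (Latin-5, Turkish).
        Charset turkish = Charset.forName("ISO-8859-9");

        // The intended heading, spelled with escapes so this file is pure ASCII.
        String original = "Ar\u0131za \u0130\u015Flemleri";

        // Reading the Turkish bytes as ISO-8859-1 gives the garbled heading.
        String garbled = reinterpret(original, turkish, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals("Ar\u00FDza \u00DD\u00FElemleri")); // true

        // Reading them with the charset they were written in restores the text.
        String restored = reinterpret(original, turkish, turkish);
        System.out.println(restored.equals(original)); // true
    }
}
```

If the first line prints true on your machine, the files are most likely Turkish-encoded and the IDE is reading them as Latin-1, so either convert them to UTF-8 or tell the IDE their real encoding.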

Do not use Turkish characters; use their Unicode escapes instead. For example, for "ı" use "\u0131".

Related

Using Freemarker to generate Java .properties files

I am currently using Freemarker to generate a number of configuration files. So far these have been either XML files or proprietary-format text files. I would now like to generate some Java .properties files but have hit a couple of issues.
The first is character encoding. As far as I can see simply adding
<#ftl encoding="8859_1">
to the start of the file should sort this out.
The second issue is the escaping of the keys and values. The keys are probably ok as I would be hardcoding these in the template anyway so I can escape them in the template. The values will be coming from my data model and so will need escaping.
I can see how I can create my own user defined directive and by installing it as a shared variable use it in my template.
Is this the best or only way to do this? I would have thought generating .properties files is something that has been tackled many times before and was hoping something may already exist before I start writing my own code.
The class java.util.Properties has several store methods for saving properties to an OutputStream or a Writer. That seems preferable to trying to adapt FreeMarker.
I don't see which charset issues are specific to generating properties files. But note that the charset of the template and the charset of the output are independent, so you might as well use the same charset for these templates as for the others (like maybe UTF-8).
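As a rough sketch of that suggestion, the snippet below lets Properties.store do the escaping instead of a template. The class name, the helper and the sample key/value are made up for illustration; it writes to an in-memory stream so it can be run as is.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

/** Hypothetical demo: java.util.Properties escapes keys and values on store. */
final class PropertiesStoreDemo {

    /** Serializes the properties in the classic ISO-8859-1 format. */
    static String storeToString(Properties props) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // null means no custom header comment; a timestamp comment is still written.
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Sample value with two dotless i characters, which are not in ISO-8859-1.
        props.setProperty("title", "Kullan\u0131m Detaylar\u0131");

        // The stored text contains the value in escaped, ASCII-only form.
        System.out.print(storeToString(props));
    }
}
```

The trade-off is that you give up control over layout and ordering of the file, which is exactly what a template would give you.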
As for escaping, always use auto-escaping if you can. In 2.3.24 that will be especially sleek, but unless you are allowed to use unreleased versions, you will have to wait for that until the end of February or so. (If you can use unreleased/unofficial versions, you can find out about the internal testing releases in the developer list archive.) Before 2.3.24, there's <#escape x as propEsc(x)>all the template content here</#escape>, where propEsc is a TemplateMethodModelEx (not a TemplateDirectiveModel) that you have added as a shared variable or similar. All ${...}-s will then be escaped automatically.

using java for Arabic NLP

I'm working on Arabic natural language processing, such as word stemming, tokenization, etc.
In order to deal with words/chars, I need to write Arabic letters in Java. So my question is: is it good practice to write Arabic letters in Java source directly, without escaping them?
example:
which one is better:
if (word.startsWith("ت")) {...}
or
if (word.startsWith("\u062A")) {...}
You should write the Arabic letters directly for the sake of readability. For the machine there is no real difference. Also set your source encoding to UTF-8, since ASCII cannot represent Arabic characters.
If you are familiar with Python, then the NLTK module will be of great help to you.
I would go with the real characters in your master copy, ensuring your compiler is configured for the correct encoding. You can always run it through native2ascii if you need the escaped version for any reason. Once you get going you may well find you don't actually have that many hard-coded strings in the source code, as things like gazetteer lists of potential named entities etc. are better represented as external text files.
GATE has a basic named entity annotation plugin for Arabic which may be a good starting point for your work (full disclosure: I'm one of the GATE core development team).
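To illustrate the point that the literal and the escape are the same thing to the machine, here is a small sketch. One detail worth noting: Java's \u escapes take four hexadecimal digits, so the letter TEH, whose code point is 1578 in decimal, is written with the hex value 062A. The class name and the sample word are made up for illustration.

```java
/** Hypothetical demo: an Arabic literal and its Unicode escape are the same string. */
final class ArabicLiteralDemo {

    /** True if the word starts with TEH (U+062A), written here as a Unicode escape. */
    static boolean startsWithTeh(String word) {
        return word.startsWith("\u062A");
    }

    public static void main(String[] args) {
        // Decimal 1578 and hexadecimal 062A are the same code point.
        System.out.println(0x062A == 1578); // true

        // The literal form; it matches only if javac reads this file as UTF-8.
        System.out.println(startsWithTeh("تمر"));
    }
}
```

If the second line prints false, the compiler is not reading the source as UTF-8, which is exactly the configuration problem the answers above warn about.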

fusion chart not supporting chinese characters

FusionCharts is not displaying Chinese characters. I am using the Spring framework and generating the XML file dynamically with dom4j. When I use Chinese characters in my chart, other characters are shown instead; with English text there is no problem. Kindly help me solve this problem.
Did you specify the proper encoding when reading or writing the file? The safest way to handle Chinese characters is to always provide an encoding (e.g. UTF-8) when reading or writing the file. Once you have a valid Java string, you don't need to worry about encoding anymore.
Anyway, can you post the Chinese characters you are dealing with? They might be special ones that need special handling.
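A minimal sketch of "always provide an encoding", using only the standard library rather than dom4j. The class name, the temp-file prefix and the two sample characters are made up for illustration; the point is simply that the same charset is named on both the write and the read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical demo: write and read a file with an explicit charset. */
final class ExplicitEncodingDemo {

    /** Writes the text to a temp file as UTF-8 and reads it back as UTF-8. */
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("chart", ".xml");
            try {
                // Never rely on the platform default charset here.
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two sample Chinese characters, written as escapes to keep this file ASCII.
        String label = "\u56FE\u8868";
        System.out.println(roundTrip(label).equals(label)); // true

        // Decoding the same bytes with the wrong charset does not give the text back.
        byte[] utf8 = label.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals(label)); // false
    }
}
```

Whatever library writes the XML, the encoding declared in the XML header also has to match the bytes actually written.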

Java problems with UTF-8 in different OS

I'm programming an application with other people as college homework, and sometimes we use non-English characters in comments or in Strings displayed in the views. The problem is that every one of us uses a different OS and sometimes a different IDE.
Concretely, one uses MacOS, another Windows 7, and another and I use Ubuntu Linux. Furthermore, all of them use Eclipse and I use gedit. We have no idea whether Eclipse or gedit can be configured to work properly with UTF-8; at least I haven't found anything for mine.
The fact is that what I write with non-English characters appears in the Windows and MacOS virtual machines as strange symbols, and vice versa. Sometimes what my non-Linux friends write provokes compilation warnings like this: warning: unmappable character for encoding UTF8.
Do you have any idea how to solve this? It is not really urgent, but it would be a help.
Thank you.
Not sure about gedit, but you can certainly configure Eclipse to use whatever encoding you like for source code. It's part of the project properties (and saved in the .settings directory within the project).
Eclipse works fine with UTF-8. See Michael's answer about configuring it. Maybe for Windows and/or MacOS it is really necessary. Ubuntu uses UTF-8 as the default encoding so I don't think it's necessary to configure Eclipse there.
As for Gedit, this picture shows that it is possible to change the encoding when saving a file in Gedit.
Anyway, you need to make sure that all of you use UTF-8 for your sources. This is the only reasonable way to achieve cross-platform portability of your sources.
You could avoid the issue in Strings by using character escape sequences, and using only ASCII encoding for the files.
For example, an en dash can be expressed as "\u2013".
You can quickly search for the Java code for individual characters here.
As Sergey notes below, this works best for small numbers of non-ASCII characters. An alternative is to put all the UTF-8 strings in resource files. Eclipse provides a handy wizard for this.
If your UTF-8 file contains a BOM (byte order mark), then you will have a problem. It is a known bug, see here and here.
The BOM is optional with UTF-8, and most of the time it is not there because it breaks many tools (like Javadoc, XML parsers, ...).
More info here.
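For files you read yourself (as opposed to sources fed to javac), a common workaround is to drop the BOM after decoding. Java's UTF-8 decoder keeps the three BOM bytes EF BB BF as the character U+FEFF at the start of the string. A minimal sketch, with a made-up class and method name:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: strip a leading UTF-8 byte order mark after decoding. */
final class BomDemo {

    /** The BOM shows up as this single character once the bytes are decoded. */
    private static final char BOM = '\uFEFF';

    /** Decodes UTF-8 bytes and removes one leading BOM if it is present. */
    static String decodeUtf8WithoutBom(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        return (!text.isEmpty() && text.charAt(0) == BOM) ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 encoding of the BOM, followed by the text "hi".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};

        System.out.println(new String(withBom, StandardCharsets.UTF_8).length()); // 3
        System.out.println(decodeUtf8WithoutBom(withBom).length());               // 2
    }
}
```

For source files the simpler fix is to configure the editor to save UTF-8 without a BOM.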

How to handle multiple languages in Java apps?

I am writing a program using JSP and Java. How can I use property files to support multiple languages?
And by the way, I keep seeing things like \u4345.
What are these? Where do they come from?
For the multiple languages, check out the ResourceBundle class.
About the \u4345, this is one of the dark and very annoying legacy corners of Java. Property files are read as ISO-8859-1 by default, so every character outside that charset needs to be encoded as \uxxxx (its Unicode value). You can convert a file to use this encoding with the native2ascii command line tool.
If you are using an IDE or a build tool, there should be an option to invoke this automatically.
If the property file is something you have full control over yourself, then starting with Java 6 you can also use UTF-8 (or any other character set) directly in the property file, and specify that encoding when you load it:
// new in Java 6: load through a Reader with an explicit charset
props.load(new InputStreamReader(new FileInputStream(file), "UTF-8"));
Again, this only works if you load the Properties yourself, not if someone else does it, such as a ResourceBundle (used for internationalization).
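To make the difference concrete, here is a small sketch that loads the same UTF-8 bytes both ways. The class and method names are made up, and an in-memory byte array stands in for the property file so the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

/** Hypothetical demo: Properties.load with a UTF-8 Reader vs. a raw InputStream. */
final class PropertiesLoadDemo {

    /** Loads the content through a Reader, so the charset is chosen explicitly. */
    static Properties loadUtf8(byte[] content) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(content), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for in-memory data
        }
        return props;
    }

    /** Loads the content through an InputStream, which is decoded as ISO-8859-1. */
    static Properties loadLegacy(byte[] content) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for in-memory data
        }
        return props;
    }

    public static void main(String[] args) {
        // A one-line property file saved as UTF-8; the value contains a dotless i.
        byte[] file = "title=Kullan\u0131m".getBytes(StandardCharsets.UTF_8);

        System.out.println(loadUtf8(file).getProperty("title").length());   // 8
        System.out.println(loadLegacy(file).getProperty("title").length()); // 9
    }
}
```

The legacy load turns the two UTF-8 bytes of the dotless i into two separate Latin-1 characters, which is why the value comes back one character longer and garbled.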
There is an entire tutorial at http://java.sun.com/docs/books/tutorial/i18n/index.html
It explains just about everything you need to know.
The Java tutorial on i18n has been mentioned already by Peter. If you are building JSPs you probably want to look at the JSTL which basically allows you to use the functionality of ResourceBundle through JSP tags.
