FusionCharts not supporting Chinese characters - Java

FusionCharts is not displaying Chinese characters. I am using the Spring framework for this task and generating the XML file dynamically with dom4j. When I use Chinese characters in my chart, it shows some other (garbled) characters; with English there is no problem. Kindly help me solve this.

Did you specify the proper encoding when reading or writing the file? The safest way to
handle Chinese characters is to always provide an explicit encoding (e.g. UTF-8) while
reading or writing the file. Once you have a valid Java string, you don't need to worry
about encoding anymore.
In any case, can you post the Chinese characters you are dealing with? They might be
special ones that need special handling.
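A minimal stdlib sketch of the advice above — an explicit charset on both sides of the file, so the platform default never gets involved (the class and method names are illustrative):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class EncodingSafeIo {
    // Write a string with an explicit charset instead of the platform
    // default, which is what usually mangles Chinese characters.
    public static void write(File file, String content) throws IOException {
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_8)) {
            w.write(content);
        }
    }

    // Read it back with the same explicit charset.
    public static String read(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
        }
        return sb.toString();
    }
}
```

With dom4j specifically, `OutputFormat.setEncoding("UTF-8")` together with an `XMLWriter` over a raw `OutputStream` does the same job and also writes a matching `<?xml ... encoding="UTF-8"?>` declaration, which the chart's XML parser relies on.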

Related

IntelliJ IDEA Non-English Character Problem on Files

I am working on a Spring project. The language used in this project is Turkish, but my system layout and my keyboard are English. When I try to open the project with IntelliJ IDEA, some characters are not shown correctly. Here is an example:
Kullan\u0131m Detaylar\u0131
This is what I see in IntelliJ IDEA, but instead it should look like this:
Kullanım Detayları
This is just one example; some JSP files can be even worse.
One of my JSP files looks like this:
<span>Arýza Ýþlemleri</span>
But it should look like this:
<span>Arıza İşlemleri</span>
This also affects my web output. I tried to change some of the IDE properties, but nothing really changed.
It is a decision, but it is best to do everything in UTF-8: Java sources, JSP sources, properties files and so on.
Ensure that the compilers (javac, jspc) also use UTF-8. This will allow special Unicode characters like bullets, emoji and many more.
You could still deliver a Turkish page in a Turkish encoding like ISO-8859-9, but UTF-8 would do too.
\u0131 (dotless i, ı) is the ASCII escape form of that character. You can use it to test that the encodings are correct.
In IntelliJ, set all the encodings in the preferences and in the project settings. If you have a build infrastructure like Maven or Gradle, set the encoding there too, for the compiler.
Possibly check with a programmer's editor like Notepad++ that the encoding is correct.
And last but not least, check the database: tables/columns should be stored as UTF-8.
User form entry should also accept UTF-8, so the data does not go to the server encoded as numeric entities (which is also more efficient).
Do not use Turkish characters directly; instead use their Unicode escapes. For example, for "ı" use "\u0131".
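Both spellings compile to the identical string, since Java resolves \uXXXX escapes before lexing. A quick self-contained check (class name is illustrative):

```java
public class EscapeDemo {
    public static void main(String[] args) {
        // "Kullan\u0131m Detaylar\u0131" is the same string as a literal
        // "Kullanım Detayları" typed into a correctly-encoded source file.
        String escaped = "Kullan\u0131m Detaylar\u0131";
        System.out.println(escaped); // "Kullanım Detayları" (if console encoding is right)
        // Index 6 holds LATIN SMALL LETTER DOTLESS I (U+0131):
        System.out.println(escaped.charAt(6) == 0x0131); // true
    }
}
```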

using java for Arabic NLP

I'm working on Arabic natural language processing such as word stemming, tokenization etc.
In order to deal with words/characters, I need to write Arabic letters in Java. So my question is: is it good practice to write Arabic letters directly in Java source, rather than as Unicode escapes?
example:
which one is better:
if (word.startsWith("ت")) {...}
or
if (word.startsWith("\u062A")) {...}
You should write the Arabic letters for the sake of readability; as far as the machine is concerned, there is no difference. Also set your source encoding to UTF-8, since Arabic characters cannot be represented in ASCII.
If you are familiar with Python, then NLTK module will be of great help to you.
I would go with the real characters in your master copy, ensuring your compiler is configured for the correct encoding. You can always run it through native2ascii if you need the escaped version for any reason. Once you get going you may well find you don't actually have that many hard-coded strings in the source code, as things like gazetteer lists of potential named entities etc. are better represented as external text files.
GATE has a basic named entity annotation plugin for Arabic which may be a good starting point for your work (full disclosure: I'm one of the GATE core development team).

Right way to deal with Unicode BOM in a text file

I am reading a text file in my program which contains the Unicode BOM character (\ufeff, decimal 65279) in places. This causes several issues in further parsing.
Right now I am detecting and filtering these characters myself but would like to know if Java standard library or Guava has a way to do this more cleanly.
There is no built in way of dealing with a (UTF-8) BOM in Java or, indeed, in Guava.
There is currently a bug report on the Guava website about dealing with a BOM in Guava IO.
There are several SO posts (here and here) on how to detect/skip the BOM while reading a file in plain Java.
Your BOM (\ufeff) seems to be UTF-16, which, according to the same Guava report, should be dealt with automatically by Java. This SO post seems to suggest the same.
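If you do end up stripping the BOM by hand, a minimal plain-Java sketch (it assumes the stream is already decoded with the correct charset, so any BOM shows up as a single '\uFEFF' char; the class name is illustrative):

```java
import java.io.*;

public class BomSkipper {
    // Returns a Reader positioned past a leading U+FEFF, if one is present.
    public static Reader skipBom(Reader in) throws IOException {
        PushbackReader pb = new PushbackReader(in, 1);
        int first = pb.read();
        if (first != -1 && first != '\uFEFF') {
            pb.unread(first); // no BOM: put the first char back
        }
        return pb;
    }
}
```

Wrap your `FileReader`/`InputStreamReader` with this once, and everything downstream never sees the BOM.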

Java problems with UTF-8 in different OS

I'm programming an application with other people for college homework, and sometimes we use non-English characters in comments or in Strings displayed in the views. The problem is that every one of us is using a different OS, and sometimes different IDEs.
Concretely, one is using macOS, another Windows 7, and another one and I use Ubuntu Linux. Furthermore, all of them use Eclipse and I use gedit. We have no idea whether Eclipse or gedit can be configured to work properly with UTF-8; at least I haven't found anything for mine.
The fact is that what I write with non-English characters appears with strange symbols in the Windows and macOS virtual machines, and vice versa; and sometimes what my non-Linux friends write provokes compilation warnings like this: warning: unmappable character for encoding UTF8.
Do you have any idea how to solve this? It is not really urgent, but it would be a help.
Thank you.
Not sure about gedit, but you can certainly configure Eclipse to use whatever encoding you like for source code. It's part of the project properties (and saved in the .settings directory within the project).
Eclipse works fine with UTF-8. See Michael's answer about configuring it. Maybe for Windows and/or MacOS it is really necessary. Ubuntu uses UTF-8 as the default encoding so I don't think it's necessary to configure Eclipse there.
As for Gedit, this picture shows that it is possible to change the encoding when saving a file in Gedit.
Anyway, you need to make sure that all of you use UTF-8 for your sources. This is the only reasonable way to achieve cross-platform portability of your sources.
You could avoid the issue in Strings by using character escape sequences, and using only ASCII encoding for the files.
For example, an en dash can be expressed as "\u2013".
You can quickly search for the Java code for individual characters here.
As Sergey notes below, this works best for small numbers of non-ASCII characters. An alternative is to put all the UTF-8 strings in resource files. Eclipse provides a handy wizard for this.
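The en-dash case from above, as an ASCII-only source file (class name is illustrative):

```java
public class AsciiOnlySource {
    public static void main(String[] args) {
        // The en dash is written as an escape, so this file compiles
        // identically no matter which encoding javac assumes for it.
        String range = "pages 10\u201315";
        System.out.println(range); // "pages 10–15" (if console encoding is right)
    }
}
```

Note that the escape must be exactly four hex digits, so `"\u201315"` parses as U+2013 followed by the literal characters "15".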
If your UTF-8 file contains a BOM (byte order mark), then you will have a problem. It is a known bug; see here and here.
The BOM is optional with UTF-8, and most of the time it is not there, because it breaks many tools (like Javadoc, XML parsers, ...).
More info here.

Java File parsing toolkit design, quick file encoding sanity check

(Disclaimer: I looked at a number of posts on here before asking, I found this one particularly helpful, I was just looking for a bit of a sanity check from you folks if possible)
Hi All,
I have an internal Java product that I have built for processing data files for loading into a database (AKA an ETL tool). I have pre-rolled stages for XSLT transformation, and doing things like pattern replacing within the original file. The input files can be of any format, they may be flat data files or XML data files, you configure the stages you require for the particular datafeed being loaded.
I have up until now ignored the issue of file encoding (a mistake, I know) because everything was working fine, in the main. However, I am now coming up against file-encoding issues. To cut a long story short: because of the way stages can be configured together, I need to detect the encoding of the input file and create a Java Reader object with the appropriate arguments. I just wanted to do a quick sanity check with you folks before I dive into something I can't claim to fully comprehend:
Adopt a standard file encoding of UTF-16 (I'm not ruling out loading double-byte characters in the future) for all files that are output from every stage within my toolkit
Use JUniversalChardet or jchardet to sniff the input file encoding
Use the Apache Commons IO library to create a standard reader and writer for all stages (am I right in thinking this doesn't have a similar encoding-sniffing API?)
Do you see any pitfalls/have any extra wisdom to offer in my outlined approach?
Is there any way I can be confident of backwards compatibility with data already loaded using my existing approach of letting the Java runtime decide the encoding (windows-1252 in practice)?
Thanks in advance,
-James
With flat character data files, any encoding detection will need to rely on statistics and heuristics (like the presence of a BOM, or character/pattern frequency) because there are byte sequences that will be legal in more than one encoding, but map to different characters.
XML encoding detection should be more straightforward, but it is certainly possible to create ambiguously encoded XML (e.g. by leaving out the encoding in the header).
It may make more sense to use encoding detection APIs to indicate the probability of error to the user rather than rely on them as decision makers.
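As a rough illustration of why only the BOM cases are certain: here is a minimal BOM-only sniffer in plain Java. A real deployment would hand the no-BOM case to a statistical detector such as juniversalchardet rather than guessing; names are illustrative, and the caller must still skip the BOM bytes themselves before decoding:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSniffer {
    // Inspect the first bytes of a file for a BOM; anything without a BOM
    // is ambiguous and falls through to the configured fallback.
    public static Charset sniff(byte[] head, Charset fallback) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;      // EF BB BF
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;   // FE FF
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;   // FF FE
        }
        return fallback; // no BOM: hand off to a statistical detector
    }
}
```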
When you transform data from bytes to chars in Java, you are transcoding from encoding X to UTF-16(BE). What gets sent to your database depends on your database, its JDBC driver and how you've configured the column. That probably involves transcoding from UTF-16 to something else. Assuming you're not altering the database, existing character data should be safe; you might run into issues if you intend parsing BLOBs. If you've already parsed files written in disparate encodings, but treated them as another encoding, the corruption has already taken place - there are no silver bullets to fix that. If you need to alter the character set of a database from "ANSI" to Unicode, that might get painful.
Adoption of Unicode wherever possible is a good idea. It may not be possible, but prefer file formats where you can make encoding unambiguous - things like XML (which makes it easy) or JSON (which mandates UTF-8).
Option 1 strikes me as breaking backwards compatibility (certainly in the long run), although it is the "right way" to go (the right way generally does break backwards compatibility), perhaps with some additional thought about whether UTF-8 would be a good choice.
Sniffing the encoding strikes me as reasonable if you have a limited, known set of encodings that you have tested to ensure your sniffer correctly distinguishes and identifies them.
Another option here is to use some form of metadata (a file naming convention, if nothing more robust is available) that lets your code know the data was provided as UTF-16 and behave accordingly; otherwise, convert it to UTF-16 before moving forward.
