I need to read Chinese characters through Java input fields. I have installed Chinese on my system and set the system locale to Chinese. I can type Chinese characters using a US keyboard in Notepad, but when I type in my application I get only English characters. How do I make Java input fields read Chinese characters? Please help me out.
Perhaps it has something to do with the encoding of your text file. When I have trouble with encodings, I try to create files in either Eclipse or OpenOffice, where I am able to specify the encoding. If you are on a Mac, you can also use TextEdit, or (even better) Bean.
Edit: I use UTF-8 encoding.
I have a properties file which may or may not contain Unicode-escaped characters in the values of its keys. Please see the sample below. My job is to ensure that if a value in the properties file contains a non-ASCII character, then it is Unicode-escaped. So, in the sample below, the first entry is OK, and all entries like the second one should be converted to the form of the first.
##sample.properties
escaped=cari\u00F1o
nonescaped=cariño
normal=darling
Essentially my question is: how can I differentiate in Java between cari\u00F1o and cariño, since as far as Java is concerned it treats them as identical?
Properties files in Java must be saved in the ISO-8859-1 character set for Java to read them properly. That means that it is possible to use special characters from Western European languages without escaping them. It is not possible to use characters from other languages, such as those from Eastern Europe, Russia, or China, without escaping them.
As such, there are only a few non-ASCII characters that can appear in a properties file without being escaped.
To detect whether characters have been escaped or not, you will need to open the properties file directly, rather than through the Properties class. The Properties class does all the unescaping for you when you load a file through it. You should open the file as an InputStream, using a FileInputStream or ClassLoader.getResourceAsStream. Once you do so, you can scan through the input stream one byte at a time and ensure that all bytes are in the 0x20-0x7E range, plus the newlines \r and \n, which is the ASCII range of characters you would expect in a properties file.
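A minimal sketch of that byte scan (the file name is the one from the question's sample; I also allow tabs, which are legal in properties files):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class EscapeCheck {
    public static void main(String[] args) throws IOException {
        // Read the raw bytes; the Properties class would unescape them before we could look.
        try (InputStream in = new FileInputStream("sample.properties")) {
            int b;
            int offset = 0;
            while ((b = in.read()) != -1) {
                boolean ok = (b >= 0x20 && b <= 0x7E) || b == '\r' || b == '\n' || b == '\t';
                if (!ok) {
                    System.out.println("Unescaped non-ASCII byte 0x"
                            + Integer.toHexString(b) + " at offset " + offset);
                }
                offset++;
            }
        }
    }
}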
I would suggest that your translators don't try to write properties files directly. They should provide you with documents, such as spreadsheets, that you convert into properties files. Or they could use a translation editor such as Attesoro (which I wrote) to let them save the properties files properly escaped.
You could simply use the native2ascii tool, which performs exactly this conversion (it will convert all non-ASCII characters to escapes but leave existing escapes intact).
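For example (the file names are illustrative; note that native2ascii was removed from the JDK in Java 9, so check your JDK version):

native2ascii -encoding UTF-8 sample.properties sample-escaped.properties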
Your problem is that the Java Properties class decodes properties files assuming ISO-8859-1 encoding and parses escaped Unicode characters.
So from a Properties point of view, these two strings are indeed the same.
I believe if you need to differentiate these two, you will need to write your own parser.
It's actually a feature that, by default, you do not need to care about this. The one thing that strikes me as most odd is that the (only) supported encoding is ISO-8859-1, probably for historical reasons.
The library ICU4J seems to be what you're looking for. See the Normalization page.
Is there a possibility to do that in Eclipse? I have a lot of non-UTF-8 characters, like sch�ma or propri�t� (it's French :)). For now, I am deleting those characters by hand. How can I remove them?
Those characters are in the UTF-8 character set.
Either the text is encoded incorrectly or you have your file encoding set incorrectly in Eclipse.
Try right-clicking the file -> Properties. Then check that the text file encoding is set to UTF-8; if it's not, select Other and change it to UTF-8.
I would write a little program that reads a file, removes all chars > 127, and writes back to the file.
[I would pass the file names as command line arguments]
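A minimal sketch of such a program (it rewrites each file in place and assumes the files fit in memory; the class name is mine):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StripNonAscii {
    public static void main(String[] args) throws IOException {
        for (String name : args) {                 // file names as command line arguments
            Path path = Paths.get(name);
            byte[] bytes = Files.readAllBytes(path);
            StringBuilder kept = new StringBuilder();
            for (byte b : bytes) {
                int c = b & 0xFF;                  // unsigned byte value
                if (c <= 127) {                    // keep ASCII only
                    kept.append((char) c);
                }
            }
            Files.write(path, kept.toString().getBytes(StandardCharsets.US_ASCII));
        }
    }
}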
My client uses InputStreamReader/BufferedReader to fetch text from the Internet.
However, when I save the text to a *.txt file, it shows extra weird symbols like 'Â'.
I've tried converting the string to ASCII, but that messes up å, ä, ö, and Ø, which I use.
I've tried food = food.replace("Â", ""); and indexOf(),
but the string won't find it, even though it's there in a hex editor.
So, in summary: when I use text.setText(...) on Android, the output looks fine with NO weird symbols, but when I save the text to a *.txt file I get about 4 'Â' characters. I do not want ASCII, because I use other non-ASCII characters.
The 'Â' is displayed as whitespace on my Android device and in Notepad.
Thanks!
Have a great weekend!
EDIT:
Solved it by removing all non-breaking spaces (note that replaceAll returns a new string, so the result must be assigned):
myString = myString.replaceAll("\\u00a0", " ");
You say that you are fetching like this:
in = new BufferedReader(new InputStreamReader(url.openStream(),"UTF-8"));
There is a fair chance that the stuff you are fetching is not encoded in UTF-8.
You need to call getContentType() on the HttpURLConnection object, and if it is non-null, extract the encoding and use it when you create the InputStreamReader. Only assume "UTF-8" if the response doesn't supply a content type with a valid encoding.
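A sketch of that approach (the URL and the crude split-based content-type parsing are illustrative only):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchWithCharset {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/page");      // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String contentType = conn.getContentType();        // e.g. "text/html; charset=ISO-8859-1"
        String charset = "UTF-8";                          // fallback only
        if (contentType != null) {
            for (String param : contentType.split(";")) {  // look for a charset parameter
                param = param.trim();
                if (param.toLowerCase().startsWith("charset=")) {
                    charset = param.substring("charset=".length());
                }
            }
        }
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), charset));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}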
On reflection, while you SHOULD pay attention to the content type returned by the server, the real problem is either in the way that you are writing the *.txt file, or in the display tool that is showing strange characters.
It is not clear what encoding you are using to write the file. Perhaps you have chosen the wrong one.
It is possible that the display tool is assuming that the file has a different encoding. Maybe it detects that a file is UTF-8 or UTF-16 if there is a BOM.
It is possible that the display tool is plain broken, and doesn't understand non-breaking spaces.
When you display files using a hex editor, it is most likely using an 8-bit character set to render bytes, and that character set is most likely Latin-1. But apparently, the file is actually encoded differently.
Anyway, the approach of replacing non-breaking spaces is (IMO) a hack, and it won't deal with other stuff that you might encounter in the future. So I recommend that you take the time to really understand the problem, and fix it properly.
Finally, I think I understand why you might be getting 'Â' characters. The Unicode NON-BREAKING SPACE character is \u00a0. When you encode that as UTF-8, you get C2 A0. But C2 in Latin-1 is CAPITAL-A-CIRCUMFLEX ('Â'), and A0 in Latin-1 is NON-BREAKING SPACE. So the "confusion" is most likely that your program is writing the *.txt file in UTF-8 and the tool is reading it as Latin-1.
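A tiny demonstration of that round trip (class and variable names are mine):

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String nbsp = "\u00a0";                               // NON-BREAKING SPACE
        byte[] utf8 = nbsp.getBytes(StandardCharsets.UTF_8);  // the bytes C2 A0
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(misread);                          // prints 'Â' followed by a NBSP
    }
}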
I am testing a piece of Java code and need to create an array of strings. These strings are words in different languages, including those like Arabic with the right-to-left reading order (don't know if that matters...)
So I need to do something like this:
ArrayList<String> words = ...
words.add(<word-in-english>);
words.add(<word-in-chinese>);
words.add(<word-in-russian>);
words.add(<word-in-arabic>);
What's the best way to put these into my Java code? Is there a way to do it other than using "\u" escape for every character in a string? Thanks
You can set the encoding of your editor/IDE to UTF-8, and of the Java compiler too. For international projects this is becoming more and more of a convention.
Unfortunately, you may need to set your IDE font to a full Unicode font, which might be 35 MB or so. Or, for missing Chinese glyphs, use \uXXXX escaping via native2ascii.
Depending on your sources, you might use one file per language.
In order for it to work, you must do these two things:
Save the source file in Unicode format (UTF-8). How to do this is IDE/Text Editor dependent.
Compile the file by specifying the UTF-8 charset. Like this:
javac -encoding utf-8 MyFile.java
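Put together, a minimal example (the words are illustrative stand-ins for the question's placeholders; the source file itself must be saved as UTF-8):

import java.util.ArrayList;
import java.util.List;

public class Words {
    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        words.add("hello");    // English
        words.add("你好");      // Chinese
        words.add("привет");   // Russian
        words.add("مرحبا");     // Arabic (right-to-left)
        for (String w : words) {
            System.out.println(w);
        }
    }
}

Compiled with javac -encoding utf-8 Words.java, the literals survive intact.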
As far as I know, there is no problem putting any Unicode characters into your Java code, including RTL languages. It depends a little on your IDE, but I believe all modern IDEs support RTL typing. At least Eclipse does.
You have to save your source code using the UTF-8 charset. Again, it depends on your IDE. In Eclipse, right-click the file, then choose Properties -> Resource and change its encoding to UTF-8.
Sometimes it is just not convenient to type RTL text in an IDE. In that case, type the text using another program (MS Word, Notepad, etc.) and then copy and paste it into your Java code.
BTW, think about storing Unicode strings in a separate resource file. It is usually more convenient.
Shouldn't something like this work?
BufferedReader bufReader =
    new BufferedReader(
        new InputStreamReader(new FileInputStream(file_name), "UTF-16"));
Pay attention to the "UTF-16" when dealing with non-English filenames.
The problem is that my program cannot guarantee those directories and filenames are in English; if some filenames use Japanese or Chinese characters, it will display characters like '?'.
Can anybody suggest what I need to do to access non-English filenames?
The problem is that my program cannot guarantee those directories and filenames are in English. If a filename uses Japanese or Chinese characters, it will display characters like '?'.
The problem is apparently that "it" is using the wrong character set to display the filenames. The solution depends on whether "it" is your program (via a GUI), some other application, the command shell / terminal emulator, or the user's web browser. If you could provide more information, maybe I could offer some suggestions.
But turning the characters into underscores is most likely a bad solution. It is liable to lead to filename clashes, and those Chinese / Japanese / etc characters are most likely meaningful to the people who created the files.
By the way, the correct term for "english" letters is Latin.
EDIT
For your use-case, you don't need to store the PDF file using a filename that bears any relation to the supplied filename. I suggest that you try to solve the problem by using a filename consisting of Latin letters and digits generated from (say) System.currentTimeMillis(). If that fails, then your real problem has nothing to do with the filenames at all.
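For example (a fragment; uploadDir is an assumed java.io.File pointing at your storage directory):

String safeName = "attachment-" + System.currentTimeMillis() + ".pdf";
File target = new File(uploadDir, safeName);   // ASCII-only generated name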
EDIT 2
You ask about the statement
if (fileName.startsWith("=?iso-8859"))
This seems to be trying to unpick a filename in MIME encoded-word format; see RFC 2047, Section 2.
Firstly, I think that code may be unnecessary. The javadoc is not specific, but I think that the Part.getFileName() method should deal with decoding of the filename.
Second, if the decoding is necessary, then you are going about it the wrong way. The stuff after the charset cannot simply be treated as the value of the filename. Look at the RFC.
Third, if you need to, you should use the relevant MimeUtility methods to decode "word" tokens ... like the filename (see the sketch below).
Fourthly, ISO-8859-1 is NOT a suitable encoding for characters in non-Latin character sets.
Finally, examine the raw email headers of the emails that you are trying to decode and look for the header line that starts
Content-Disposition: attachment; filename=...
If the filename looks like "=?iso-8859-1?...", and the filename is supposed to contain Japanese / Chinese / etc. characters, then the problem is in the client (or whatever) that constructed the email. The character set needs to be "utf-8" or one of the other multibyte character sets.
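As for the MimeUtility route, a sketch (a fragment; part is assumed to be your javax.mail.Part, and decodeText throws UnsupportedEncodingException):

import javax.mail.internet.MimeUtility;

String rawName = part.getFileName();        // may arrive in =?charset?enc?...?= form
String fileName = (rawName == null)
        ? null
        : MimeUtility.decodeText(rawName);  // decodes RFC 2047 encoded-words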
Java uses Unicode natively - you don't need to replace special characters, as Unicode has no special characters - every code point is treated equally. Your replaceSpChars() may be the culprit here.