Same unicode character behaves differently in different IDEs - java

When I read the following Unicode string, it comes out differently depending on where I run the program: from NetBeans it works fine, but from Eclipse or directly from CMD it does not.
After reading, the extra characters ƒÂ are added.
The string then becomes Mýxico.
The string that should be read is Mýxico. I used CSVReader to read it as follows:
sourceReader = new CSVReader(new FileReader(sourceFile));
List<String[]> data = sourceReader.readAll();
Any suggestions?

It sounds like the different editors are using different encodings: for example, one is using UTF-8 and another is using something else.
Check that the encoding settings are the same in all of the editors.

We should specify the encoding while reading the file, so the statement above should be changed as follows:
sourceReader = new CSVReader(new InputStreamReader(
        new FileInputStream(sourceFile), "UTF-8"));
List<String[]> data = sourceReader.readAll();
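The fix works because FileReader offers no way to specify a charset (it always uses the platform default), while InputStreamReader does. A minimal, self-contained sketch of the difference this makes, using only the standard library (the temp file and sample text are illustrative, not the asker's data):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodingDemo {
    public static void main(String[] args) throws IOException {
        // Write "Mýxico" to a temp file as UTF-8 bytes.
        Path file = Files.createTempFile("demo", ".csv");
        Files.write(file, "Mýxico".getBytes(StandardCharsets.UTF_8));

        // Reading with an explicit charset yields the original text on
        // every machine, regardless of the platform default encoding.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(file.toFile()), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine()); // Mýxico
        }

        // new FileReader(file) would instead decode with the platform
        // default charset, which is why the same program behaves
        // differently in NetBeans, Eclipse, and on the plain command line.
        Files.delete(file);
    }
}
```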

Related

Reading and Writing Text files with UTF-16LE encoding and Apache Commons IO

I have written an application in Java and duplicated it in C#. The application reads and writes text files with tab-delimited data to be used by HMI software. The HMI software requires UTF or ANSI encoding for the degree symbol to be displayed correctly, or I would just use ASCII, which seems to work fine. The C# application can open files saved by either with no problem.
The Java application reads files it saved perfectly, but a small problem crops up when reading files saved with C#: it throws a NumberFormatException when parsing the first character in the file to an int. This character is always a "1". I have opened both files in EditPad Lite and they appear identical, even when viewed with the encoding shown, and the encoding is UTF-16LE. I'm racking my brain on this; any help would be appreciated.
lines = FileUtils.readLines(file, "UTF-16LE");
Integer.parseInt(line[0])
I cannot see any difference between the file saved in C# and the one saved in Java
Screen Shot of Data in EditPad Lite
if(lines.get(0).split("\\t")[0].length() == 2){
lines.set(0, lines.get(0).substring(1));
}
Your .NET code is probably writing a BOM. Compliant Unicode readers strip off any BOM, since it is metadata, not part of the text data.
Your Java code explicitly specifies the byte order:
FileUtils.readLines(file, "UTF-16LE");
It's somewhat of a catch-22: if the source has a BOM, you can read it as "UTF-16". If it doesn't, you can read it as "UTF-16LE" or "UTF-16BE", provided you know which one it is.
So, either write it with a BOM and read it without specifying the byte order, or write it without a BOM and read it specifying the byte order.
With a BOM:
[C#]
File.WriteAllLines(file, lines, Encoding.Unicode);
[Java]
FileUtils.readLines(file, "UTF-16");
Without a BOM:
[C#]
File.WriteAllLines(file, lines, new UnicodeEncoding(false));
[Java]
FileUtils.readLines(file, "UTF-16LE");
In my Java code I read the file normally; I just specified the character encoding in the InputStreamReader:
File file = new File(fileName);
InputStreamReader fis = new InputStreamReader(new FileInputStream(file), "UTF-16LE");
br = new BufferedReader(fis);
String line = br.readLine();
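The BOM behaviour described in this thread is easy to verify directly. A minimal sketch (not the asker's code) that decodes the same four bytes, a UTF-16LE BOM followed by "1", with both charset names:

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        // "1" preceded by a UTF-16LE BOM (FF FE), as .NET's
        // Encoding.Unicode writes it.
        byte[] withBom = { (byte) 0xFF, (byte) 0xFE, '1', 0 };

        // "UTF-16" consumes the BOM and uses it to pick the byte order.
        String a = new String(withBom, StandardCharsets.UTF_16);
        // "UTF-16LE" keeps the BOM as an ordinary character (U+FEFF),
        // which is exactly what later breaks Integer.parseInt.
        String b = new String(withBom, StandardCharsets.UTF_16LE);

        System.out.println(a.length()); // 1
        System.out.println(b.length()); // 2, first char is '\uFEFF'

        // Defensive stripping, in the spirit of the asker's workaround:
        if (!b.isEmpty() && b.charAt(0) == '\uFEFF') {
            b = b.substring(1);
        }
        System.out.println(b.equals(a)); // true
    }
}
```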

Java's UTF-8 encoding

I have this code:
BufferedWriter w = Files.newWriter(file, Charsets.UTF_8);
w.newLine();
StringBuilder sb = new StringBuilder();
sb.append("\"").append("éééé").append("\";");
w.write(sb.toString());
But it doesn't work: in the end my file doesn't have UTF-8 encoding. I tried this when writing:
w.write(new String(sb.toString().getBytes(Charsets.US_ASCII), "UTF8"));
It made question marks appear everywhere in the file...
I found that there was a bug regarding recognition of the initial BOM character (http://bugs.java.com/view_bug.do?bug_id=4508058), so I tried using the BOMInputStream class. But bomIn.hasBOM() always returns false, so I guess my problem is not BOM-related, maybe?
Do you know how I can make my file UTF-8 encoded? Was the problem solved in Java 8?
You're writing UTF-8 correctly in your first example (although you're redundantly creating a String from a String).
The problem is that the viewer or tool you're using to look at the file doesn't read it as UTF-8.
Don't mix in ASCII; that just converts all the non-ASCII characters to question marks.
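The diagnosis can be checked from Java itself: the bytes the first example writes are valid UTF-8, and garbage like Ã© only appears when those bytes are decoded with the wrong charset. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String text = "éééé";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // Each 'é' (U+00E9) becomes the two bytes C3 A9 in UTF-8.
        System.out.println(utf8.length); // 8

        // Decoded back as UTF-8, the text round-trips cleanly.
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // éééé

        // Decoded as Latin-1, which is what a misconfigured viewer does,
        // each two-byte sequence turns into two wrong characters.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // Ã©Ã©Ã©Ã©
    }
}
```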

Handling Norwegian and Icelandic letters in Java

In Java,
I am receiving text input which contains Norwegian and Icelandic characters.
I get a stream, parse it to a String, assign it to some variables, and then create output.
When I create the output, the Norwegian and Icelandic characters get distorted into ? or ¶ etc. The output files contain the same distorted characters when opened.
I am building a web project (.war) using Maven. What basic settings are required for Icelandic/Norwegian text in the code?
I found a method for setting the locale but was unable to produce correct output with it: Locale.setDefault(new Locale("is_IS", "Iceland"));
Kindly suggest how to do it.
Actual character: HÝS048
Distorted character: HÃ?S048 (when printed directly with System.out) or H??S048 (when I get bytes from the string and build a new String using UTF-8)
Update (11:13)
I have used
CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
encoder.onMalformedInput(CodingErrorAction.REPORT);
encoder.onUnmappableCharacter(CodingErrorAction.REPORT);
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("d:\\try1.csv"),encoder));
out.write(sb.toString());
out.flush();
out.close();
Output: H�S048
Update (12:41):
While reading the stream from the HTTP source I used the following:
BufferedReader in = new BufferedReader(new InputStreamReader(apiURL.openStream(), "UTF-8"));
It shows the output perfectly on the console.
I fetched the values from the CSV and, after applying the logic, put them into a bean.
Now I need to create the CSV file, but when I get the values from the bean the text is distorted again. I am using a StringBuilder to append the bean values and write them to a file. :( Hoping for the best; looking for ideas.
The solution to this problem is to get the data in UTF-8, print it in UTF-8, and create the file in UTF-8.
Read data from the URL as below:
BufferedReader in = new BufferedReader(new InputStreamReader(apiURL.openStream(), "UTF-8"));
Then set it on beans or do whatever you want. When printing, sb.toString() can be used directly (re-encoding it with getBytes("UTF-8") and decoding it back is a no-op):
System.out.println(sb.toString());
Then, when creating the file, specify the charset explicitly (FileWriter offers no way to do this and always uses the platform default encoding):
Writer writer = new OutputStreamWriter(new FileOutputStream("d:\\try2.csv"), "UTF-8");
writer.append(sb.toString());
writer.flush();
writer.close();
This is how my problem got resolved.
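One pitfall worth spelling out: java.io.FileWriter takes no charset parameter and always uses the platform default encoding, so the same program can produce different bytes on different machines. A minimal sketch of pinning UTF-8 explicitly with OutputStreamWriter (the temp file and sample string are illustrative):

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriterCharsetDemo {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo", ".csv");

        // OutputStreamWriter lets the charset be pinned explicitly,
        // unlike FileWriter, which silently uses the platform default.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
            out.write("HÝS048");
        }

        byte[] bytes = Files.readAllBytes(file);
        // 'Ý' (U+00DD) is the two bytes C3 9D in UTF-8, so 7 bytes total.
        System.out.println(bytes.length); // 7
        System.out.println(new String(bytes, StandardCharsets.UTF_8)); // HÝS048
        Files.delete(file);
    }
}
```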

how to read utf-8 chars in opencsv

I am trying to read from a CSV file that contains UTF-8 characters. Based on Parse CSV file containing a Unicode character using OpenCSV and How read Japanese fields from CSV file into java beans? I wrote:
CSVReader reader = new CSVReader(new InputStreamReader(new FileInputStream("data.csv"), "UTF-8"), ';');
But it does not work. The text >>Sí, es nuevo<< displays correctly in Notepad, Excel, and various other text-editing tools, but when I parse the file via opencsv I get >>S�, es nuevo<< (the í is the special character, if you were wondering ;)
What am I doing wrong?
You can use UTF-16LE as the encoding; I used it to write a file with Japanese text.
Thanks aioobe. It turned out the file was not really UTF-8, despite most Windows programs showing it as such. Notepad++ was the only one that did not show the file as UTF-8 encoded, and after converting the data file the code works.
Use the code below for your issue; it might be helpful:
String value = URLEncoder.encode(msg[no], "UTF-8");
Use ISO-8859-1 (or one of the related variants: ISO-8859-2, ISO-8859-10, ISO-8859-13, ISO-8859-14, or ISO-8859-15) instead of UTF-8.
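The accepted diagnosis above (the file was not really UTF-8) is easy to reproduce: decoding Windows-1252 bytes as UTF-8 is exactly what turns í into �. A minimal sketch:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {
    public static void main(String[] args) {
        // "Sí" in a Windows-1252 (or ISO-8859-1) file stores í as the
        // single byte ED.
        byte[] windows1252 = { 'S', (byte) 0xED };

        // Decoded with the right charset, the text is intact.
        System.out.println(new String(windows1252, Charset.forName("windows-1252"))); // Sí

        // Decoded as UTF-8, the byte ED starts an incomplete multi-byte
        // sequence, so it is replaced with U+FFFD: the S� symptom from
        // the question.
        System.out.println(new String(windows1252, StandardCharsets.UTF_8)); // S�
    }
}
```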

Use cyrillic .properties file in eclipse project

I'm developing a small project and I'd like to use internationalization for it. The problem is that when I try to use a .properties file with Cyrillic symbols inside, the text is displayed as rubbish. When I hard-code the strings, they are displayed just fine.
Here is my code:
ResourceBundle labels = ResourceBundle.getBundle("Labels");
btnQuit = new JButton(labels.getString("quit"));
And in my .properties file:
quit = Изход
And I get rubbish. When I try
btnQuit = new JButton("Изход");
it is displayed correctly. As far as I am aware, UTF-8 is the encoding used for the files.
Any ideas?
AnyEdit is an Eclipse plugin that allows you to easily convert your properties files to and from Unicode-escape notation, avoiding command-line tools like native2ascii.
If you were using the Properties class alone (without a resource bundle), since Java 1.6 you have had the option of loading the file with a custom encoding, using a Reader (rather than an InputStream).
I'd guess you can also use new PropertyResourceBundle(reader), rather than ResourceBundle.getBundle(..), where reader is:
Reader reader = new BufferedReader(new InputStreamReader(
        getClass().getResourceAsStream("messages.properties"), "UTF-8"));
Properties files are ISO-8859-1 encoded by default. You must use native2ascii to convert your UTF-8 properties file into a valid ISO-8859-1 properties file containing Unicode escape sequences for all non-ISO-8859-1 characters.
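What native2ascii produces can be sketched concretely. The escaped line below is hand-written for illustration; Properties.load decodes \uXXXX escapes itself, so the file can stay pure ISO-8859-1:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class PropertiesEscapeDemo {
    public static void main(String[] args) throws IOException {
        // ISO-8859-1-safe form of:  quit = Изход
        // (roughly what native2ascii would emit for the Cyrillic value)
        String escaped = "quit = \\u0418\\u0437\\u0445\\u043E\\u0434";

        Properties p = new Properties();
        p.load(new StringReader(escaped));
        System.out.println(p.getProperty("quit")); // Изход
    }
}
```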
