I am trying to read from a CSV file. The file contains UTF-8 characters. So, based on "Parse CSV file containing a Unicode character using OpenCSV" and "How read Japanese fields from CSV file into java beans?", I just wrote:
CSVReader reader = new CSVReader(new InputStreamReader(new FileInputStream("data.csv"), "UTF-8"), ';');
But it does not work. The text >>Sí, es nuevo<< displays correctly in Notepad, Excel, and various other text editors, but when I parse the file via opencsv I get >>S�, es nuevo<< (the í is the special character, if you were wondering ;))
What am I doing wrong?
You can use the encoding UTF-16LE; I used it to write a file for Japanese.
Thanks aioobe. It turned out the file was not really UTF-8, despite most Windows programs showing it as such. Notepad++ was the only one that did not report the file as UTF-8 encoded, and after converting the data file, the code above works.
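In case it helps anyone else, here is a minimal sketch of that conversion done in code (assuming the real source encoding is windows-1252; substitute whatever encoding Notepad++ reports for your file):

import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ToUtf8 {
    public static void main(String[] args) throws IOException {
        Charset source = Charset.forName("windows-1252"); // assumption: the detected source encoding
        try (Reader in = new InputStreamReader(new FileInputStream("data.csv"), source);
             Writer out = new OutputStreamWriter(new FileOutputStream("data-utf8.csv"), StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            // Decode with the real source encoding, re-encode as UTF-8.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}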
Use the code below for your issue; it might be helpful to you:
String value = URLEncoder.encode(msg[no], "UTF-8");
Thanks,
Yash
Use ISO-8859-1, ISO-8859-2, ISO-8859-10, ISO-8859-13, ISO-8859-14, or ISO-8859-15 instead of UTF-8.
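If the file really is Latin-1 rather than UTF-8, reading it would look like this (same two-argument CSVReader constructor as above; only correct if the file actually uses that encoding):

CSVReader reader = new CSVReader(
        new InputStreamReader(new FileInputStream("data.csv"), "ISO-8859-1"), ';');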
I am using a Spring Boot REST API to upload a CSV file as a MultipartFile. The CSVFormat class of org.apache.commons.csv is used to format the MultipartFile, CSVParser is used to parse it, and the iterated records are stored in the MySQL database:
csvParser = CSVFormat.DEFAULT
.withDelimiter(separator)
.withIgnoreSurroundingSpaces()
.withQuote('"')
.withHeader(CsvHeaders.class)
.parse(new InputStreamReader(csvFile.getInputStream()));
My observation is that when the CSV file is uploaded with a charset of UTF-8, it works fine. But if the CSV file is in a different encoding (ANSI etc.), German and other language characters are decoded to random symbols.
For example, äößü become ����
I tried the below to specify the encoding explicitly, but it did not work either:
csvParser = CSVFormat.DEFAULT
.withDelimiter(separator)
.withIgnoreSurroundingSpaces()
.withQuote('"')
.withHeader(CsvHeaders.class)
.parse(new InputStreamReader(csvFile.getInputStream(), StandardCharsets.UTF_8));
Can you please advise? Thank you so much in advance.
What you did, new InputStreamReader(csvFile.getInputStream(), StandardCharsets.UTF_8), tells the CSV parser that the content of the input stream is UTF-8 encoded.
Since UTF-8 is (usually) the default encoding, this is effectively the same as using new InputStreamReader(csvFile.getInputStream()).
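As a quick check, a minimal snippet to print what the platform default actually is on your server:

import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        // new InputStreamReader(in) without an explicit charset uses this value.
        System.out.println(Charset.defaultCharset()); // e.g. UTF-8 or windows-1252
    }
}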
If I understand your question correctly, this is not what you intended. Instead, you want to automatically choose the right encoding based on the import file, right?
Unfortunately, the CSV format does not store which encoding was used.
There are some libraries you could use to guess the most probable encoding based on the characters contained in the file. While they are pretty accurate, they are still guessing and there is no guarantee that you will get the right encoding in the end.
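For illustration, a minimal sketch using juniversalchardet as one such detection library (the library choice is my assumption, not something from the question):

import java.io.IOException;
import java.io.InputStream;
import org.mozilla.universalchardet.UniversalDetector;

public class EncodingGuesser {
    // Returns the most probable charset name, or null if detection failed.
    static String guessEncoding(InputStream in) throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        byte[] buf = new byte[4096];
        int nread;
        while ((nread = in.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        detector.dataEnd();
        return detector.getDetectedCharset(); // e.g. "UTF-8" or "WINDOWS-1252"
    }
}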
Depending on your use case, it might be easier to just agree with the consumer on a fixed encoding (e.g. they can upload UTF-8 or ANSI, but not both).
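If you do fix the contract to UTF-8, you can at least reject non-conforming uploads instead of silently garbling them; a minimal sketch:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Validator {
    // Returns true only if the bytes form a valid UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}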
Try as shown below, which worked for me for the same issue:
new InputStreamReader(csvFile.getInputStream(), "UTF-8")
I have written an application in Java and duplicated it in C#. The application reads and writes text files with tab-delimited data to be used by HMI software. The HMI software requires UTF or ANSI encoding for the degree symbol to display correctly, or I would just use ASCII, which seems to work fine.

The C# application can open files saved by either with no problem. The Java application reads files it saved perfectly, but there is a small problem that crops up when reading files saved with C#: it throws a NumberFormatException when parsing the first character in the file to an int. This character is always a "1". I have opened both files with EditPad Lite and they appear to be identical, even when viewed with the encoding shown, and the encoding is UTF-16LE. I'm racking my brain on this; any help would be appreciated.
lines = FileUtils.readLines(file, "UTF-16LE");
Integer.parseInt(line[0])
I cannot see any difference between the file saved in C# and the one saved in Java.
[Screenshot of the data in EditPad Lite]
// Workaround: if the first tab-delimited field is one character too long,
// the line starts with a BOM; drop that leading character.
if (lines.get(0).split("\\t")[0].length() == 2) {
    lines.set(0, lines.get(0).substring(1));
}
Your .NET code is probably writing a BOM. Compliant Unicode readers strip off any BOM, since it is metadata, not part of the text data.
Your Java code explicitly specifies the byte order:
FileUtils.readLines(file, "UTF-16LE");
It's somewhat of a catch-22: if the source has a BOM, then you can read it as "UTF-16". If it doesn't, then you can read it as "UTF-16LE" or "UTF-16BE", provided you know which byte order was used.
So, either write it with a BOM and read it without specifying the byte order, or, write it without a BOM and read it specifying the byte order.
With a BOM:
[C#]
File.WriteAllLines(file, lines, Encoding.Unicode);
[Java]
FileUtils.readLines(file, "UTF-16");
Without a BOM:
[C#]
File.WriteAllLines(file, lines, new UnicodeEncoding(false, false)); // little-endian, no BOM
[Java]
FileUtils.readLines(file, "UTF-16LE");
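A third option, if you cannot control the writer (a sketch using Apache Commons IO, which the FileUtils calls above already pull in): strip whatever BOM is present and fall back to UTF-16LE when there is none.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.input.BOMInputStream;

public class BomTolerantRead {
    static List<String> readUtf16Lines(String path) throws IOException {
        // Detect (and consume) a UTF-16 or UTF-8 BOM if one is present.
        try (BOMInputStream in = new BOMInputStream(new FileInputStream(path),
                ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_8)) {
            // With no BOM, fall back to UTF-16LE as in the examples above.
            String charset = in.hasBOM() ? in.getBOMCharsetName() : "UTF-16LE";
            return IOUtils.readLines(in, charset);
        }
    }
}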
In my Java code I read the file normally; I just specified the character encoding in the InputStreamReader:
File file = new File(fileName);
InputStreamReader fis = new InputStreamReader(new FileInputStream(file), "UTF-16LE");
BufferedReader br = new BufferedReader(fis);
String line = br.readLine();
I have this code:
BufferedWriter w = Files.newWriter(file, Charsets.UTF_8);
w.newLine();
StringBuilder sb = new StringBuilder();
sb.append("\"").append("éééé").append("\";");
w.write(sb.toString());
But it doesn't work. In the end, my file doesn't have UTF-8 encoding. I tried to do this when writing:
w.write(new String(sb.toString().getBytes(Charsets.US_ASCII), "UTF8"));
It made question marks appear everywhere in the file...
I found that there was a bug regarding the recognition of the initial BOM character (http://bugs.java.com/view_bug.do?bug_id=4508058), so I tried using the BOMInputStream class. But bomIn.hasBOM() always returns false, so I guess my problem is maybe not BOM-related?
Do you know how I can make my file encoded in UTF-8? Was the problem solved in Java 8?
You're writing UTF-8 correctly in your first example (although you're redundantly creating a String from a String).
The problem is that the viewer or tool you're using to view the file doesn't read the file as UTF-8.
Don't mix in ASCII; that just converts all the non-ASCII bytes to question marks.
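One way to convince yourself: dump the raw bytes and check that é appears as the byte pair C3 A9, which is its UTF-8 form (a minimal sketch; the file name is just a placeholder):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ShowBytes {
    public static void main(String[] args) throws IOException {
        // In valid UTF-8 output, 'é' must show up as the two bytes C3 A9.
        byte[] bytes = Files.readAllBytes(Paths.get("out.txt")); // placeholder path
        for (byte b : bytes) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}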
When I read the following Unicode string, it reads differently. When I execute the program using NetBeans it works fine, but when I run it using Eclipse / directly from CMD it does not work.
After reading, it adds these characters: ƒÂ
Then the string becomes Mýxico
The string to be read is Mýxico... I used CSVReader to read it as follows:
sourceReader = new CSVReader(new FileReader(sourceFile));
List<String[]> data = sourceReader.readAll();
Any suggestions?
It sounds like the different editors are using different encodings; for example, one is using UTF-8 and one is using something else.
Check that the encoding settings in all of the editors are the same.
We should specify the encoding while reading the file, so the above statement should be changed as follows:
targetReader=new CSVReader(new InputStreamReader(
new FileInputStream(targetFile), "UTF-8"));
data = targetReader.readAll();
I have a very annoying encoding problem using opencsv.
When I export a CSV file, I set the character encoding to 'UTF-8':
CSVWriter writer = new CSVWriter(new OutputStreamWriter(new FileOutputStream("D:/test.csv"), "UTF-8"));
But when I open the CSV file with Microsoft Office Excel 2007, it turns out that it has 'UTF-8 BOM' encoding?
Once I save the file in Notepad and re-open it, the file turns back to UTF-8 and all the letters in it appear fine.
I think I've searched enough, but I haven't found any solution to prevent my file from turning into 'UTF-8 BOM'. Any ideas, please?
I suppose your file has a 'UTF-8 without BOM' encoding.
You'd better add a BOM to your file. It's not necessary in most cases, but one obvious exception is when you deal with MS Excel:
FileOutputStream os = new FileOutputStream(file);
// Write the UTF-8 signature (BOM), the byte sequence EF BB BF,
// so that Excel recognizes the file as UTF-8.
os.write(0xef);
os.write(0xbb);
os.write(0xbf);
// Specify UTF-8 explicitly so the writer does not fall back to the platform default.
CSVWriter csvWrite = new CSVWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8));
Now your file will be understood by Excel as UTF-8 CSV.
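Equivalently (a sketch; the opencsv import path depends on your version), you can write the BOM as the single character U+FEFF through the writer itself:

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import com.opencsv.CSVWriter; // au.com.bytecode.opencsv.CSVWriter in older opencsv versions

public class BomCsvExample {
    public static void main(String[] args) throws Exception {
        OutputStreamWriter out = new OutputStreamWriter(
                new FileOutputStream("D:/test.csv"), StandardCharsets.UTF_8);
        out.write('\uFEFF'); // the UTF-8 writer encodes this as EF BB BF
        try (CSVWriter writer = new CSVWriter(out)) {
            writer.writeNext(new String[] { "id", "name" }); // sample row
        }
    }
}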
UTF-8 and UTF-8 with signature (sometimes incorrectly called UTF-8 BOM) are the same encoding; the signature is used only to distinguish it from other encodings. Any Unicode-aware application should process the UTF-8 signature (the three-byte sequence EF BB BF) correctly.
Why Java specifically adds this signature, and how to stop it from doing that, I don't know.