How to read Japanese fields from a CSV file into Java beans? - java

I've tried several popular CSV-to-Java deserializers - OpenCSV, JSefa, and Smooks - and none of them correctly reads the file:
First Name,Last Name
エリック,山中
花子,鈴木
一郎,鈴木
裕子,田中
政治,山村
into my Java object collection.
OpenCSV code:
HeaderColumnNameTranslateMappingStrategy<Contact> strat =
    new HeaderColumnNameTranslateMappingStrategy<Contact>();
strat.setType(Contact.class);
strat.setColumnMapping(colNameTranslateMap);

CsvToBean<Contact> csv = new CsvToBean<Contact>();
Reader fileReader = new InputStreamReader(new FileInputStream(file), "UTF-8");
contacts = csv.parse(strat, new CSVReader(fileReader));
I've tried setting the charset to UTF-8, UTF-16, and ISO-8859-1 when I create the InputStreamReader, but the collection is never populated properly. As seen in the debugger and in System.out, the fields contain garbage, and often the number of records is wrong.

FileInputStream is for reading streams of binary data, like an MP3 or a PNG. Instead of a bare FIS, use a FileReader for reading streams of characters.
To be blunt: it doesn't matter which charsets you tried if they didn't work. You need to figure out what encoding the CSV file is actually using, and set that encoding when reading the file. Note that a FileReader alone cannot do this; its documentation says:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
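One way to find out what the file actually uses is to look for a byte-order mark at its start. A minimal sketch of such a check (the helper name is mine; note that a missing BOM proves nothing, since plain UTF-8 usually has none):
import java.io.FileInputStream;
import java.io.IOException;

public class BomSniffer {
    // Best guess at the encoding from a leading BOM, or null if none is found.
    static String sniffBom(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            int b1 = in.read(), b2 = in.read(), b3 = in.read();
            if (b1 == 0xEF && b2 == 0xBB && b3 == 0xBF) return "UTF-8";
            if (b1 == 0xFE && b2 == 0xFF) return "UTF-16BE";
            if (b1 == 0xFF && b2 == 0xFE) return "UTF-16LE";
            return null; // no BOM; for CSV exports, try UTF-8 first
        }
    }
}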

Related

Reading and Writing Text files with UTF-16LE encoding and Apache Commons IO

I have written an application in Java and duplicated it in C#. The application reads and writes text files with tab-delimited data to be used by HMI software. The HMI software requires UTF or ANSI encoding for the degree symbol to display correctly, or I would just use ASCII, which seems to work fine. The C# application can open files saved by either with no problem. The Java application reads files it saved itself perfectly, but there is a small problem that crops up when reading files saved with C#: it throws a NumberFormatException when parsing the first character in the file to an int. This character is always a "1". I have opened both files in EditPad Lite and they appear to be identical, even when viewed with the encoding shown, and the encoding is UTF-16LE. I'm racking my brain on this; any help would be appreciated.
List<String> lines = FileUtils.readLines(file, "UTF-16LE");
String[] line = lines.get(0).split("\\t");
Integer.parseInt(line[0]);
I cannot see any difference between the file saved in C# and the one saved in Java
[Screenshot of the data in EditPad Lite]
// Workaround: if the first token carries an extra leading character (the BOM), strip it.
if (lines.get(0).split("\\t")[0].length() == 2) {
    lines.set(0, lines.get(0).substring(1));
}
Your .NET code is probably writing a BOM. Compliant Unicode readers strip off any BOM, since it is metadata, not part of the text data.
Your Java code explicitly specifies the byte order:
FileUtils.readLines(file, "UTF-16LE");
It's somewhat of a Catch-22: if the source has a BOM, then you can read it as "UTF-16". If it doesn't, then you can read it as "UTF-16LE" or "UTF-16BE", provided you know which one it is.
So, either write it with a BOM and read it without specifying the byte order, or write it without a BOM and read it specifying the byte order.
With a BOM:
[C#]
File.WriteAllLines(file, lines, Encoding.Unicode);
[Java]
FileUtils.readLines(file, "UTF-16");
Without a BOM:
[C#]
File.WriteAllLines(file, lines, new UnicodeEncoding(false, false));
[Java]
FileUtils.readLines(file, "UTF-16LE");
In my Java code I read the file normally; I just specified the character encoding in the InputStreamReader:
File file = new File(fileName);
InputStreamReader fis = new InputStreamReader(new FileInputStream(file), "UTF-16LE");
BufferedReader br = new BufferedReader(fis);
String line = br.readLine();
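An alternative to the length check above: when a BOM survives decoding, it shows up as the character U+FEFF at the start of the first line, so you can strip it explicitly. A minimal sketch (my own variant, not from the original posts):
String first = lines.get(0);
// Strip a decoded byte-order mark, if present, before parsing numbers.
if (!first.isEmpty() && first.charAt(0) == '\uFEFF') {
    lines.set(0, first.substring(1));
}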

Reading a file with the wrong encoding: CP1252 vs UTF-8

I have a byte array, which I put into an InputStreamReader and do some manipulations with:
Reader reader = new InputStreamReader(new ByteArrayInputStream(byteArr));
The JVM's default encoding is cp1252, but the file I am translating to a byte array has UTF-8 encoding. The file also contains German umlauts, and when I put the byte array into the InputStreamReader, Java decodes the umlauts to wrong symbols; for example, ü is represented as ü. I tried passing "UTF-8" and Charset.forName("UTF-8").newDecoder() to the InputStreamReader constructor, and translating strings from the reader via new String(oldStr.getBytes("cp1252"), "UTF-8"), but it didn't help. In the debugger, the reader variable shows a StreamDecoder whose "decoder" has the value MS1252$Decoder. Maybe that is the cause of my problem, but I don't understand how to fix it.
Try using the InputStreamReader(InputStream in, String charsetName) constructor and set the charset yourself:
Reader reader = new InputStreamReader(new ByteArrayInputStream(byteArr), "UTF-8");
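For illustration, here is why the wrong decoder produces exactly "ü": the two UTF-8 bytes of ü, decoded as cp1252, map to the two characters Ã and ¼. A minimal, self-contained sketch (the class name is mine):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        byte[] utf8 = "ü".getBytes(StandardCharsets.UTF_8); // 0xC3 0xBC
        // Wrong decoder: each UTF-8 byte becomes a separate cp1252 character.
        System.out.println(new String(utf8, Charset.forName("windows-1252"))); // ü
        // Right decoder: the two bytes form a single code point again.
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // ü
    }
}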
I had exactly the same error and finally solved the issue by adding this to the JVM startup options (it changes the JVM-wide default charset):
-Dfile.encoding=UTF8

opencsv CSVWriter using utf-8 doesn't seem to work for multiple languages

I have a very annoying encoding problem using opencsv.
When I export a csv file, I set the character encoding to 'UTF-8':
CSVWriter writer = new CSVWriter(new OutputStreamWriter(new FileOutputStream("D:/test.csv"), "UTF-8"));
but when I open the csv file with Microsoft Office Excel 2007, it turns out that it has 'UTF-8 BOM' encoding?
Once I save the file in Notepad and re-open it, the file turns back to UTF-8 and all the letters in it appear fine.
I think I've searched enough, but I haven't found any solution to prevent my file from turning into 'UTF-8 BOM'. Any ideas, please?
I suppose your file has 'UTF-8 without BOM' encoding.
You'd better prepend the BOM to your file. It's not necessary in most cases, but the one obvious exception is when you deal with MS Excel:
FileOutputStream os = new FileOutputStream(file);
// UTF-8 signature (BOM): EF BB BF
os.write(0xef);
os.write(0xbb);
os.write(0xbf);
CSVWriter csvWrite = new CSVWriter(new OutputStreamWriter(os, "UTF-8"));
Now Excel will understand your file as UTF-8 CSV.
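Putting the whole answer together, a minimal end-to-end sketch (the try-with-resources structure, sample rows, and path are my additions; the import assumes a modern opencsv, while older versions use au.com.bytecode.opencsv instead of com.opencsv):
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import com.opencsv.CSVWriter;

public class ExcelFriendlyCsv {
    public static void main(String[] args) throws IOException {
        try (FileOutputStream os = new FileOutputStream("D:/test.csv")) {
            // Write the UTF-8 signature first so Excel detects the encoding.
            os.write(0xEF);
            os.write(0xBB);
            os.write(0xBF);
            try (CSVWriter writer = new CSVWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8))) {
                writer.writeNext(new String[] {"First Name", "Last Name"});
                writer.writeNext(new String[] {"エリック", "山中"});
            }
        }
    }
}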
UTF-8 and 'UTF-8 with signature' (sometimes incorrectly called 'UTF-8 BOM') are the same encoding; the signature is used only to distinguish it from other encodings. Any Unicode application should process the UTF-8 signature (the three-byte sequence EF BB BF) correctly.
Why Java specifically adds this signature, and how to stop it from doing that, I don't know.

Issue encoding java->xls

This is not a pure Java question and can also be related to HTML.
I've written a Java servlet that queries a database table and shows the result as an HTML table. The user can also ask to receive the result as an Excel sheet.
I'm creating the Excel sheet by printing the same HTML table, but with the content type "application/vnd.ms-excel". The Excel file is created fine.
The problem is that the tables may contain non-English data, so I want to use UTF-8 encoding.
PrintWriter out = response.getWriter();
response.setContentType("application/vnd.ms-excel:ISO-8859-1");
//response.setContentType("application/vnd.ms-excel:UTF-8");
response.setHeader("cache-control", "no-cache");
response.setHeader("Content-Disposition", "attachment; filename=file.xls");
out.print(src);
out.flush();
The non-English characters (áéíóú) appear as garbage.
I also tried converting the String to bytes and back:
byte[] arrByte = src.getBytes("ISO-8859-1");
String result = new String(arrByte, "UTF-8");
But I'm still getting garbage. What can I do?
Thanks
UPDATE: if I open the Excel file in Notepad++, the file encoding shows as "UTF-8 without BOM"; if I change the encoding to "UTF-8" and then open the file in Excel, the characters "áéíóú" look good.
Excel's native format is binary, not text, so for a real .xls file you should not need to set any encoding; it simply doesn't apply. Whatever system you are using to build the Excel file (e.g. Apache POI) will take care of the encoding of text within it.
You should not try to convert the received bytes to a string; just store them in a byte array or write them out to a file.
EDIT: from the comment, it doesn't sound as if you are using a "real" binary Excel file, but a tab-delimited text file (CSV). In that case, make sure you use a consistent encoding, e.g. UTF-8, throughout.
Also, call setContentType (including the charset) before calling response.getWriter().
See the documentation for ServletResponse.getWriter().
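For illustration, the original snippet reordered accordingly (a sketch, not the poster's exact code):
// Set the content type and charset BEFORE obtaining the writer;
// otherwise the PrintWriter is created with the default encoding.
response.setContentType("application/vnd.ms-excel; charset=UTF-8");
response.setHeader("cache-control", "no-cache");
response.setHeader("Content-Disposition", "attachment; filename=file.xls");
PrintWriter out = response.getWriter();
out.print(src);
out.flush();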
EDIT: You can try writing the BOM. It's normally not required, but file-format handling in Office is far from normal...
Java doesn't really have support for the BOM, so you'll have to fake it. That means you need to use the response's OutputStream rather than its Writer, since you need to write raw bytes (the BOM). Change your code to this:
response.setContentType("application/vnd.ms-excel:UTF-8");
// set other headers also, "cache-control" etc..
OutputStream outputStream = response.getOutputStream();
outputStream.write(0xEF); // 1st byte of BOM
outputStream.write(0xBB);
outputStream.write(0xBF); // last byte of BOM
// now get a PrintWriter to stream the chars.
PrintWriter out = new PrintWriter(new OutputStreamWriter(outputStream,"UTF-8"));
out.print(src);
Do you get "garbage" when you print result to standard output?
Edit (code in code tags from the comment below):
response.setContentType("application/vnd.ms-excel; charset=UTF-8")
Try using the ServletResponse.setCharacterEncoding(java.lang.String charset) method; like setContentType, it must be called before getWriter() to take effect.
response.setCharacterEncoding("UTF-8");
I had the same issue. I fixed it by using print() instead of write():
outputStream.print('\ufeff');
(Printing the character U+FEFF through a UTF-8 writer emits the three BOM bytes.)

BufferedReader returns ISO-8859-15 String - how to convert to UTF-16 String?

I have an FTP client class which returns an InputStream pointing to the file. I would like to read the file row by row with a BufferedReader. The issue is that the client returns the file in binary mode, and the file has ISO-8859-15 encoding.
If the file/stream/whatever really contains ISO-8859-15 encoded text, you just need to specify that when you create the InputStreamReader:
BufferedReader br = new BufferedReader(
new InputStreamReader(ftp.getInputStream(), "ISO-8859-15"));
Then readLine() will create valid Strings in Java's native encoding (which is UTF-16, not UTF-8).
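That Strings are always UTF-16 internally is easy to verify. A minimal sketch (the byte value and expected outputs are mine; 0xA4 is the euro sign in ISO-8859-15):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        String s = new String(new byte[] {(byte) 0xA4}, Charset.forName("ISO-8859-15"));
        System.out.println(s);                 // € (U+20AC)
        System.out.println((int) s.charAt(0)); // 8364: chars are UTF-16 code units
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // re-encode however you need
    }
}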
Try this:
BufferedReader br = new BufferedReader(
new InputStreamReader(
ftp.getInputStream(),
Charset.forName("ISO-8859-15")
)
);
String row = br.readLine();
The original data is in ISO-8859-15, so the byte stream read by your InputStreamReader will be in that encoding. So read it in using that encoding, specified in the InputStreamReader constructor: that tells the InputStreamReader that the incoming byte stream is ISO-8859-15 and to perform the appropriate byte-to-character conversions.
The result will then be in Java's standard internal UTF-16 form, and you can do with it what you wish.
I think the current problem is that you're reading the stream with your default encoding (by not specifying an encoding in the InputStreamReader) and then trying to convert the result, by which time it's too late.
Using the default behaviour of these sorts of classes often ends in grief. It's a good idea to specify encodings wherever you can, and/or set the VM's default encoding via -Dfile.encoding.
Have you tried:
BufferedReader r = new BufferedReader(new InputStreamReader(ftp.getInputStream(), "ISO-8859-1"));
...
