BufferedOutputStream not working with Korean characters as expected - java

I'm trying to write Korean characters to a file, but what gets written looks like gibberish, and I currently have to decode it back to UTF-8 as a workaround to display the Korean text when I open the file as CSV. How can I achieve my requirement without that workaround?
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;

File localExport = File.createTempFile("char-test", ".csv");
try (
    FileOutputStream fos = new FileOutputStream(localExport);
    BufferedOutputStream bos = new BufferedOutputStream(fos);
    OutputStreamWriter outputStreamWriter =
        new OutputStreamWriter(bos, StandardCharsets.UTF_8)
) {
    ArrayList<String> rows = new ArrayList<>();
    rows.add("\"가짜 사용자\",사용자123,saint1_user123");
    rows.add("\"페이크유저루노도스트레스 성도1\",saint1_user1");
    for (int i = 0; i < 2; i++) {
        String csvUserStr = rows.get(i);
        outputStreamWriter.write(csvUserStr);
    }
}
Opening the file shows garbled data instead of the text I'm actually writing to it.

There is absolutely nothing wrong with your Java code. You are writing those characters, including the Korean ones, precisely as written.
Whatever tool you are using to look at this file? That's the broken one.
Tell it that the file is UTF-8. If you can't, get a better tool, or figure out which encoding it does read, and update your Java code to write in that encoding.
Note that CSV files, text files, etc. do not store the encoding that was used to write the data. Every program that reads or writes the file simply has to know which encoding it is; there's no real way to know other than being told.
UPDATE: From a comment it looks like 'the tool that is reading this' is Excel.
Excel asks for the encoding of the file when you use the 'import CSV' dialog. Pick UTF-8 in the dropdown. Depends on which version/OS you're on, but usually it's called 'File Origin'.
If you prefer that your client need not mess with the default: the default is usually something like MacRoman or Win1252, and with such an encoding it is in fact impossible to get Korean characters. They simply aren't in that character set.
If you want the fire-and-forget approach, generate the Excel file yourself, for example using Apache POI.

CSV files don't have any means to carry encoding information "in-band"—in the file itself. I'm guessing the default character encoding used for Excel CSV imports is the system default, so if that isn't Korean, they will have to specify the encoding when they import the CSV. If your client requires CSV, they have no choice but to accept that behavior.
However, if their requirement is to open your file in Excel (and not that the file has to be CSV format), you could write an Excel spreadsheet instead. The various Excel file formats do include character encoding information, so they would be able to open the file without manually specifying the encoding.
Library recommendations are off-topic, but libraries such as Apache POI make writing simple Excel sheets fairly easy. There are additional benefits as well, such as taking care of any necessary escaping for you, so that your file doesn't break whenever unanticipated values end up in the spreadsheet.
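If you go the POI route, a minimal sketch might look like this (the file name and cell values are made up for the example, and it assumes a POI version where Workbook is AutoCloseable):

import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

try (Workbook wb = new XSSFWorkbook();
     FileOutputStream out = new FileOutputStream("users.xlsx")) {
    Sheet sheet = wb.createSheet("users");
    Row row = sheet.createRow(0);
    row.createCell(0).setCellValue("가짜 사용자");    // XLSX carries its own encoding info
    row.createCell(1).setCellValue("사용자123");
    row.createCell(2).setCellValue("saint1_user123");
    wb.write(out);    // Excel opens this directly, no import dialog needed
}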

As mentioned, Excel fails to detect that the text is encoded in UTF-8. One solution is to write the invisible BOM character as the very first one:
outputStreamWriter.write("\uFEFF");
for ...
The BOM is a normally superfluous and ugly marker for the various UTF encodings, but it is what lets Excel recognize the file as UTF-8.
By the way, take a look at the class java.nio.file.Files, which can reduce the code to one line.
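For example, a sketch of that one-liner (assuming Java 11+ for Path.of; Files.write appends a line separator after each entry, and the BOM is prefixed to the first row):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Opens, writes (one row per line, UTF-8) and closes the file in a single call.
Files.write(Path.of("char-test.csv"),
        List.of("\uFEFF\"가짜 사용자\",사용자123,saint1_user123",
                "\"페이크유저루노도스트레스 성도1\",saint1_user1"),
        StandardCharsets.UTF_8);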

Related

How to read textfiles with unknown encoding?

I want to read several text files (eg CSV), but I don't know the encoding.
As the text files may contain special characters like umlauts, choosing the right encoding seems to be crucial.
new BufferedReader(new InputStreamReader(resource.getInputStream(), encoding));
I tried reading with ISO_8859_1, which did not handle the encoded umlauts properly. So I tried UTF-8, which works.
But I don't know whether this might also cause problems with different files in the future, and I never know before reading a file which encoding it is in.
So how should I best read files with encoding unknown?
Strictly speaking the other two answers are right - you just have to know what the encoding is to be guaranteed of anything. However, there are libraries out there that will allow you to make educated guesses about the encoding. Check out ICU4J or jchardet, for example.
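For a rough idea, here is a sketch using ICU4J's CharsetDetector (the file name is an assumption; the result is a best guess with a confidence score, not a guarantee):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.nio.file.Files;
import java.nio.file.Paths;

byte[] data = Files.readAllBytes(Paths.get("unknown.csv"));
CharsetDetector detector = new CharsetDetector();
detector.setText(data);
CharsetMatch match = detector.detect();    // statistical best guess, may be wrong
System.out.println(match.getName() + " (confidence " + match.getConfidence() + "%)");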
You have to know the encoding; you cannot read the files correctly if you don't know it. Since UTF-8 works, just keep using it. Also check with the producer of the files whether they will keep producing them in UTF-8. They should document this.
It is impossible to recognize the encoding of a text file programmatically with certainty. The only way is to try opening it in a text editor with different encodings until you can read the text.

Is it possible to create InputStream for a UTF-8 file?

We are making some changes to our production code.
Earlier we used an InputStream (basically a FileInputStream) for reading a file from a file path, and this InputStream is passed on to many methods.
Now we've realized the file can also contain Chinese characters, so we want to use UTF-8 encoding.
I have the file path as a String. Sometimes the file contains Chinese characters and sometimes it doesn't.
I am reluctant to make changes in so many methods and was trying to somehow use UTF-8 encoding while creating the InputStream (FileInputStream).
I searched on the internet, but all I could find produces output through a BufferedReader/InputStreamReader (for example, Reading InputStream as UTF-8 or http://www.mkyong.com/java/how-to-read-utf-8-encoded-data-from-a-file-java/).
So is it possible to read a file from a file path, handle Chinese characters as well, and get it as an InputStream?
An InputStream does not handle text, so it does not care about the encoding, so the direct answer to your question is: no, you can't create an InputStream with UTF-8 encoding.
You can however handle UTF-8 files just fine with an InputStream by simply carrying the bytes around and never manipulating them in any way.
If you want to read text from a file you need to construct a Reader and then you'll need to specify the encoding (UTF-8 for you) in the constructor.
If you show us the point where data from the InputStream gets turned into String or char[] objects, then I can show you the place where you need to change your code.
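To illustrate that boundary (the file path is made up): the stream only carries raw bytes, and the encoding comes into play at the single point where a Reader decodes them into characters:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

InputStream in = new FileInputStream("data.txt");    // raw bytes, no encoding involved
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_8))) {    // bytes become text right here
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);    // Chinese characters survive intact
    }
}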

Character encoding issue while reading an Excel file in a Java Web App

In a Java Web app, I'm using the JExcel API to read Excel files sent by clients.
I'm doing something like this :
byte[] excelFile = ...
InputStream inputStream = new ByteArrayInputStream(excelFile);
WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("CP1252");
Workbook w = Workbook.getWorkbook(inputStream, ws);
...
Struts gives me the Excel file as a byte array (I use the FormFile#getFileData() method).
It works OK on Windows. However, it behaves quite differently on Linux. While cells are parsed correctly and their content interpreted well (even when it contains non-ASCII characters like 'à', 'ê', etc.), sheet names are not: I get bad characters like '?' or '�'.
I forced the workbook encoding to UTF-8:
ws.setEncoding("UTF-8");
but it has no effect.
I saved the Excel file as UTF-8 too; nothing changed. I really don't understand why it does not work, especially for the sheet names, since the whole chain is UTF-8 (I have a Servlet Filter which forces HTTP request encoding to UTF-8 as well).
I had a similar problem, but with another Java Excel API. The problem is that Excel tries to be smart and replaces some characters for you. In my case, for example, Excel replaced three dots '...' with a single ellipsis character from its own character set, which is not standard UTF-8. My framework didn't recognize it, and I got the same undefined character (�) you are getting now. To fix it I had to manually edit all the Excel spreadsheets, and then it worked. The hard part was finding out which characters were affected. I'm not sure this is an option for you, though.
It seems to have been a bug in the JXL version I was using. Indeed, after upgrading the JAR to the latest version, the problem no longer occurs.

Excel or text file, which one to use?

I need to suggest an input format: Excel file or text file.
Assume the input is a large number of lines where I need to read the first string, for example:
A,B,C,D....
Should I use an Excel file and POI to read the first cell of each row, or a text file where each line's tokens are separated by a delimiter, parsing each line and reading the first token?
Use a text file. Because computers like it more. If business requires it, rename that text file into a "csv" file and you've got an Excel file.
If humans are going to enter data then use Excel. If the file is used as a communication channel between two systems use as simple as possible file.
If at all possible, use text file - much easier to handle/troubleshoot, easier to generate, uses less memory, does not have restrictions on number of rows, etc. In general - more predictable.
If you go with text files and people are manually preparing them, and you are dealing with non-ASCII text, you had better make sure everybody sends you the files in the correct encoding (usually UTF-8 is best). This is not an issue with Excel.
The only reason to use Excel workbook would be when you need some "business-people" to produce those input files, then that input effectively becomes a user interface to your system - Excel is usually considered more user friendly than Notepad. ;-)
If you do go with Excel, make sure that the people producing those Excel files will give you the correct version (I assume you would want the "old" XLS format, not the new XLSX format).
Rule of thumb: use a text file. It's more interchangeable and way easier to handle by any other software you may need to support in a few years.
If you need humans to edit the data and want the nicer, colored display Excel can provide, consider creating a macro that stores the data as CSV.

How to find if the file is a CSV file?

I have a scenario wherein the user uploads a file to the system. The only file type the system understands is CSV, but the user can upload any type of file, e.g. JPEG, DOC, or HTML. I need to throw an exception if the user uploads anything other than a CSV file.
Can anybody let me know how can I find if the uploaded file is a CSV file or not?
CSV files vary a lot, and they all could be called, legitimately, CSV files.
I guess your approach is not the best one; the correct approach would be to tell whether the uploaded file is a text file the application can parse, rather than whether it is a CSV or not.
You would report errors whenever you can't parse the file, be it a JPG, MP3, or a CSV in a format you cannot parse.
To do that, I would try to find a library that parses various CSV file formats; otherwise you have a long road ahead writing code to parse many possible types of CSV files (or you restrict the application's flexibility by supporting only a few CSV formats).
One such library for Java is opencsv
If you're using some library CSV parser, all you would have to do is catch any errors it throws.
If the CSV parser you're using is remotely robust, it will throw some useful errors in the event that it doesn't understand the file format.
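As a sketch of that approach with opencsv (the helper name looksLikeCsv is hypothetical; in opencsv 5.x, readNext() declares CsvValidationException):

import com.opencsv.CSVReader;
import com.opencsv.exceptions.CsvValidationException;
import java.io.IOException;
import java.io.StringReader;

static boolean looksLikeCsv(String content) {
    try (CSVReader reader = new CSVReader(new StringReader(content))) {
        String[] row;
        while ((row = reader.readNext()) != null) {
            // each row parsed; inspect field counts or values here if needed
        }
        return true;
    } catch (IOException | CsvValidationException e) {
        return false;    // the parser couldn't make sense of the file
    }
}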
I can think of several methods.
One way is to try to decode the file using UTF-8. (This is built into Java and is probably built into .NET too.) If the file decodes properly, then you at least know that it's a text file of some kind.
Once you know it's a text file, parse out the individual fields from each line and check that you get the number of fields that you expect. If the number of fields per line is inconsistent then you might just have a file that contains text but is not organized into lines and fields.
Otherwise you have a CSV. Then you can validate the fields.
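A rough sketch of that check, with strict UTF-8 decoding and a naive comma split that ignores quoted fields (the helper name is made up):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

static boolean looksLikeSimpleCsv(byte[] data) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    String text;
    try {
        text = decoder.decode(ByteBuffer.wrap(data)).toString();  // throws on non-text bytes
    } catch (CharacterCodingException e) {
        return false;     // not valid UTF-8, so not a text file we can handle
    }
    String[] lines = text.split("\r?\n");
    int expected = lines[0].split(",", -1).length;   // field count of the first line
    for (String line : lines) {
        if (line.split(",", -1).length != expected) {
            return false; // inconsistent number of fields per line
        }
    }
    return true;          // plausibly a simple CSV; fields can be validated next
}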
If it's a web application, you might want to check the content-type HTTP header the browser sends when uploading/posting a file through a form.
If there's a binding for the language you're using, you might also try libmagic, which is pretty good at recognizing file types. For example, the UNIX tool file uses it.
http://sourceforge.net/projects/libmagic/
I don't know if you can tell for 100% certain in any way, but I'd suggest that the first validations should be:
Is the file extension .csv
Count the number of commas per line in the file; there should normally be the same number of commas on each line for it to be a valid CSV file. (As Jkramer said, this only works if the files can't contain quoted commas.)
Try this one:
String type = Files.probeContentType(Paths.get(filepath));
I solved it like this: read the file with UTF-16 encoding; if no comma is found in the file, it means the UTF-16 encoding didn't work, which means this CSV file is in Excel format (not plain text).
if (fileA.endsWith(".csv") && fileB.endsWith(".csv")) {
    // first attempt: read as UTF-16
    second_list = readCSVFile(fileA);
    new_list = readCSVFile(fileB);
    if (!String.join("", second_list).contains(",") || !String.join("", new_list).contains(",")) {
        // no commas found, so UTF-16 didn't work; re-read these files with UTF-8 encoding
        System.out.println("[WARN] csv files will be read like text files (UTF-16 encoding couldn't find any comma in the file, i.e., UTF-16 encoding didn't work)");
        second_list = readFile(fileA);
        new_list = readFile(fileB);
    } else {
        // keep the csv as UTF-16 encoded
    }
}
