How to read text files with unknown encoding? - java

I want to read several text files (e.g. CSV), but I don't know the encoding.
As the text files may contain special characters like umlauts, choosing the right encoding seems to be crucial.
new BufferedReader(new InputStreamReader(resource.getInputStream(), encoding));
I tried reading with ISO_8859_1, which did not handle the encoded umlauts properly. So I tried UTF-8, which works.
But I don't know whether this might also cause problems with different files in the future, and I never know before reading a file which encoding it is in.
So how should I best read files whose encoding is unknown?

Strictly speaking the other two answers are right - you just have to know what the encoding is to be guaranteed of anything. However, there are libraries out there that will allow you to make educated guesses about the encoding. Check out ICU4J or jchardet, for example.
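For illustration, a minimal sketch using ICU4J's CharsetDetector (the file name is made up, and the result is an educated guess, not a guarantee):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.nio.file.Files;
import java.nio.file.Paths;

byte[] data = Files.readAllBytes(Paths.get("input.csv")); // hypothetical file
CharsetDetector detector = new CharsetDetector();
detector.setText(data);
CharsetMatch match = detector.detect();      // best statistical guess
System.out.println(match.getName());         // e.g. "UTF-8" or "ISO-8859-1"
System.out.println(match.getConfidence());   // confidence score, 0-100
String text = match.getString();             // content decoded with the guessed charset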

You have to know the encoding; you cannot read the files correctly if you don't know it. As UTF-8 works, just keep using it. Also check with the producer of the files whether they will keep producing them in UTF-8. They should document this.

It is impossible to programmatically recognize the encoding of a text file with certainty. The only way is to try opening it in a text editor with different encodings until you can read the text.

Related

BufferedOutputStream not working with Korean characters as expected

I'm trying to write Korean characters to a file, but it writes what looks like gibberish, which I have to work around to display as Korean when I open it as CSV. How can I meet my requirement without the workaround of decoding back to UTF-8 to show the Korean data?
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;

File localExport = File.createTempFile("char-test", ".csv");
try (
        FileOutputStream fos = new FileOutputStream(localExport);
        BufferedOutputStream bos = new BufferedOutputStream(fos);
        OutputStreamWriter outputStreamWriter =
                new OutputStreamWriter(bos, StandardCharsets.UTF_8)
) {
    ArrayList<String> rows = new ArrayList<>();
    rows.add("\"가짜 사용자\",사용자123,saint1_user123");
    rows.add("\"페이크유저루노도스트레스 성도1\",saint1_user1");
    // write each CSV row; the writer encodes the characters as UTF-8
    for (int i = 0; i < rows.size(); i++) {
        String csvUserStr = rows.get(i);
        outputStreamWriter.write(csvUserStr);
    }
}
It's writing the below data instead of the one I'm actually writing to the File.
There is absolutely nothing wrong with your Java code. You are writing those characters, including the Korean, precisely as written.
Whatever tool you are using to look at this file?
That's the broken one. Tell it that the file is UTF-8 based. If you can't, get a better tool or figure out which encoding it reads in, and update your java code.
Note that CSV files, text files, etc - they do not store the encoding that was used to write the data. All the programs that read/write to the file need to just know what encoding it is, there's no real way to know other than being told.
UPDATE: From a comment it looks like 'the tool that is reading this' is Excel.
Excel asks for the encoding of the file when you use the 'import CSV' dialog. Pick UTF-8 in the dropdown. Depends on which version/OS you're on, but usually it's called 'File Origin'.
If you prefer that your client need not mess with the default, note that the default is usually something like MacRoman or windows-1252, and with such an encoding it is in fact impossible to get Korean characters. They simply aren't in that character set.
If you want the fire-and-forget approach, generate the Excel file yourself, for example using Apache POI.
CSV files don't have any means to carry encoding information "in-band"—in the file itself. I'm guessing the default character encoding used for Excel CSV imports is the system default, so if that isn't Korean, they will have to specify the encoding when they import the CSV. If your client requires CSV, they have no choice but to accept that behavior.
However, if their requirement is to open your file in Excel (and not that the file has to be CSV format), you could write an Excel spreadsheet instead. The various Excel file formats do include character encoding information, so they would be able to open the file without manually specifying the encoding.
Library recommendations are off-topic, but libraries such as Apache POI make writing simple Excel sheets fairly easy. There are additional benefits as well, such as taking care of any necessary escaping for you, so that your file doesn't break whenever unanticipated values are included in the spreadsheet.
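As a rough sketch of that approach with Apache POI (the file and sheet names are made up for illustration):

import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

try (Workbook workbook = new XSSFWorkbook();
     FileOutputStream out = new FileOutputStream("char-test.xlsx")) {
    Sheet sheet = workbook.createSheet("users");
    Row row = sheet.createRow(0);
    // cell values are stored as text in the workbook itself,
    // so no CSV escaping or charset guessing is needed on import
    row.createCell(0).setCellValue("가짜 사용자");
    row.createCell(1).setCellValue("사용자123");
    row.createCell(2).setCellValue("saint1_user123");
    workbook.write(out);
}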
As mentioned, Excel fails to detect that the text is encoded in UTF-8. One solution is to write an invisible BOM character as the first one:
outputStreamWriter.write("\uFEFF");
for ...
The BOM is a normally superfluous and ugly marker for the various UTF encodings, but here it tells Excel that the file is UTF-8.
By the way, take a look at the class java.nio.file.Files, which can reduce the code to one line.
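A minimal sketch of that one-liner, assuming the rows from the question and prepending the BOM to the first row:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// BOM ("\uFEFF") prepended to the first row so Excel detects UTF-8
List<String> rows = Arrays.asList(
        "\uFEFF\"가짜 사용자\",사용자123,saint1_user123",
        "\"페이크유저루노도스트레스 성도1\",saint1_user1");
Files.write(Paths.get("char-test.csv"), rows, StandardCharsets.UTF_8);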

Is it possible to create InputStream for a UTF-8 file?

We are making a code change to our production code.
Previously we used an InputStream (basically a FileInputStream) for reading a file from a file path, and this InputStream is passed on to many methods.
Now we have realized that the file can also contain Chinese characters, so we want to use UTF-8 encoding.
I have the file path as a String, and the file sometimes contains Chinese characters and sometimes not.
I am reluctant to make changes in so many methods and was trying to somehow use UTF-8 encoding while creating the InputStream (FileInputStream).
I searched the internet, but all I could find ends up with a BufferedReader/InputStreamReader (for example Reading InputStream as UTF-8 or http://www.mkyong.com/java/how-to-read-utf-8-encoded-data-from-a-file-java/).
So is it possible to read a file from a file path, handle the Chinese characters, and still end up with an InputStream?
An InputStream does not handle text, so it does not care about the encoding, so the direct answer to your question is: no, you can't create an InputStream with UTF-8 encoding.
You can however handle UTF-8 files just fine with an InputStream by simply carrying the bytes around and never manipulating them in any way.
If you want to read text from a file you need to construct a Reader and then you'll need to specify the encoding (UTF-8 for you) in the constructor.
If you show us the point where data from the InputStream gets turned into String or char[] objects, then I can show you the place where you need to change your code.
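As a sketch of that boundary: keep passing the InputStream around unchanged, and only where bytes become text, wrap it in a Reader with an explicit charset. The helper method below is hypothetical:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical boundary method: the InputStream stays byte-oriented,
// and UTF-8 decoding happens only here, where bytes become text.
static String readAsUtf8(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    try (BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
    }
    return sb.toString();
}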

Java encodings for Japanese

Our software has a script that creates JAR files for different languages; for Japanese we use the encoding SJIS in a call to native2ascii. This worked the last time a Japanese build was attempted, but now it seems to work only in certain contexts. For example, in the following dialog the encoding seems to work only in the title bar:
Anyone have any idea about what might be causing this? Could this problem be related to a change in Java?
What exactly do you pass through native2ascii? Just to make sure, you're using native2ascii -encoding Shift_JIS, right? And you're passing text files or source files through native2ascii, right?
My only other idea is that after the text has been converted to \uXXXX format, the font you're using to display the dialog may not have all the Kanji and Kana. Explicitly set a font, and try that.
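A quick sketch for that test ("MS Gothic" is just one example of a font that ships with Japanese glyphs):

import java.awt.Font;
import javax.swing.JLabel;
import javax.swing.JOptionPane;

JLabel label = new JLabel("日本語のテスト");            // sample Japanese text
label.setFont(new Font("MS Gothic", Font.PLAIN, 12)); // force a font with Kanji/Kana glyphs
JOptionPane.showMessageDialog(null, label);

If the text renders here but not in your dialog, the dialog's font is the problem.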
I would suggest checking these 2 things:
Make absolutely sure that the native2ascii conversions are correct. You should do a round-trip conversion with the -reverse flag and make sure that your input and output are in sync (see the commands after this list).
Double-check that your fonts used can support Shift-JIS. Those blocks and symbols that appear in the dialog text and button text look like the characters might be OK, but the fonts might not support them.
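For the round trip in the first point, the check could look like this (the file names are illustrative):

native2ascii -encoding Shift_JIS messages_ja.txt messages_ja.properties
native2ascii -reverse -encoding Shift_JIS messages_ja.properties roundtrip_ja.txt

If roundtrip_ja.txt is not byte-for-byte identical to messages_ja.txt, the conversion is losing characters somewhere.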
An additional word of caution: If this application is intended for use on Windows, then you really should be using the MS932 or windows-31j encoding. SJIS will work for all but a dozen or so symbols, but it turns out these symbols (like the full-width tilde) are actually used quite frequently in Japan.
I think the right way to do this is to use UTF-8 or UTF-16 exclusively. Kanji and Katakana demand special attention.

File upload-download in its actual format

I have to write code to upload/download a file to/from a remote machine. But when I upload the file, newlines are not preserved, and some binary characters are automatically inserted. Also, I'm not able to save the file in its actual format; I have to save it as "filename.ser". I'm using the serialization/deserialization concept of Java.
Thanks in advance.
How exactly are you transmitting the files? If you're using implementations of InputStream and OutputStream, they work on a byte-by-byte level so you should end up with a binary-equal output.
If you're using implementations of Reader and Writer, they convert the bytes to characters according to some character mapping, and then perform the reverse process when saving. Depending on the platform encodings of the various machines (and possibly other effects if you're not specifying the charset explicitly), you could well end up with differences in the binary file.
The fact that you mention newlines makes me think that you're using Readers to send strings (and possibly that you're stitching the strings back together yourself by manually adding newlines). If you want the files to be binary equal, then send them as a stream of bytes and store that stream verbatim. If you want them to be equal as strings in a given character set, then use Readers and Writers but specify the character set explicitly. If you want them to be transmitted as strings in the platform default set (not very useful), then accept that they're not going to be binary equal as files.
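A minimal sketch of the byte-for-byte variant described above (the method name is hypothetical):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Copies bytes verbatim; no charset is involved, so the stored file
// ends up binary-equal to the source.
static void copyVerbatim(InputStream in, OutputStream out) throws IOException {
    byte[] buffer = new byte[8192];
    int read;
    while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);
    }
}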
(Also, your question really doesn't provide much information to solve it. To me, it basically reads "I wrote some code to do X, and it doesn't work. Where did I go wrong?" You seem to assume that your code is correct by not listing it, but at the same time recognise that it's not...)

How to get correct encoding?

I have a UTF-8 file which I want to read and display in my Java program.
In the Eclipse console (stdout), as well as in Swing, I'm getting question marks instead of the correct characters.
InputStreamReader inputStreamReader =
        new InputStreamReader(new FileInputStream(f), "UTF-8");
BufferedReader fr = new BufferedReader(inputStreamReader);
System.out.println(fr.readLine());
inputStreamReader.getEncoding(); // prints UTF-8
I generally don't have problem displaying accented letters either on the linux console or firefox etc.
Why is that so? It's making me ill :/
Thank you for your help.
I'm not a Java expert, but it seems like you're creating a UTF-8 InputStreamReader with a file that's not necessarily UTF-8.
See also: Java : How to determine the correct charset encoding of a stream
It sounds like the Eclipse console is not processing UTF-8 characters, and/or the font configured for that console does not support the Unicode characters you are trying to display.
You might be able to get this to work if you configure Eclipse to expect UTF-8 characters, and also make sure that the font in use can display those Unicode characters that are encoded in your file.
From the Eclipse 3.1 New and Noteworthy page:
You can configure the console to display output using a character encoding different from the default using the Console Encoding settings on the Common tab of a launch configuration.
As for Swing, I think you're going to need to select the right font.
There are several parameters at work when the system has to display Unicode characters:
The first and foremost that comes to mind is the encoding of the input stream or buffer, which you've already figured out.
The next one in the list is the Unicode capabilities of the application - Eclipse does support display of Unicode characters in the console output; with a workaround :).
The last one in my mind is the font used in your console output - not all fonts come with glyphs for displaying Unicode characters.
Update
The non-display of Unicode characters is most likely due to the fact that Cp1252 is used for encoding characters in the console output. This can be modified in the Run configuration of the application - the setting appears on the Common tab of the run configuration.
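If changing the launch configuration is not an option, a workaround sketch is to wrap System.out in a UTF-8 PrintStream yourself (the console must still be configured to interpret UTF-8 for the characters to show up):

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// autoflush enabled, characters encoded as UTF-8 regardless of the platform default
PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
utf8Out.println("grüße, 日本語");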
