Find out encoding directly from an input stream [duplicate] - java

I'm facing a problem.
A file can be written in some encoding such as UTF-8, UTF-16, UTF-32, etc.
When I read a UTF-16 file, I use the code below:
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(file), "UTF-16"));
How can I determine which encoding the file is in before I read it?
When I read a UTF-8 encoded file using UTF-16, I can't read the characters correctly.

There is no good way to do that. The question you're asking is like determining the radix of a number just by looking at it. For example, what is the radix of 101?
The best solution would be to read the data into a byte array. Then you can use String(byte[] bytes, Charset charset) to test it with multiple encodings, from most likely to least likely.
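A minimal sketch of that approach, assuming the file is small enough to load at once. It uses a CharsetDecoder configured to reject malformed input, because the String constructor silently replaces bad bytes and so never fails; the class and method names here are made up for illustration:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CharsetGuesser {
    // Try the candidate charsets in order and return the first one that decodes cleanly.
    static Charset guess(byte[] bytes, Charset... candidates) {
        for (Charset cs : candidates) {
            try {
                cs.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(bytes));
                return cs;                     // decoded without errors
            } catch (CharacterCodingException e) {
                // fall through and try the next candidate
            }
        }
        return null;                           // nothing decoded cleanly
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(guess(bytes,
                StandardCharsets.UTF_8, StandardCharsets.UTF_16, StandardCharsets.ISO_8859_1));
    }
}

Keep in mind that a clean decode is only a hint, not proof: pure ASCII decodes under almost every charset, and ISO-8859-1 accepts any byte sequence, so it only makes sense as the last candidate.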

You cannot. Which transformation format applies is usually signalled by the first few bytes of the file (the BOM, which is two to four bytes long, and only if one was written at all). You cannot see those just from the outside.

You can read the first few bytes and try to guess the encoding.
If all else fails, try reading with different encodings until one works (no exception when decoding and it 'looks' OK).
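If a BOM is present, sniffing it is straightforward. A rough sketch, with made-up names, that only checks the common UTF BOMs and pushes the remaining bytes back so the caller can keep reading:

import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public final class BomSniffer {
    // Returns the charset indicated by a leading BOM, or null if no BOM is found.
    static Charset sniffBom(PushbackInputStream in) throws IOException {
        byte[] bom = new byte[4];
        int n = in.read(bom, 0, 4);
        Charset found = null;
        int bomLength = 0;
        if (n >= 4 && (bom[0] & 0xFF) == 0xFF && (bom[1] & 0xFF) == 0xFE && bom[2] == 0 && bom[3] == 0) {
            found = Charset.forName("UTF-32LE"); bomLength = 4;
        } else if (n >= 4 && bom[0] == 0 && bom[1] == 0 && (bom[2] & 0xFF) == 0xFE && (bom[3] & 0xFF) == 0xFF) {
            found = Charset.forName("UTF-32BE"); bomLength = 4;
        } else if (n >= 3 && (bom[0] & 0xFF) == 0xEF && (bom[1] & 0xFF) == 0xBB && (bom[2] & 0xFF) == 0xBF) {
            found = StandardCharsets.UTF_8; bomLength = 3;
        } else if (n >= 2 && (bom[0] & 0xFF) == 0xFE && (bom[1] & 0xFF) == 0xFF) {
            found = StandardCharsets.UTF_16BE; bomLength = 2;
        } else if (n >= 2 && (bom[0] & 0xFF) == 0xFF && (bom[1] & 0xFF) == 0xFE) {
            found = StandardCharsets.UTF_16LE; bomLength = 2;
        }
        if (n > 0) {
            in.unread(bom, bomLength, n - bomLength); // give back everything after the BOM
        }
        return found;
    }
}

Wrap the stream as new PushbackInputStream(new FileInputStream(file), 4) before calling this. Note that the UTF-32LE check has to come before the UTF-16LE one, because FF FE is a prefix of both BOMs.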

Related

Fix mixed encoding in string

I have a file which contains the following string:
AAdοbe Dοcument Clοud
if viewed in Notepad++. In the hex view the ο characters show up as the byte pair CE BF.
If I read the file with Java the string looks like this:
AAdÎ¿be DÎ¿cument ClÎ¿ud
How I can get the same encoding in Java as with Notepad++?
Your file is encoded as UTF-8, and the byte pair CE BF is the UTF-8 encoding of the character ο ('GREEK SMALL LETTER OMICRON', U+03BF).
If you use the Encoding pull-down menu in Notepad++ to specify UTF-8, you should see the content as:
AAdοbe Dοcument Clοud
You might want to replace those Greek ο's with regular Latin o's ('LATIN SMALL LETTER O' (U+006F)).
If you decide to keep the Greek ο's, you need to make sure your Java program reads the file using UTF-8, which is best done using one of these:
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt")); // UTF-8 is the default
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt"), StandardCharsets.UTF_8);
If you look at the text with a debugger, you should see that it is now read correctly. If you print the text, make sure the console window you're using can handle UTF-8 characters, otherwise it might just print wrong, even though it was read correctly.
You must set the encoding in the FileReader like this (the FileReader constructor that takes a Charset exists since Java 11):
new FileReader(fileName, StandardCharsets.UTF_8)
You must read the file in Java using the same encoding that the file was written with.
If you are working with non-standard encodings, even trying to read the encoding with something like:
InputStreamReader r = new InputStreamReader(new FileInputStream(theFile));
r.getEncoding()
can give the wrong value: getEncoding() only reports the charset the reader was constructed with (here the platform default), not the encoding of the file.
There's a small library which handles detection of the encoding a bit better: https://code.google.com/archive/p/juniversalchardet/
It also has some holes in obtaining the proper encoding, but I've used it.
While using it I found that most of the non-standard encodings I ran into could be read with UTF-16, like:
new FileReader(fileName, StandardCharsets.UTF_16)
Java has supported the UTF-16 encoding for a long time; it's defined in the standard API as StandardCharsets.UTF_16. That character set covers lots of language-specific characters and emoji.
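For reference, a detection run with juniversalchardet looks roughly like this; it follows the UniversalDetector usage shown on the project page, so treat the exact signatures as an assumption and check them against the version you actually pull in:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.mozilla.universalchardet.UniversalDetector;

public class DetectEncoding {
    // Feed the file's bytes to the detector and return its best guess, or null if it has none.
    static String detect(String fileName) throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        byte[] buf = new byte[4096];
        try (InputStream in = new FileInputStream(fileName)) {
            int nread;
            while ((nread = in.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }
        detector.dataEnd();
        return detector.getDetectedCharset(); // e.g. "UTF-8", "SHIFT_JIS", or null
    }
}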

Writing unicode to rtf file

I'm trying to write strings in different languages to an RTF file. I have tried a few different things.
I use Japanese here as an example, but it's the same for the other languages I have tried.
public void writeToFile(){
    String strJapanese = "日本語";
    DataOutputStream outStream;
    File file = new File("C:\\file.rtf");
    try{
        outStream = new DataOutputStream(new FileOutputStream(file));
        outStream.writeBytes(strJapanese);
        outStream.close();
    }catch (Exception e){
        System.out.println(e.toString());
    }
}
I also have tried:
byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);
Or more specific:
byte[] b = strJapanese.getBytes("Shift-JIS");
String output = new String(b);
The output stream also has the writeUTF method:
outStream.writeUTF(strJapanese);
You can use the byte[] directly in the output stream with the write method. All of the above gives me garbled characters for everything except West European languages. To see if it works I have tried opening the resulting document in Notepad++ and setting the appropriate encoding. I have also used OpenOffice, where you get to choose the encoding and font when opening the document.
If it does work but my computer can't open it properly, is there a way to check that?
Strings in Java are Unicode internally (stored as UTF-16 code units), but when you want to write them out you need to specify an encoding:
try {
    FileOutputStream fos = new FileOutputStream("test.txt");
    Writer out = new OutputStreamWriter(fos, "UTF-8");
    out.write(str);
    out.close();
} catch (IOException e) {
    e.printStackTrace();
}
ref: http://download.oracle.com/javase/tutorial/i18n/text/stream.html
DataOutputStream outStream;
You probably don't want a DataOutputStream for writing an RTF file. DataOutputStream is for writing binary structures to a file, but RTF is text-based. Typically an OutputStreamWriter, setting the appropriate charset in the constructor would be the way to write to text files.
outStream.writeBytes(strJapanese);
In particular this fails because writeBytes really does write bytes, even though you pass it a String. A much more appropriate datatype would have been byte[], but that's just one of the places where Java's handling of bytes vs chars is confusing. The way it converts your string to bytes is simply by taking the lower eight bits of each UTF-16 code unit, and throwing the rest away. This results in ISO-8859-1 encoding with garbled nonsense for all the characters that don't exist in ISO-8859-1.
byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);
This doesn't really do anything useful. You encode to UTF-8 bytes and then decode that back to a String using the default charset. It's almost always a mistake to touch the default charset, as it is unpredictable across different machines.
outStream.writeUTF(strJapanese);
This would be a better stab at writing UTF-8, but it's still not quite right as it uses Java's bogus “modified UTF-8” encoding, and more importantly RTF files don't actually support UTF-8, and shouldn't really directly include any non-ASCII characters at all.
Traditionally non-ASCII characters from 128 upwards should be written as hex bytes escapes like \'80, and the encoding for them is specified, if it is at all, in font \fcharset and \cpg escapes that are very, very annoying to deal with, and don't offer UTF-8 as one of the options.
In more modern RTF, you get \u1234x escapes as in Dabbler's answer (+1). Each escape encodes one UTF-16 code unit, which corresponds to a Java char, so it's not too difficult to regex-replace all non-ASCII characters with their escaped variants.
This is supported by Word 97 and later but some other tools may ignore the Unicode and fall back to the x replacement character.
RTF is not a very nice format.
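As a rough illustration of that idea (a plain loop rather than a regex, and not code from the answer): every character above 0x7F is replaced with a \uN? escape, where N is the UTF-16 code unit as a signed 16-bit decimal, which is what the RTF \u control word expects, and ? is the fallback character for readers that ignore the escape.

// Escape all non-ASCII characters in a string for inclusion in RTF body text.
// RTF's own special characters (backslash and braces) still need their usual escaping.
static String rtfEscape(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c < 0x80) {
            sb.append(c);
        } else {
            sb.append("\\u").append((int) (short) c).append('?');
        }
    }
    return sb.toString();
}

For example, 日本語 comes out as \u26085?\u26412?\u-30050? — the last value is negative because the code unit is above 32767.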
You can write any Unicode character expressed as its decimal number by using the \u control word. E.g. \u1234? would represent the character whose Unicode code point is 1234, and ? is the replacement character for cases where the character cannot be adequately represented (e.g. because the font doesn't contain it).

Why character streams?

I understand that Java character streams wrap byte streams such that the underlying byte stream is interpreted as per the system default or an otherwise specifically defined character set.
My system's default char-set is UTF-8.
If I use a FileReader to read in a text file, everything looks normal as the default char-set is used to interpret the bytes from the underlying InputStreamReader. If I explicitly define an InputStreamReader to read the UTF-8 encoded text file in as UTF-16, everything obviously looks strange. Using a byte stream like FileInputStream and redirecting its output to System.out, everything looks fine.
So, my questions are;
Why is it useful to use a character stream?
Why would I use a character stream instead of directly using a byte stream?
When is it useful to define a specific char-set?
Code that deals with strings should only "think" in terms of text - for example, reading an input source line by line, you don't want to care about the nature of that source.
However, storage is usually byte-oriented - so you need to create a conversion between the byte-oriented view of a source (encapsulated by InputStream) and the character-oriented view of a source (encapsulated by Reader).
So a method which (say) counts the lines of text in an input source should take a Reader parameter. If you want to count the lines of text in two files, one of which is encoded in UTF-8 and one of which is encoded in UTF-16, you'd create an InputStreamReader around a FileInputStream for each file, specifying the appropriate encoding each time.
(Personally I would avoid FileReader completely - the fact that it doesn't let you specify an encoding makes it useless IMO.)
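A small sketch of that shape; countLines and the file names are made up, and the only place a charset is mentioned is where bytes are turned into characters:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class LineCounter {
    // The counting logic thinks purely in terms of text: it takes a Reader and never sees bytes.
    static int countLines(Reader reader) throws IOException {
        try (BufferedReader br = new BufferedReader(reader)) {
            int lines = 0;
            while (br.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        // The byte-to-character decision is made once, at the edge, per source.
        int a = countLines(new InputStreamReader(new FileInputStream("utf8.txt"), StandardCharsets.UTF_8));
        int b = countLines(new InputStreamReader(new FileInputStream("utf16.txt"), StandardCharsets.UTF_16));
        System.out.println(a + " lines, " + b + " lines");
    }
}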
An InputStream reads bytes, while a Reader reads characters. Because of the way bytes map to characters, you need to specify the character set (or encoding) when you create an InputStreamReader, the default being the platform character set.
When you are reading/writing text which contains characters that could be > 127, use a char stream. When you are reading/writing binary data, use a byte stream.
You can read text as binary if you wish, but unless you make a lot of assumptions it rarely gains you much.

Readline() in Java does not handle Chinese characters properly

I have a text file with Chinese words written to a line. The line is surrounded with "\r\n", and written using fileOutputStream.write(string.getBytes()).
I have no problems reading lines of English words; my buffered reader parses them with readLine() perfectly. However, it recognizes the Chinese sentence as multiple lines, thus screwing up my programme flow.
Any solutions?
Using string.getBytes() encodes the String using the platform default encoding. That is rarely what you want, especially when you're trying to write characters that are not native to your current locale.
Specify the encoding instead (using string.getBytes("UTF-8"), for example).
A cleaner and more Java-esque way would be to wrap your OutputStream in an OutputStreamWriter like this:
Writer w = new OutputStreamWriter(out, "UTF-8");
Then you can simply call writer.write(string) and don't need to repeat the encoding each time you want to write a String.
And, as commented below, specify the same encoding when reading the file (using a Reader, preferably).
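Putting both sides together, a minimal round trip might look like this (the file name and the sample text are made up):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ChineseLines {
    public static void main(String[] args) throws IOException {
        // Write the line with an explicit encoding instead of string.getBytes().
        try (Writer w = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream("words.txt"), StandardCharsets.UTF_8))) {
            w.write("中文字符\r\n");
        }

        // Read it back with the same encoding; readLine() now sees exactly one line.
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream("words.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}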
If you're outputting the text via fileOutputStream.write(string.getBytes()), you're outputting with the default encoding for the platform. It's important to ensure you're then reading with the appropriate encoding, and using methods that are encoding-aware. The problem won't be in your BufferedReader instance, but whatever Reader you have under it that's converting bytes into characters.
This article may be of use: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Character encoding UTF and ISO-8859-1 in CSV [duplicate]

Possible Duplicate:
How to add a UTF-8 BOM in java
My oracle database has a character set of UTF8.
I have a Java stored procedure which fetches record from the table and creates a csv file.
BLOB retBLOB = BLOB.createTemporary(conn, true, BLOB.DURATION_SESSION);
retBLOB.open(BLOB.MODE_READWRITE);
OutputStream bOut = retBLOB.setBinaryStream(0L);
ZipOutputStream zipOut = new ZipOutputStream(bOut);
PrintStream out = new PrintStream(zipOut,false,"UTF-8");
The German characters (fetched from the table) become gibberish in the CSV if I use the above code. But if I change the encoding to ISO-8859-1, then I can see the German characters properly in the CSV file.
PrintStream out = new PrintStream(zipOut,false,"ISO-8859-1");
I have read in some posts that we should use UTF-8, as it is safe and will also encode other languages (Chinese etc.) properly, which ISO-8859-1 will fail to do.
Please suggest which encoding I should use. (There is a strong chance that we might have Chinese/Japanese words stored in the table in the future.)
You're currently only talking about one part of a process that is inherently two-sided.
Encoding something to bytes is only really relevant in the sense that some other process comes along and decodes it back into text at some later point. And of course, both processes need to use the same character set else the decode will fail.
So it sounds to me that the process that takes the BLOB out of the database and into the CSV file is assuming that the bytes are an ISO-8859-1 encoding of text. Hence if you store them as UTF-8, the decoding messes up (though the basic ASCII characters have the same byte representation in both, which is why they still decode correctly).
UTF-8 is a good character set to use in almost all circumstances, but it's not magic enough to overcome the immutable law that the same character set must be used for decoding as was used for encoding. So either change your CSV-creator to decode with UTF-8, or you'll have to continue encoding with ISO-8859-1.
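A tiny illustration of that law, with a made-up string; the Ã¤ garbage is exactly what a charset mismatch produces:

import java.nio.charset.StandardCharsets;

public class MismatchDemo {
    public static void main(String[] args) {
        byte[] utf8 = "Bär".getBytes(StandardCharsets.UTF_8);

        // Same charset on both sides: round-trips cleanly.
        System.out.println(new String(utf8, StandardCharsets.UTF_8));      // Bär

        // Mismatched decode: the two UTF-8 bytes of ä appear as two Latin-1 characters.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // BÃ¤r
    }
}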
I suppose your BLOB data is ISO-8859-1 encoded. As it's stored as binary and not as text, its encoding does not depend on the database's encoding. You should check whether the BLOB was originally written in UTF-8 encoding and, if not, do so.
I think the problem is that Excel cannot figure out the UTF-8 encoding of the CSV:
utf-8 csv issue
But I'm still not able to resolve the issue even if I put a BOM on the PrintStream.
PrintStream out = new PrintStream(zipOut,false,"UTF-8");
out.write('\ufeff');
I also tried:
out.write(new byte[] { (byte)0xEF, (byte)0xBB, (byte)0xBF });
but to no avail.
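In case it helps, a hedged sketch of writing the BOM as raw bytes before the content; note that PrintStream.write(int) writes a single byte, so out.write('\ufeff') only emits 0xFF rather than a BOM. The file name and sample row are made up, and the zip entry handling is reduced to the minimum:

import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class BomCsvDemo {
    public static void main(String[] args) throws Exception {
        try (ZipOutputStream zipOut = new ZipOutputStream(new FileOutputStream("report.zip"))) {
            zipOut.putNextEntry(new ZipEntry("report.csv"));

            // Emit the three-byte UTF-8 BOM first so Excel recognises the encoding.
            zipOut.write(new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF });

            // Then write the CSV content as UTF-8.
            PrintStream out = new PrintStream(zipOut, false, "UTF-8");
            out.println("Größe;Straße");
            out.flush();

            zipOut.closeEntry();
        }
    }
}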
