Writing unicode to rtf file - java

I'm trying to write strings in different languages to an RTF file. I have tried a few different things.
I use Japanese here as an example, but it's the same for the other languages I have tried.
public void writeToFile(){
String strJapanese = "日本語";
DataOutputStream outStream;
File file = new File("C:\\file.rtf");
try{
outStream = new DataOutputStream(new FileOutputStream(file));
outStream.writeBytes(strJapanese);
outStream.close();
}catch (Exception e){
System.out.println(e.toString());
}
}
I have also tried:
byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);
Or more specific:
byte[] b = strJapanese.getBytes("Shift-JIS");
String output = new String(b);
The output stream also has the writeUTF method:
outStream.writeUTF(strJapanese);
You can also pass the byte[] directly to the output stream with the write method. All of the above give me garbled characters for everything except Western European languages. To see if it works, I have tried opening the resulting document in Notepad++ and setting the appropriate encoding. I have also used OpenOffice, where you get to choose the encoding and font when opening the document.
If it does work but my computer can't open it properly, is there a way to check that?

Strings in Java are Unicode (stored internally as UTF-16), but when you write them out you need to specify the encoding:
try {
FileOutputStream fos = new FileOutputStream("test.txt");
Writer out = new OutputStreamWriter(fos, "UTF8");
out.write(str);
out.close();
} catch (IOException e) {
e.printStackTrace();
}
ref: http://download.oracle.com/javase/tutorial/i18n/text/stream.html

DataOutputStream outStream;
You probably don't want a DataOutputStream for writing an RTF file. DataOutputStream is for writing binary structures to a file, but RTF is text-based. Typically an OutputStreamWriter, with the appropriate charset set in the constructor, would be the way to write to text files.
outStream.writeBytes(strJapanese);
In particular this fails because writeBytes really does write bytes, even though you pass it a String. A much more appropriate datatype would have been byte[], but that's just one of the places where Java's handling of bytes vs chars is confusing. The way it converts your string to bytes is simply by taking the lower eight bits of each UTF-16 code unit, and throwing the rest away. This results in ISO-8859-1 encoding with garbled nonsense for all the characters that don't exist in ISO-8859-1.
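The truncation is easy to observe in memory. The following is a minimal sketch (the class and method names are made up for illustration) that dumps what writeBytes emits for the question's Japanese string, written with Unicode escapes so the source file's own encoding does not matter:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class WriteBytesDemo {
    /** Hex dump of what DataOutputStream.writeBytes emits for the given text. */
    static String writeBytesHex(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeBytes(text); // one byte per char: the high eight bits are dropped
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : buffer.toByteArray()) {
            if (hex.length() > 0) hex.append(' ');
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // U+65E5 U+672C U+8A9E, the Japanese sample string from the question
        System.out.println(writeBytesHex("\u65E5\u672C\u8A9E")); // prints "E5 2C 9E"
    }
}
```

Three chars go in and three bytes come out, none of which can be mapped back to the original characters.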
byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);
This doesn't really do anything useful. You encode to UTF-8 bytes and then decode them back to a String using the default charset. It's almost always a mistake to rely on the default charset, as it is unpredictable across different machines.
outStream.writeUTF(strJapanese);
This would be a better stab at writing UTF-8, but it's still not quite right as it uses Java's bogus “modified UTF-8” encoding, and more importantly RTF files don't actually support UTF-8, and shouldn't really directly include any non-ASCII characters at all.
Traditionally non-ASCII characters from 128 upwards should be written as hex bytes escapes like \'80, and the encoding for them is specified, if it is at all, in font \fcharset and \cpg escapes that are very, very annoying to deal with, and don't offer UTF-8 as one of the options.
In more modern RTF, you get \u1234x escapes as in Dabbler's answer (+1). Each escape encodes one UTF-16 code unit, which corresponds to a Java char, so it's not too difficult to regex-replace all non-ASCII characters with their escaped variants.
This is supported by Word 97 and later but some other tools may ignore the Unicode and fall back to the x replacement character.
RTF is not a very nice format.

You can write any Unicode character expressed as its decimal number by using the \u control word. E.g. \u1234? would represent the character whose Unicode code point is 1234 (decimal), and ? is the replacement character for cases where the character cannot be adequately represented (e.g. because the font doesn't contain it).
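Putting the two answers together, a rough sketch of an escaper could look like the one below. The class name, the method names and the minimal document header are assumptions made for illustration; it handles only plain text (no line breaks or other control characters). It emits one escape per Java char (UTF-16 code unit) and writes the number as a signed 16-bit decimal, since RTF expects values above 32767 as negative numbers:

```java
final class RtfEscaper {
    /** Escapes plain text for use inside an RTF document body. */
    static String escape(String text) {
        StringBuilder rtf = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                rtf.append('\\').append(c);   // characters with meaning in RTF syntax
            } else if (c < 0x80) {
                rtf.append(c);                // plain ASCII passes through
            } else {
                int signed = (short) c;       // the escape takes a signed 16-bit decimal
                // '?' is the fallback shown by readers that ignore the Unicode escape
                rtf.append("\\u").append(signed).append('?');
            }
        }
        return rtf.toString();
    }

    /** Wraps the escaped text in a minimal envelope (this header is an assumption). */
    static String toDocument(String text) {
        return "{\\rtf1\\ansi\\uc1 " + escape(text) + "}";
    }

    public static void main(String[] args) {
        // U+65E5 U+672C U+8A9E, the Japanese sample string from the question
        System.out.println(toDocument("\u65E5\u672C\u8A9E"));
    }
}
```

For the sample string the escaped body is \u26085?\u26412?\u-30050?. Because the result is pure ASCII, it can be written with any ASCII-compatible charset, for example through an OutputStreamWriter constructed with US-ASCII.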

Related

Fix mixed encoding in string

I have a file which contains the following string:
AAdοbe Dοcument Clοud
if viewed in Notepad++. In hex view the string looks like this:
If I read the file with Java the string looks like this:
AAdοbe Dοcument Clοud
How I can get the same encoding in Java as with Notepad++?
Your file is encoded as UTF-8, and the CE BF bytes are the UTF-8 encoding of the character ο ('GREEK SMALL LETTER OMICRON' (U+03BF)).
If you use the Encoding pull-down menu in Notepad++ to specify UTF-8, you should see the content as:
AAdοbe Dοcument Clοud
You might want to replace those Greek ο's with regular Latin o's ('LATIN SMALL LETTER O' (U+006F)).
If you decide to keep the Greek ο's, you need to make sure your Java program reads the file using UTF-8, which is best done using one of these:
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt")); // UTF-8 is the default
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt"), StandardCharsets.UTF_8);
If you look at the text with a debugger, you should see that it is now read correctly. If you print the text, make sure the console window you're using can handle UTF-8 characters, otherwise it might just print wrong, even though it was read correctly.
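To see the difference without a file, here is a small sketch that decodes the same bytes twice. The class name and the shortened sample ("Ad", the CE BF pair, "be") are made up for illustration, and ISO-8859-1 merely stands in for whatever single-byte default charset the original program ended up using:

```java
import java.nio.charset.StandardCharsets;

final class OmicronDemo {
    // 'A', 'd', then CE BF (UTF-8 for U+03BF), then 'b', 'e'
    static final byte[] BYTES = {0x41, 0x64, (byte) 0xCE, (byte) 0xBF, 0x62, 0x65};

    /** Correct: the two bytes CE BF become the single character U+03BF. */
    static String asUtf8() {
        return new String(BYTES, StandardCharsets.UTF_8);
    }

    /** Wrong charset: CE and BF each become their own character (U+00CE, U+00BF). */
    static String asLatin1() {
        return new String(BYTES, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(asUtf8().length());   // prints 5
        System.out.println(asLatin1().length()); // prints 6
    }
}
```

The extra character in the second result is exactly the kind of garbling described in the question: one Greek letter turns into two Latin-1 characters.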
You must set the encoding in the file reader, like this:
new FileReader(fileName, StandardCharsets.UTF_8)
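You must read the file in Java using the same encoding that the file was written in.
If you are working with non-standard encodings, even trying to read the encoding with something like:
InputStreamReader r = new InputStreamReader(new FileInputStream(theFile));
r.getEncoding()
can give the wrong value, because getEncoding() reports the charset the reader was constructed with (here the platform default), not the encoding the file was actually written in.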
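A short sketch of why getEncoding() is no detector: the reader below is handed UTF-16 data but is told to use ISO-8859-1, and it simply reports what it was told. The class and method names are hypothetical; Charset.forName is used only to normalise the historical name that getEncoding() may return:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GetEncodingDemo {
    /** Returns the charset an InputStreamReader reports for the given data. */
    static Charset reportedCharset(byte[] data, Charset readerCharset) {
        try (InputStreamReader r =
                 new InputStreamReader(new ByteArrayInputStream(data), readerCharset)) {
            // getEncoding() only echoes the reader's own configuration
            return Charset.forName(r.getEncoding());
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
    }

    public static void main(String[] args) {
        byte[] utf16Data = "abc".getBytes(StandardCharsets.UTF_16);
        // Reports ISO-8859-1 even though the bytes are UTF-16
        System.out.println(reportedCharset(utf16Data, StandardCharsets.ISO_8859_1));
    }
}
```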
There's a little library that handles encoding detection a bit better: https://code.google.com/archive/p/juniversalchardet/
It also has some gaps in detecting the proper encoding, but I've used it.
While using it I found that many of the files I had in non-standard encodings could be read as UTF-16, like:
new FileReader(fileName, StandardCharsets.UTF_16)
Java has supported UTF-16 for a long time; it is defined in the standard API as StandardCharsets.UTF_16. UTF-16 can represent every Unicode character, including language-specific characters and emoji, but it only decodes correctly when the file was actually written as UTF-16.

FileInputStream and Unicode in Java

I'm new to Java and I'm trying to understand byte streams and character streams. I see many people say that a byte stream is suitable only for the ASCII character set, while a character stream can support all character sets (ASCII, Unicode, etc.). I think there is a misunderstanding, because I can use a byte stream to read and write a Unicode character.
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
public class DemoApp {
public static void main(String args[]) {
FileInputStream fis = null;
FileOutputStream fos = null;
try {
fis = new FileInputStream("abc.txt");
fos = new FileOutputStream("def.txt");
int k;
while ((k = fis.read()) != -1) {
fos.write(k);
System.out.print((char) k);
}
}
catch (FileNotFoundException fnfe) {
System.out.printf("ERROR: %s", fnfe);
}
catch (IOException ioe) {
System.out.printf("ERROR: %s", ioe);
}
finally {
try {
if (fos != null)
fos.close();
}
catch (IOException ioe) {
System.out.printf("ERROR: %s", ioe);
}
try {
if (fis != null)
fis.close();
}
catch (IOException ioe) {
System.out.printf("ERROR: %s", ioe);
}
}
}
}
The abc.txt file contains the Unicode character Ǽ, and I saved the file using UTF-8 encoding. The code works well: it creates a new file def.txt, and this file contains the Unicode character Ǽ.
And I have 2 questions:
What is the truth about byte stream regarding Unicode character? Does byte stream support Unicode character or not?
When I try to print with s.o.p((char) k) the result is not a Unicode character, it is just ASCII characters: Ç¼. And I don't understand why the result is not a Unicode character, because I know that Java and the char data type support Unicode characters. I tried to save this code as UTF-8 but the problem persists.
Sorry for my English grammar and thank you in advance!
What is the truth about byte stream regarding Unicode character? Does byte stream support Unicode character or not?
In fact, there is no such thing as a "Unicode character". There are three distinct concepts that you should NOT mix up.
Unicode code points
Characters in some encoding of a sequence of code points.
The Java char type, which is neither of the above. Strictly speaking.
You need to do some serious background reading on this:
The Wikipedia pages on Unicode
https://www.w3.org/International/talks/0505-unicode-intro/
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Having cleared that up, we can say that while a byte stream can be used to read an encoding of a sequence of Unicode code points, the stream API is NOT designed for reading and writing character-based text of any form. It is designed for reading and writing sequences of bytes (8-bit binary values) ... which may represent anything. The Stream API is designed to be agnostic of what the bytes represent: it doesn't know, and doesn't care!
When I try to print with s.o.p((char) k) the result is not a Unicode character, it is just ASCII characters: Ç¼. And I don't understand why the result is not a Unicode character, because I know that Java and the char data type support Unicode characters. I tried to save this code as UTF-8 but the problem persists.
(Correction. Those are NOT ASCII characters, they are LATIN-1 characters!)
The problem is not in Java. The problem is that a console is configured to expect text to be sent to it with a specific character encoding, but you are sending it characters with a different encoding.
When you read and write characters using a stream, the stream doesn't know and doesn't care about the encoding. So, if you read a file that is valid UTF-8 encoded text and use a stream to write it to a console that expects (for example) LATIN-1, then the result is typically garbage.
Another way to get garbage (which is what is happening here) is to read an encoded file as a sequence of bytes, and then cast the bytes to characters and print the characters. That is the wrong thing to do. If you want the characters to come out correctly, you need to decode the bytes into a sequence of characters and then print the characters. Casting is not decoding.
If you were reading the bytes via a Reader, the decoding would happen automatically, and you would not get that kind of mangling. (You might possibly get another kind ... if the console was not capable of displaying the characters, or if you configured the Reader stack to decode with the wrong charset.)
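The difference between casting and decoding can be shown on the two bytes that UTF-8 uses for Ǽ (U+01FC), namely C7 BC. This is a minimal sketch with made-up class and method names, using an in-memory stream instead of abc.txt:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class CastVersusDecode {
    /** What the question's loop does: every byte is cast to its own char. */
    static String castEachByte(byte[] data) {
        StringBuilder text = new StringBuilder();
        try (InputStream in = new ByteArrayInputStream(data)) {
            int k;
            while ((k = in.read()) != -1) {
                text.append((char) k); // a cast, not a decode
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    /** What a Reader does: the byte sequence is decoded as UTF-8. */
    static String decodeUtf8(byte[] data) {
        StringBuilder text = new StringBuilder();
        try (Reader in = new InputStreamReader(
                 new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                text.append((char) c); // here read() already returns a decoded char
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

For the bytes C7 BC, castEachByte yields the two Latin-1 characters Ç and ¼ (U+00C7, U+00BC), while decodeUtf8 yields the single character Ǽ.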
In summary: If you are trying to make a literal copy of a file (for example), use a byte stream. If you are trying to process the file as text, use a character stream.
The problem with your example code is that you appear to be trying to do both things at the same time with one pass through the file; i.e. make a literal copy of the file AND display it as text on the console. That is technically possible ... but difficult. My advice: don't try to do both things at the same time.

How to convert chunks of UTF-8 bytes to characters?

I have a large UTF-8 input that is divided into 1 kB chunks. I need to process it using a method that accepts a String. Something like:
for (File file: inputs) {
byte[] b = FileUtils.readFileToByteArray(file);
String str = new String(b, "UTF-8");
processor.process(str);
}
My problem is that I have no guarantee that a UTF-8 character is not split between two chunks. The result of running my code is that some lines end with '?', which corrupts my input.
What would be a good approach to solve this?
If I understand correctly, you had a large text, which was encoded with UTF-8, then split into 1-kilobyte files. Now you want to read the text back, but you are concerned that an encoded character might be split across file boundaries, and cause a UTF-8 decoding error.
The API is a bit dusty, but there is a SequenceInputStream that will create what appears to be a single InputStream from a series of sub-streams. Create one of these with a collection of FileInputStream instances, then create an InputStreamReader that decodes the stream of UTF-8 bytes to text for your application.
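A sketch of that approach follows. The class and helper names are made up, and the chunks are in-memory streams produced by a split helper so the example is self-contained; with real files each chunk would be a FileInputStream instead:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ChunkJoiner {
    /** Decodes the chunks as one continuous UTF-8 stream. */
    static String decodeChunks(List<InputStream> chunks) {
        InputStream joined = new SequenceInputStream(Collections.enumeration(chunks));
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(joined, StandardCharsets.UTF_8)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    /** Test helper: cuts data into fixed-size pieces, ignoring character boundaries. */
    static List<InputStream> split(byte[] data, int chunkSize) {
        List<InputStream> chunks = new ArrayList<>();
        for (int from = 0; from < data.length; from += chunkSize) {
            int length = Math.min(chunkSize, data.length - from);
            chunks.add(new ByteArrayInputStream(data, from, length));
        }
        return chunks;
    }
}
```

Because the decoder sees one uninterrupted byte stream, a character whose bytes straddle a chunk boundary is decoded normally. For very large inputs, the text could be handed to the processor in pieces read from the Reader rather than collected into a single String.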

Find out encoding directly from an input stream [duplicate]

I'm facing a problem.
A file can be written in some encoding such as UTF-8, UTF-16, UTF-32, etc.
When I read a UTF-16 file, I use the code below:
BufferedReader in = new BufferedReader(
new InputStreamReader(
new FileInputStream(file), "UTF16"));
How can I determine which encoding the file is in before I read the file ?
When I read UTF-8 encoded file using UTF-16 I can't read the characters correctly.
There is no good way to do that. The question you're asking is like determining the radix of a number by looking at it. For example, what is the radix of 101?
The best solution would be to read the data into a byte array. Then you can use String(byte[] bytes, Charset charset) to test it with multiple encodings, from most likely to least likely.
You cannot. Which transformation format applies is usually determined by the first four bytes of the file (assuming a BOM). You cannot see those just from the outside.
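When a BOM is present, checking for it takes only a few comparisons. The sketch below (hypothetical class and method names) inspects the first bytes of a file and returns the charset name the BOM implies, or an empty Optional when there is none. The UTF-32 patterns are tested first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM:

```java
import java.util.Optional;

final class BomSniffer {
    /** Returns the charset name implied by a leading BOM, if there is one. */
    static Optional<String> charsetFromBom(byte[] head) {
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(head, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        if (startsWith(head, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        return Optional.empty(); // no BOM: the encoding has to be guessed another way
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Many UTF-8 files carry no BOM at all, so an empty result says nothing about the encoding.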
You can read the first few bytes and try to guess the encoding.
If all else fails, try reading with different encodings until one works (no exception when decoding and it 'looks' OK).
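One caveat when trying encodings in turn: the String(byte[], Charset) constructor never throws on bad input, it silently substitutes a replacement character. To get an actual exception, a CharsetDecoder configured to report errors can be used. A minimal sketch, with hypothetical class and method names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

final class CharsetGuesser {
    /** Returns the first candidate that decodes the data without any error. */
    static Optional<Charset> firstThatDecodes(byte[] data, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            try {
                candidate.newDecoder()
                         .onMalformedInput(CodingErrorAction.REPORT)
                         .onUnmappableCharacter(CodingErrorAction.REPORT)
                         .decode(ByteBuffer.wrap(data));
                return Optional.of(candidate);
            } catch (CharacterCodingException e) {
                // the bytes are not valid in this charset; try the next candidate
            }
        }
        return Optional.empty();
    }
}
```

Order the candidates from strictest to most permissive. ISO-8859-1 accepts every possible byte sequence, so it always "works" and belongs at the end of the list; a successful decode only means the bytes are valid in that charset, not that the text is what the author wrote.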

Character encoding UTF and ISO-8859-1 in CSV [duplicate]

Possible Duplicate:
How to add a UTF-8 BOM in java
My Oracle database has a character set of UTF8.
I have a Java stored procedure which fetches records from the table and creates a CSV file.
BLOB retBLOB = BLOB.createTemporary(conn, true, BLOB.DURATION_SESSION);
retBLOB.open(BLOB.MODE_READWRITE);
OutputStream bOut = retBLOB.setBinaryStream(0L);
ZipOutputStream zipOut = new ZipOutputStream(bOut);
PrintStream out = new PrintStream(zipOut,false,"UTF-8");
The German characters (fetched from the table) become gibberish in the CSV if I use the above code. But if I change the encoding to ISO-8859-1, then I can see the German characters properly in the CSV file.
PrintStream out = new PrintStream(zipOut,false,"ISO-8859-1");
I have read in some posts that we should use UTF-8, as it is safe and will also properly encode other languages (Chinese etc.), which ISO-8859-1 cannot do.
Please suggest which encoding I should use. (There is a strong chance that we will have Chinese/Japanese words stored in the table in the future.)
You're currently only talking about one part of a process that is inherently two-sided.
Encoding something to bytes is only really relevant in the sense that some other process comes along and decodes it back into text at some later point. And of course, both processes need to use the same character set else the decode will fail.
So it sounds to me as if the process that takes the BLOB out of the database and into the CSV file is assuming that the bytes are an ISO-8859-1 encoding of text. Hence if you store them as UTF-8, the decoding goes wrong (though the basic ASCII characters have the same byte representation in both, which is why they still decode correctly).
UTF-8 is a good character set to use in almost all circumstances, but it's not magic enough to overcome the immutable law that the same character set must be used for decoding as was used for encoding. So you can either change whatever reads your CSV to decode with UTF-8, or you'll have to continue encoding with ISO-8859-1.
I suppose your BLOB data is ISO-8859-1 encoded. As it's stored as binary and not as text, its encoding does not depend on the database's encoding. You should check whether the BLOB was originally written in UTF-8 encoding and, if not, write it that way.
I think the problem is that Excel cannot figure out the UTF-8 encoding of the CSV.
utf-8 csv issue
But I'm still not able to resolve the issue even if I put a BOM on the PrintStream.
PrintStream out = new PrintStream(zipOut,false,"UTF-8");
out.write('\ufeff');
I also tried:
out.write(new byte[] { (byte)0xEF, (byte)0xBB, (byte)0xBF });
but to no avail.
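One detail of the first attempt can be checked in isolation: PrintStream.write(int) writes a single byte (the low eight bits of its argument), so write('\ufeff') emits only FF, whereas print(char) passes the character through the stream's UTF-8 encoder. The sketch below uses made-up names and an in-memory buffer in place of the zip stream. It does not explain why the raw-bytes attempt failed as well, so the remaining problem may lie elsewhere, for example in how the file is opened:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;

final class BomDemo {
    /** Bytes produced by out.write(0xFEFF): write(int) emits one byte only. */
    static byte[] viaWrite() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = newUtf8Stream(buffer);
        out.write('\uFEFF');
        out.flush();
        return buffer.toByteArray();
    }

    /** Bytes produced by out.print of the same char: encoded as UTF-8. */
    static byte[] viaPrint() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = newUtf8Stream(buffer);
        out.print('\uFEFF');
        out.flush();
        return buffer.toByteArray();
    }

    private static PrintStream newUtf8Stream(ByteArrayOutputStream buffer) {
        try {
            return new PrintStream(buffer, false, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new UncheckedIOException(e); // UTF-8 is always available
        }
    }
}
```

Here viaWrite() returns the single byte FF, and viaPrint() returns the three bytes EF BB BF, which is the UTF-8 BOM.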
