FileInputStream and Unicode in Java

I'm new to Java and I'm trying to understand byte streams and character streams. I see that many people say byte streams are suitable only for the ASCII character set, while character streams can support all character sets (ASCII, Unicode, etc.). I think there is a misunderstanding here, because I can use a byte stream to read and write a Unicode character.
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;

public class DemoApp {
    public static void main(String args[]) {
        FileInputStream fis = null;
        FileOutputStream fos = null;
        try {
            fis = new FileInputStream("abc.txt");
            fos = new FileOutputStream("def.txt");
            int k;
            while ((k = fis.read()) != -1) {
                fos.write(k);
                System.out.print((char) k);
            }
        }
        catch (FileNotFoundException fnfe) {
            System.out.printf("ERROR: %s", fnfe);
        }
        catch (IOException ioe) {
            System.out.printf("ERROR: %s", ioe);
        }
        finally {
            try {
                if (fos != null)
                    fos.close();
            }
            catch (IOException ioe) {
                System.out.printf("ERROR: %s", ioe);
            }
            try {
                if (fis != null)
                    fis.close();
            }
            catch (IOException ioe) {
                System.out.printf("ERROR: %s", ioe);
            }
        }
    }
}
The abc.txt file contains the Unicode character Ǽ, and I saved the file using UTF-8 encoding. The code works very well: it creates a new file def.txt, and this file contains the Unicode character Ǽ.
I have two questions:
What is the truth about byte streams regarding Unicode characters? Do byte streams support Unicode characters or not?
When I try to print with System.out.print((char) k), the result is not a Unicode character, it is just ASCII characters: Ç¼. I don't understand why the result is not a Unicode character, because I know that Java and the char data type support Unicode characters. I tried saving this code file as UTF-8 but the problem persists.
Sorry for my English grammar, and thank you in advance!

What is the truth about byte streams regarding Unicode characters? Do byte streams support Unicode characters or not?
In fact, there is no such thing as a "Unicode character". There are three distinct concepts that you should NOT mix up.
Unicode code points
Characters in some encoding of a sequence of code points.
The Java char type, which is, strictly speaking, neither of the above.
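As a quick illustration of that last point (my own example, not from the original post): a code point outside the Basic Multilingual Plane does not fit in a single Java char; it is stored as a surrogate pair of two chars.

public class CodePointDemo {
    public static void main(String[] args) {
        String s = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF, written as a surrogate pair
        System.out.println(s.length());                            // 2 -> two UTF-16 chars
        System.out.println(s.codePointCount(0, s.length()));       // 1 -> one code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1d11e
    }
}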
You need to do some serious background reading on this:
The Wikipedia pages on Unicode
https://www.w3.org/International/talks/0505-unicode-intro/
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Having cleared that up, we can say that while a byte stream can be used to read an encoding of a sequence of Unicode code points, the byte stream API is NOT designed for reading and writing character-based text of any form. It is designed for reading and writing sequences of bytes (8-bit binary values), which may represent anything. The stream API is designed to be agnostic about what the bytes represent: it doesn't know, and it doesn't care!
When I try to print with System.out.print((char) k), the result is not a Unicode character, it is just ASCII characters: Ç¼. I don't understand why the result is not a Unicode character, because I know that Java and the char data type support Unicode characters. I tried saving this code file as UTF-8 but the problem persists.
(Correction. Those are NOT ASCII characters, they are LATIN-1 characters!)
The problem is not in Java. The problem is that the console is configured to expect text to be sent to it in a specific character encoding, but you are sending it characters in a different encoding.
When you read and write characters using a stream, the stream doesn't know and doesn't care about the encoding. So if you read a file that is valid UTF-8 encoded text and use a stream to write it to a console that expects (for example) LATIN-1, the result is typically garbage.
Another way to get garbage (which is what is happening here) is to read an encoded file as a sequence of bytes, then cast the bytes to characters and print those characters. That is the wrong thing to do. If you want the characters to come out correctly, you need to decode the bytes into a sequence of characters and then print them. Casting is not decoding.
If you were reading the bytes via a Reader, the decoding would happen automatically, and you would not get that kind of mangling. (You might possibly get another kind ... if the console was not capable of displaying the characters, or if you configured the Reader stack to decode with the wrong charset.)
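For the record, here is a minimal sketch (my own code, not the poster's) of reading the same abc.txt through a character stream, so the UTF-8 bytes are actually decoded before printing:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadAsText {
    public static void main(String[] args) throws IOException {
        // Decode explicitly as UTF-8 instead of casting raw bytes to char.
        try (Reader in = new InputStreamReader(
                new FileInputStream("abc.txt"), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                System.out.print((char) c);
            }
        }
    }
}

Whether Ǽ then displays correctly still depends on the console itself using an encoding and font that can show it.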
In summary: If you are trying to make a literal copy of a file (for example), use a byte stream. If you are trying to process the file as text, use a character stream.
The problem with your example code is that you appear to be trying to do both things at the same time with one pass through the file; i.e. make a literal copy of the file AND display it as text on the console. That is technically possible ... but difficult. My advice: don't try to do both things at the same time.


Best delimiter to safely parse byte arrays from a stream

I have a byte stream that returns a sequence of byte arrays, each of which represents a single record.
I would like to parse the stream into a list of individual byte[]s. Currently, I have hacked in a three-byte delimiter so that I can identify the end of each record, but I have concerns.
I see that there is a standard ASCII record separator character:
30 036 1E 00011110 RS  Record Separator
Is it safe to use a byte[] derived from this character as a delimiter if the byte arrays (which were UTF-8 encoded) have been compressed and/or encrypted? My concern is that the compression/encryption output might produce the record separator byte for some other purpose. Please note that the individual byte[] records are compressed/encrypted, rather than the entire stream.
I am working in Java 8 and using Snappy for compression. I haven't picked an encryption library yet, but it would certainly be one of the stronger, standard, private key approaches.
You can't simply declare a byte as a delimiter if you're working with random unstructured data (which compressed/encrypted data resembles quite closely), because the delimiter can always appear as a regular data byte in such data.
If the size of the data is already known when you start writing, just write the size first and then the data. When reading back, you then know you need to read the size first (e.g. 4 bytes for an int), and then as many bytes as the size indicates.
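A minimal sketch of that length-prefixed framing (my own illustration; DataOutputStream/DataInputStream are just one convenient way to write and read the 4-byte length):

import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class LengthPrefixedRecords {

    // Write each record as: 4-byte big-endian length, then the record bytes.
    static void writeRecords(OutputStream rawOut, List<byte[]> records) throws IOException {
        DataOutputStream out = new DataOutputStream(rawOut);
        for (byte[] record : records) {
            out.writeInt(record.length);
            out.write(record);
        }
        out.flush();
    }

    // Read records back until the end of the stream.
    static List<byte[]> readRecords(InputStream rawIn) throws IOException {
        DataInputStream in = new DataInputStream(rawIn);
        List<byte[]> records = new ArrayList<>();
        while (true) {
            int length;
            try {
                length = in.readInt();   // throws EOFException at a clean end of stream
            } catch (EOFException eof) {
                break;
            }
            byte[] record = new byte[length];
            in.readFully(record);        // read exactly 'length' bytes
            records.add(record);
        }
        return records;
    }
}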
This will obviously not work if you can't tell the size while writing. In that case, you can use an escaping mechanism: select a rarely appearing byte as the escape character, escape all occurrences of that byte in the data, and use a different byte as the end indicator.
e.g.
final static byte ESCAPE = (byte) 0xBC;
final static byte EOF = (byte) 0x00;

OutputStream out = ...
for (byte b : source) {
    if (b == ESCAPE) {
        // escape data bytes that have the value of ESCAPE
        out.write(ESCAPE);
        out.write(ESCAPE);
    } else {
        out.write(b);
    }
}
// write EOF marker ESCAPE, EOF
out.write(ESCAPE);
out.write(EOF);
Now when reading, if you read the ESCAPE byte, you read the next byte and check for EOF. If it's not EOF, it's an escaped ESCAPE that represents a data byte.
InputStream in = ...
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
int b;
while ((b = in.read()) != -1) {
    // read() returns 0-255, so mask the (signed) byte constant before comparing
    if (b == (ESCAPE & 0xFF)) {
        b = in.read();
        if (b == EOF)
            break;
        buffer.write(b);   // an escaped ESCAPE: keep the data byte
    } else {
        buffer.write(b);
    }
}
If the bytes to be written are perfectly randomly distributed, this will increase the stream length by about 1/256. For data domains that are not completely random, you can select the byte that appears least frequently (based on static data analysis, or just an educated guess).
Edit: you can reduce the escaping overhead by using more elaborate logic. For example, the code above can only produce ESCAPE + ESCAPE or ESCAPE + EOF; the other 254 byte values can never follow an ESCAPE, so those combinations could be exploited to encode legal data.
It is completely unsafe; you never know what might turn up in your data. Perhaps you should consider something like protobuf, or a scheme like 'first write the record length, then write the record, then rinse, lather, repeat'?
If you have a length, you don't need a delimiter. Your reading side reads the length, then knows how much to read for the first record, and then knows to read the next length -- all assuming that the lengths themselves are fixed-length.
See the developers' suggestions for streaming a sequence of protobufs.

Read lines of characters and get file position

I'm reading sequential lines of characters from a text file. The encoding of the characters in the file might not be single-byte.
At certain points, I'd like to get the file position at which the next line starts, so that I can re-open the file later and return to that position quickly.
Questions
Is there an easy way to do both, preferably using standard Java libraries?
If not, what is a reasonable workaround?
Attributes of an ideal solution
An ideal solution would handle multiple character encodings. This includes UTF-8, in which different characters may be represented by different numbers of bytes.
An ideal solution would rely mostly on a trusted, well-supported library. Most ideal would be the standard Java library. Second best would be an Apache or Google library.
The solution must be scalable. Reading the entire file into memory is not a solution.
Returning to a position should not require reading all prior characters in linear time.
Details
For the first requirement, BufferedReader.readLine() is attractive. But buffering clearly interferes with getting a meaningful file position.
Less obviously, InputStreamReader also can read ahead, interfering with getting the file position. From the InputStreamReader documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
The method RandomAccessFile.readLine() reads a single byte per character.
Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.
If you construct a BufferedReader over an InputStreamReader that wraps a FileInputStream, and keep the FileInputStream accessible to your code, you should be able to get the position of the next line by calling:
fileInputStream.getChannel().position();
after a call to bufferedReader.readLine(). (FileReader itself does not expose getChannel(), which is why the FileInputStream is kept.)
The BufferedReader could be constructed with an input buffer of size 1 if you're willing to trade performance gains for positional precision.
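A minimal sketch of that approach (my own illustration with a hypothetical file name; note that the InputStreamReader may still decode a few bytes ahead, so the position is an approximation unless the buffers are kept very small):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class LinePositions {
    public static void main(String[] args) throws IOException {
        FileInputStream fis = new FileInputStream("data.txt");
        // Buffer size 1 keeps the channel position close to the start of the next line.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fis, StandardCharsets.UTF_8), 1);
        String line;
        while ((line = reader.readLine()) != null) {
            long approxNextLineStart = fis.getChannel().position();
            System.out.println(line + "  (next line starts near byte " + approxNextLineStart + ")");
        }
        reader.close();
    }
}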
Alternate Solution
What would be wrong with keeping track of the bytes yourself?
long startingPoint = 0; // or the starting position if this file has been previously processed
while (readingLines) {
    String line = bufferedReader.readLine();
    startingPoint += line.getBytes().length;
}
This would give you a byte count accurate to what you've already processed, regardless of underlying marking or buffering. You'd have to account for line endings in your tally, since they are stripped by readLine().
This partial workaround addresses only files encoded with 7-bit ASCII or UTF-8. An answer with a general solution is still desirable (as is criticism of this workaround).
In UTF-8:
All single-byte characters can be distinguished from all bytes in multi-byte characters. All the bytes in a multi-byte character have a '1' in the high-order position. In particular, the bytes representing LF and CR cannot be part of a multi-byte character.
All single-byte characters are in 7-bit ASCII. So we can decode a file containing only 7-bit ASCII characters with a UTF-8 decoder.
Taken together, those two points mean we can read a line with something that reads bytes, rather than characters, then decode the line.
To avoid problems with buffering, we can use RandomAccessFile. That class provides methods to read a line, and get/set the file position.
Here's a sketch of code to read the next line as UTF-8 using RandomAccessFile.
protected static String readNextLineAsUTF8( RandomAccessFile in ) throws IOException {
    String rv = null;
    // readLine() maps each byte to a char with the high bits zeroed (effectively ISO-8859-1),
    // so re-encode with ISO-8859-1 to recover the raw bytes before decoding them as UTF-8.
    String lineBytes = in.readLine();
    if ( null != lineBytes ) {
        rv = new String( lineBytes.getBytes( StandardCharsets.ISO_8859_1 ),
                         StandardCharsets.UTF_8 );
    }
    return rv;
}
Then the file position can be obtained from the RandomAccessFile immediately before calling that method. Given a RandomAccessFile referenced by in:
long startPos = in.getFilePointer();
String line = readNextLineAsUTF8( in );
The case seems to be solved by VTD-XML, a library able to quickly parse big XML files:
The latest Java VTD-XML ximpleware implementation, currently 2.13 (http://sourceforge.net/projects/vtd-xml/files/vtd-xml/), provides some code maintaining a byte offset after each call to the getChar() method of its IReader implementations.
IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java.
IReader implementations are provided for the following encodings:
ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258
I would suggest java.io.LineNumberReader. You can set and get the line number and therefore continue at a certain line index.
Since it is a BufferedReader, it is also capable of handling UTF-8.
Solution A
Use RandomAccessFile.readChar() or RandomAccessFile.readByte() in a loop.
Check for your EOL characters, then process that line.
The problem with anything else is that you would have to absolutely make sure you never read past the EOL character.
readChar() returns a char not a byte. So you do not have to worry about character width.
Reads a character from this file. This method reads two bytes from the file, starting at the current file pointer.
[...]
This method blocks until the two bytes are read, the end of the stream is detected, or an exception is thrown.
By using a RandomAccessFile and not a Reader, you are giving up Java's ability to decode the charset in the file for you. A BufferedReader would do so automatically.
There are several ways of overcoming this. One is to detect the encoding yourself and then use the correct read*() method. The other way would be to use a BoundedInputStream.
There is one in this question Java: reading strings from a random access file with buffered input
E.g. https://stackoverflow.com/a/4305478/16549
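A minimal sketch of Solution A (my own illustration, assuming the file is UTF-16BE so that readChar()'s two-byte big-endian reads line up with the encoding for BMP characters):

import java.io.EOFException;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ReadLineAndPosition {
    // Reads one line of UTF-16BE text; the caller records getFilePointer() beforehand
    // so it can seek back to the start of this line later.
    static String readUtf16beLine(RandomAccessFile in) throws IOException {
        StringBuilder line = new StringBuilder();
        try {
            while (true) {
                char c = in.readChar();   // reads two bytes, big-endian
                if (c == '\n') {
                    break;                // stop exactly at the EOL character
                }
                if (c != '\r') {
                    line.append(c);
                }
            }
        } catch (EOFException eof) {
            if (line.length() == 0) {
                return null;              // no more lines
            }
        }
        return line.toString();
    }
}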
RandomAccessFile has a function:
seek(long pos)
Sets the file-pointer offset, measured from the beginning of this file, at which the next read or write occurs.
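For example (a sketch combining getFilePointer() and seek(); the file name is hypothetical):

RandomAccessFile in = new RandomAccessFile("data.txt", "r");
long mark = in.getFilePointer();   // remember where the next line starts
String line = in.readLine();       // read that line (one byte per char, as noted above)
in.seek(mark);                     // jump back; the same line can now be read again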
Initially, I found the approach suggested by Andy Thomas (https://stackoverflow.com/a/30850145/556460) the most appropriate.
But unfortunately I couldn't succeed in converting the byte array (taken from RandomAccessFile.readLine) to a correct string in cases where the file line contains non-Latin characters.
So I reworked the approach by writing a function similar to RandomAccessFile.readLine itself, which collects the data of a line not into a String but directly into a byte array, and then constructs the desired String from that byte array.
The code below (in Kotlin) completely satisfied my needs.
After calling the function, file.channel.position() will return the exact position of the next line (if any):
fun RandomAccessFile.readEncodedLine(charset: Charset = Charsets.UTF_8): String? {
    val lineBytes = ByteArrayOutputStream()
    var c = -1
    var eol = false
    while (!eol) {
        c = read()
        when (c) {
            -1, 10 -> eol = true // \n
            13 -> { // \r
                eol = true
                val cur = filePointer
                if (read() != '\n'.toInt()) {
                    seek(cur)
                }
            }
            else -> lineBytes.write(c)
        }
    }
    return if (c == -1 && lineBytes.size() == 0)
        null
    else
        java.lang.String(lineBytes.toByteArray(), charset) as String
}

How exactly does the FileReader class work with a Unicode text file

As far as I know, the FileReader class is used to read characters, and Java uses the Unicode format, so it should read 2 bytes with each .read() call, i.e. 1 character at a time for a text file saved in a Unicode format. But my code below prints a space before each character.
import java.io.*;

class pro1 {
    static int t;
    public static void main(String argsp[]) {
        try {
            File f = new File("note.txt");
            FileReader g = new FileReader(f);
            while ((t = g.read()) != -1) {
                char b = (char) t;
                System.out.println("lets see what happens: " + b);
            }
        }
        catch (Exception e) {
            System.out.println("message: " + e);
        }
    }
}
...and java uses unicode format so it should read 2 byte with each .read() method...
Unicode is not a 2-byte format. Indeed, as Biffen says, it's not a character encoding at all. There are several Unicode character encodings: a mostly-single-byte one called UTF-8 (but some characters will be 2, 3, or 4 bytes), a mostly-two-byte one called UTF-16 (but some characters will be four bytes), and an always-four-byte one called UTF-32. There are also variations of these. Obligatory link: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
When you create a FileReader the way you have, it uses the default encoding for your platform, which is likely to be Windows-1252, ISO 8859-1, or UTF-8. To specify the format, you can use an InputStreamReader constructed with a specific encoding.
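For instance, a minimal sketch (my own code, assuming the file was saved in an editor's "Unicode" format, i.e. UTF-16 with a byte order mark, which would explain the extra bytes showing up as blanks):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

class pro1Fixed {
    public static void main(String[] args) {
        try {
            File f = new File("note.txt");
            // "UTF-16" honours the byte order mark that editors write for "Unicode" files.
            Reader g = new InputStreamReader(new FileInputStream(f), "UTF-16");
            int t;
            while ((t = g.read()) != -1) {
                System.out.println("lets see what happens: " + (char) t);
            }
            g.close();
        } catch (IOException e) {
            System.out.println("message: " + e);
        }
    }
}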

Chinese Strings Handling in Java? [duplicate]

In my assigned project, the original author has written a function:
public String asString() throws DataException
{
    if (getData() == null) return null;
    CharBuffer charBuf = null;
    try
    {
        charBuf = s_charset.newDecoder().decode(ByteBuffer.wrap(f_data));
    }
    catch (CharacterCodingException e)
    {
        throw new DataException("You can't have a string from this ParasolBlob: " + this, e);
    }
    return charBuf.toString() + "你好";
}
Please note that the constant s_charset is defined as:
private static final Charset s_charset = Charset.forName("UTF-8");
Please also note that I have hard-coded a Chinese string in the return string.
Now when the program flow reaches this method, it will throw the following exception:
java.nio.charset.UnmappableCharacterException: Input length = 2
And more interestingly, the hard-coded Chinese string will be shown as "??" on the console if I do a System.out.println().
I think this problem is quite interesting with regard to localization. I've tried changing it to
Charset.forName("GBK");
but that does not seem to be the solution. Also, I have set the encoding of the Java source file to "UTF-8".
Any experts have experience in this regard? Would you please share a little? Thanks in advance!
And more interestingly, the hard-coded Chinese string will be shown as
"??" on the console if I do a System.out.println().
System.out performs transcoding from UTF-16 strings to the default JRE character encoding. If this does not match the encoding used by the device receiving the character data, the data is corrupted. So the console should be set to use the right character encoding (UTF-8) to render the Chinese characters properly.
If you are using Eclipse, you can change the console encoding by going to
Run Configuration -> Common -> Encoding (select UTF-8 from the dropdown).
Java Strings are Unicode:
System.out.println("你好");
As Kevin stated, whatever the underlying encoding of your source file is, that encoding will be used to convert it to UTF-16 (the internal representation of a Java String). So when you see "??" it is most likely a simple conversion error.
Now, if you want to convert a simple byte array to a String using a given character encoding, I believe there is a much easier way to do it than using a raw CharsetDecoder. That is:
byte[] bytes = {0x61};
String string = new String(bytes, Charset.forName("UTF-8"));
System.out.println(string);
This will work, provided that the byte array really contains a UTF-8 encoded stream of bytes. And it must be without a BOM, otherwise the conversion will probably not give what you expect. Make sure that what you are trying to convert does not start with the byte sequence 0xEF 0xBB 0xBF.
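A small sketch of that BOM check (my own illustration, not part of the original answer):

import java.nio.charset.StandardCharsets;

public class Utf8Decode {
    // Decode UTF-8 bytes to a String, skipping a leading UTF-8 BOM (0xEF 0xBB 0xBF) if present.
    static String decodeUtf8(byte[] bytes) {
        int offset = 0;
        if (bytes.length >= 3
                && bytes[0] == (byte) 0xEF
                && bytes[1] == (byte) 0xBB
                && bytes[2] == (byte) 0xBF) {
            offset = 3;
        }
        return new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61 };
        System.out.println(decodeUtf8(withBom));   // prints "a"
    }
}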

Writing Unicode to an RTF file

I'm trying to write strings in different languages to an RTF file. I have tried a few different things.
I use Japanese here as an example, but it's the same for the other languages I have tried.
public void writeToFile() {
    String strJapanese = "日本語";
    DataOutputStream outStream;
    File file = new File("C:\\file.rtf");
    try {
        outStream = new DataOutputStream(new FileOutputStream(file));
        outStream.writeBytes(strJapanese);
        outStream.close();
    } catch (Exception e) {
        System.out.println(e.toString());
    }
}
I have also tried:
byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);
Or more specific:
byte[] b = strJapanese.getBytes("Shift-JIS");
String output = new String(b);
The output stream also has the writeUTF method:
outStream.writeUTF(strJapanese);
You can use the byte[] directly in the output stream with the write method. All of the above give me garbled characters for everything except West European languages. To see if it works, I have tried opening the resulting document in Notepad++ and setting the appropriate encoding. I have also used OpenOffice, where you get to choose the encoding and font when opening the document.
If it does work but my computer can't open it properly, is there a way to check that?
By default, strings in Java are Unicode (stored as UTF-16 internally), but when you want to write them out you need to specify an encoding:
try {
    FileOutputStream fos = new FileOutputStream("test.txt");
    Writer out = new OutputStreamWriter(fos, "UTF8");
    out.write(str);
    out.close();
} catch (IOException e) {
    e.printStackTrace();
}
ref: http://download.oracle.com/javase/tutorial/i18n/text/stream.html
DataOutputStream outStream;
You probably don't want a DataOutputStream for writing an RTF file. DataOutputStream is for writing binary structures to a file, but RTF is text-based. Typically an OutputStreamWriter, setting the appropriate charset in the constructor would be the way to write to text files.
outStream.writeBytes(strJapanese);
In particular this fails because writeBytes really does write bytes, even though you pass it a String. A much more appropriate datatype would have been byte[], but that's just one of the places where Java's handling of bytes vs chars is confusing. The way it converts your string to bytes is simply by taking the lower eight bits of each UTF-16 code unit, and throwing the rest away. This results in ISO-8859-1 encoding with garbled nonsense for all the characters that don't exist in ISO-8859-1.
byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);
This doesn't really do anything useful. You encode to UTF-8 bytes and then decode that back to a String using the default charset. It's almost always a mistake to touch the default charset, as it is unpredictable across different machines.
outStream.writeUTF(strJapanese);
This would be a better stab at writing UTF-8, but it's still not quite right as it uses Java's bogus “modified UTF-8” encoding, and more importantly RTF files don't actually support UTF-8, and shouldn't really directly include any non-ASCII characters at all.
Traditionally, non-ASCII characters from 128 upwards should be written as hex byte escapes like \'80, and the encoding for them is specified, if it is at all, in font \fcharset and \cpg escapes that are very, very annoying to deal with, and which don't offer UTF-8 as one of the options.
In more modern RTF, you get \u1234x escapes as in Dabbler's answer (+1). Each escape encodes one UTF-16 code unit, which corresponds to a Java char, so it's not too difficult to regex-replace all non-ASCII characters with their escaped variants.
This is supported by Word 97 and later, but some other tools may ignore the Unicode and fall back to the replacement character that follows the escape.
RTF is not a very nice format.
You can write any Unicode character expressed as its decimal number by using the \u control word. E.g. \u1234? would represent the character whose Unicode code point is 1234, and ? is the replacement character for cases where the character cannot be adequately represented (e.g. because the font doesn't contain it).
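A minimal sketch of producing such escapes in Java (my own illustration; RTF's \uN takes a signed 16-bit decimal value, so chars above 0x7FFF come out as negative numbers):

// Convert a String to 7-bit-ASCII-safe RTF text, escaping non-ASCII chars as \uN? sequences.
static String toRtfEscapes(String text) {
    StringBuilder rtf = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
        char c = text.charAt(i);
        if (c == '\\' || c == '{' || c == '}') {
            rtf.append('\\').append(c);    // escape RTF syntax characters
        } else if (c < 128) {
            rtf.append(c);                 // plain ASCII passes through unchanged
        } else {
            // \uN takes a signed 16-bit decimal; '?' is the fallback replacement character.
            rtf.append("\\u").append((int) (short) c).append('?');
        }
    }
    return rtf.toString();
}

For example, toRtfEscapes("日本語") produces \u26085?\u26412?\u-30050?, which can then be placed in the body of the RTF document.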
