How exactly does the FileReader class work with a Unicode text file - Java

As far as I know, the FileReader class is used to read characters, and Java uses Unicode, so I expected each .read() call to read 2 bytes, i.e. one character at a time from a text file saved in Unicode format. But my code below prints a space before each character.
import java.io.*;

class pro1 {
    static int t;

    public static void main(String argsp[]) {
        try {
            File f = new File("note.txt");
            FileReader g = new FileReader(f);
            while ((t = g.read()) != -1) {
                char b = (char) t;
                System.out.println("lets see what happens: " + b);
            }
        } catch (Exception e) {
            System.out.println("message: " + e);
        }
    }
}

...and Java uses Unicode, so I expected each .read() call to read 2 bytes...
Unicode is not a 2-byte format. Indeed, as Biffen says, it's not a character encoding at all. There are several Unicode character encodings: A mostly-single-byte one called UTF-8 (but some characters will be 2, 3, or 4 bytes), a mostly-two-byte one called UTF-16 (but some characters will be four), and an always-four-byte one called UTF-32. There are also variations of these. Obligatory link: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
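To make that concrete, here is a small sketch (the sample string and charsets are just illustrative) showing that the byte count of the same text differs per encoding; UTF-32 is present in standard JDKs but is not one of the guaranteed charsets:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String s = "Aé€"; // a 1-byte, a 2-byte, and a 3-byte character in UTF-8
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 6
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 6
        System.out.println(s.getBytes(Charset.forName("UTF-32")).length); // 12, if UTF-32 is available
    }
}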
When you create a FileReader the way you have, it uses the default encoding for your platform, which is likely to be Windows-1252, ISO 8859-1, or UTF-8. To specify the format, you can use an InputStreamReader constructed with a specific encoding.
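For example, if note.txt was saved by Notepad as "Unicode" (UTF-16 with a BOM), a sketch like the following would decode it correctly; the charset here is an assumption and must match however the file was actually saved:
import java.io.*;
import java.nio.charset.StandardCharsets;

public class ReadWithCharset {
    public static void main(String[] args) {
        // UTF_16 honours the BOM; swap in whatever charset your file actually uses.
        try (Reader r = new InputStreamReader(new FileInputStream("note.txt"),
                StandardCharsets.UTF_16)) {
            int t;
            while ((t = r.read()) != -1) {
                System.out.println("lets see what happens: " + (char) t);
            }
        } catch (IOException e) {
            System.out.println("message: " + e);
        }
    }
}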


Using Get-Content in PowerShell as Java input gets an extra character

I am practicing using the command line to run a Java program on Windows 10. The Java program uses a Scanner(System.in) to read input from a file and print the strings it gets from the file. The PowerShell command is as follows:
Get-Content source.txt | java test.TestPrint
The content of the source.txt file is as follows:
:
a
2
!
And the TestPrint.java file is as follows:
package test;

import java.util.Scanner;

public class TestPrint {

    public static void main(String[] args) {
        // TODO Auto-generated method stub
        Scanner in = new Scanner(System.in);
        while (in.hasNextLine()) {
            String str = in.nextLine();
            if (str.equals("q")) break;
            System.out.println(str);
        }
    }
}
Then a weird thing happened. The result is
?:
a
2
!
You see, it adds a question mark at the beginning of the first line. Then, when I change the character in the first line of the source.txt file from ":" to "a", the result is
a
a
2
!
It adds a space at the beginning of the first line.
I tested different characters and found a pattern: if the character is larger than "?" (63 in ASCII), it adds a space, for example "A" (65 in ASCII) or "[" (91 in ASCII). If the character is smaller than or equal to "?", it adds a question mark.
Could this be a Unicode issue (see: Java Unicode problems)? I.e. try specifying the encoding you want to read in:
Scanner in = new Scanner(System.in, "UTF-8");
EDIT:
Upon further research: in PowerShell 5.1 and earlier, the default code page is Windows-1252. PowerShell 6+ and the cross-platform versions have switched to UTF-8. So (from the comments) you may have to specify the Windows-1252 encoding:
Scanner in = new Scanner(System.in, "Windows-1252");
To find out what encoding is being used, execute the following in PowerShell:
[System.Text.Encoding]::Default
And you should be able to see what encoding is being used (for me in PowerShell v 5.1 it was Windows-1252, for PowerShell 6 it was UTF-8).
There is no text but encoded text.
Every program reading a text file or stream must know and use the same character encoding that the writer used.
An adaptive default character encoding is a 90s solution to a 70s and 80s problem (approx). Today, it's usually best to avoid constructors and methods that use a default, and in PowerShell, add an encoding argument where needed to control input or output.
To prevent data loss, you can use the Unicode character set throughout. UTF-8 is the most common for files and streams. (PowerShell and Java use UTF-16 for text datatypes.)
But you need to start from what you know the character encoding of the text file is. If you don't know this metadata, that's data loss right there.
Unicode provides that if a file or stream is known to be Unicode, it can start with metadata called a BOM. The BOM indicates which specific Unicode character encoding is being used and what the byte order is (for character encodings with code units longer than a byte). [This provision doesn't solve any problem that I've seen and causes problems of its own.]
(A character encoding, at the abstract level, is a map between codepoints and code units and is therefore independent of byte order. In practice, a character encoding takes the additional step of serializing/deserializing code units to/from byte sequences. So, sometimes using or not using a BOM is included in the encoding's name or description. A BOM might also be referred to as a signature. Ergo, "UTF-8 with signature.")
As metadata, a BOM, if present, should be used if needed and always discarded when putting text into text datatypes. Unfortunately, Java's standard libraries don't discard the BOM. You can use popular libraries or a dozen or so lines of your own code to do this.
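For illustration, a minimal sketch of the "dozen or so lines" approach for UTF-8, assuming you are reading a stream that may or may not start with the EF BB BF signature (the class and method names here are made up for the example):
import java.io.*;
import java.nio.charset.StandardCharsets;

public final class Utf8Bom {
    // Sketch: wrap a stream so a leading UTF-8 BOM (EF BB BF) is skipped before decoding.
    public static Reader readerWithoutBom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] bom = new byte[3];
        int n = pin.read(bom, 0, 3);
        boolean hasBom = n == 3
                && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF;
        if (!hasBom && n > 0) {
            pin.unread(bom, 0, n); // not a BOM: put the bytes back
        }
        return new InputStreamReader(pin, StandardCharsets.UTF_8);
    }
}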
Again, start with knowing the character encoding of the text file and insert that metadata into the processing as an argument.

FileInputStream and Unicode in Java

I'm new to Java and am trying to understand byte streams and character streams. Many people say that a byte stream is suitable only for the ASCII character set, while a character stream can support all character sets (ASCII, Unicode, etc.). I think there is a misunderstanding here, because I can use a byte stream to read and write a Unicode character.
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;

public class DemoApp {

    public static void main(String args[]) {
        FileInputStream fis = null;
        FileOutputStream fos = null;
        try {
            fis = new FileInputStream("abc.txt");
            fos = new FileOutputStream("def.txt");
            int k;
            while ((k = fis.read()) != -1) {
                fos.write(k);
                System.out.print((char) k);
            }
        }
        catch (FileNotFoundException fnfe) {
            System.out.printf("ERROR: %s", fnfe);
        }
        catch (IOException ioe) {
            System.out.printf("ERROR: %s", ioe);
        }
        finally {
            try {
                if (fos != null)
                    fos.close();
            }
            catch (IOException ioe) {
                System.out.printf("ERROR: %s", ioe);
            }
            try {
                if (fis != null)
                    fis.close();
            }
            catch (IOException ioe) {
                System.out.printf("ERROR: %s", ioe);
            }
        }
    }
}
The abc.txt file contains the Unicode character Ǽ and I saved the file using UTF-8 encoding. The code works fine: it creates a new file def.txt, and this file contains the Unicode character Ǽ.
And I have 2 questions:
What is the truth about byte streams regarding Unicode characters? Do byte streams support Unicode characters or not?
When I try to print with s.o.p((char) k), the result is not the Unicode character; it is just the two characters Ç¼. I don't understand why the result is not the Unicode character, because I know that Java and the char data type support Unicode characters. I tried to save this code as UTF-8 but the problem persists.
Sorry for my English grammar, and thank you in advance!
What is the truth about byte streams regarding Unicode characters? Do byte streams support Unicode characters or not?
In fact, there is no such thing as a "Unicode character". There are three distinct concepts that you should NOT mix up.
Unicode code points
Characters in some encoding of a sequence of code points.
The Java char type, which is neither of the above. Strictly speaking.
You need to do some serious background reading on this:
The Wikipedia pages on Unicode
https://www.w3.org/International/talks/0505-unicode-intro/
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Having cleared that up, we can say that while a byte stream can be used to read an encoding of a sequence of Unicode code points, the stream API is NOT designed for the purpose of reading and writing character-based text of any form. It is designed for reading and writing sequences of bytes (8-bit binary values) ... which may represent anything. The stream API is designed to be agnostic of what the bytes represent: it doesn't know, and doesn't care!
When I try to print with s.o.p((char) k), the result is not the Unicode character; it is just the two characters Ç¼. I don't understand why the result is not the Unicode character, because I know that Java and the char data type support Unicode characters. I tried to save this code as UTF-8 but the problem persists.
(Correction. Those are NOT ASCII characters, they are LATIN-1 characters!)
The problem is not in Java. The problem is that a console is configured to expect text to be sent to it with a specific character encoding, but you are sending it characters with a different encoding.
When you read and write characters using a stream, the stream doesn't know and doesn't care about the encoding. So, if you read a file that is valid UTF-8 encoded text and use a stream to write it to a console that expects (for example) LATIN-1, then the result is typically garbage.
Another way to get garbage (which is what is happening here) is to read an encoded file as a sequence of bytes, and then cast the bytes to characters and print the characters. That is the wrong thing to do. If you want the characters to come out correctly, you need to decode the bytes into a sequence of characters and then print the characters. Casting is not decoding.
If you were reading the bytes via a Reader, the decoding would happen automatically, and you would not get that kind of mangling. (You might possibly get another kind ... if the console was not capable of displaying the characters, or if you configured the Reader stack to decode with the wrong charset.)
In summary: If you are trying to make a literal copy of a file (for example), use a byte stream. If you are trying to process the file as text, use a character stream.
The problem with your example code is that you appear to be trying to do both things at the same time with one pass through the file; i.e. make a literal copy of the file AND display it as text on the console. That is technically possible ... but difficult. My advice: don't try to do both things at the same time.
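As a hedged sketch of the "process the file as text" route: decode with a Reader using the encoding the file was actually written in (UTF-8 here, matching what the question says), write with a Writer, and only then print the decoded characters. The file names mirror the question; whether the console displays the characters correctly still depends on its own encoding.
import java.io.*;
import java.nio.charset.StandardCharsets;

public class CopyAsText {
    public static void main(String[] args) throws IOException {
        // Decode abc.txt as UTF-8 and re-encode def.txt as UTF-8.
        try (Reader in = new InputStreamReader(new FileInputStream("abc.txt"), StandardCharsets.UTF_8);
             Writer out = new OutputStreamWriter(new FileOutputStream("def.txt"), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                out.write(c);
                System.out.print((char) c); // a decoded character, not a raw byte
            }
        }
    }
}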

Why does my charset encoding conversion only work for lower case letters?

I have made a workaround for my web application, as I failed to set the character encoding to UTF-8 in all scopes when first creating it. I made a simple character conversion Java class so that I could insert character encoding conversion where needed. These are my methods for that:
public static String encodeUTF8ToLatin(String s) throws UnsupportedEncodingException {
    byte[] b = s.getBytes("UTF-8");
    return new String(b, "ISO-8859-1");
}

public static String encodeLatinToUTF8(String s) throws UnsupportedEncodingException {
    byte[] b = s.getBytes("ISO-8859-1");
    return new String(b, "UTF-8");
}
I am using these methods because of the special Danish/Norwegian characters ÆØÅ æøå. They have been working well for a while now, but I just discovered that the second method can't convert upper case characters. When sending the string "ÆØÅ æøå" it returns "?????? æøå". This confuses me, as the conversion table found here seems to claim that all six characters follow the same encoding.
Does anyone know why my upper case characters do not convert properly here?
UPDATE:
From the answers provided, I can tell that I have some serious gaps in my knowledge regarding charsets and encoding. I think I have to just go back to basics, read more, and then I'll decide if the question is salvageable afterwards.
Your encodeLatinToUTF8 converts a Unicode String to a byte array using ISO-8859-1 encoding. Then it decodes that ISO-8859-1 encoded byte array pretending that it is UTF-8 (there is your problem) and converts it to a Unicode string.
Same with the other method.
Your methods are a bit pointless. Strings don't have an encoding, as they are already decoded to characters. A character encoding is a way to represent characters as 8-bit numbers, so it only makes sense in a byte array context.
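To illustrate that last point, a small sketch (the sample string is just an example): encoding to bytes and decoding those bytes with the same charset round-trips the text, while decoding them with a different charset is exactly the kind of deliberate mangling the methods above perform:
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        String s = "ÆØÅ æøå";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        // Same charset in and out: the original string comes back unchanged.
        System.out.println(new String(utf8, StandardCharsets.UTF_8));      // ÆØÅ æøå

        // Mismatched charsets: the non-ASCII characters are garbled.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // mojibake, not ÆØÅ æøå
    }
}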
I finally made it work. I simply used "Windows-1252" instead of "ISO-8859-1" to get the bytes before creating a new string using UTF-8.
I created a new method, which works for both lower case and upper case letters:
public static String encodeWindows1252ToUTF8(String s) throws UnsupportedEncodingException {
    byte[] b = s.getBytes("Windows-1252");
    return new String(b, "UTF-8");
}
I found this answer, by referring to this page, which states:
Symptom: The following characters fail, while other characters display correctly:
€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ
The trademark and Euro currency symbol, ellipsis, single and double "smart quotes", en and em dash, and the OE ligature characters are used frequently and are most likely to be reported as a symptom of this problem.
Explanation: The characters in the range 0x80-0x9F (128-159) ... are in Windows-1252 and not in ISO-8859-1. If you have a problem with characters in that range only, it is because the characters are treated as ISO-8859-1 and not Windows-1252.
Look for references to ISO-8859-1 and replace them with "Windows-1252" (or CP1252, or the correct character encoding name for the library or platform you are using).
The three characters that failed were Æ, Ø, and Å, whose mis-decoded byte sequences all involve characters from the list above.

Read lines of characters and get file position

I'm reading sequential lines of characters from a text file. The encoding of the characters in the file might not be single-byte.
At certain points, I'd like to get the file position at which the next line starts, so that I can re-open the file later and return to that position quickly.
Questions
Is there an easy way to do both, preferably using standard Java libraries?
If not, what is a reasonable workaround?
Attributes of an ideal solution
An ideal solution would handle multiple character encodings. This includes UTF-8, in which different characters may be represented by different numbers of bytes.
An ideal solution would rely mostly on a trusted, well-supported library. Most ideal would be the standard Java library. Second best would be an Apache or Google library.
The solution must be scalable. Reading the entire file into memory is not a solution.
Returning to a position should not require reading all prior characters in linear time.
Details
For the first requirement, BufferedReader.readLine() is attractive. But buffering clearly interferes with getting a meaningful file position.
Less obviously, InputStreamReader also can read ahead, interfering with getting the file position. From the InputStreamReader documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
The method RandomAccessFile.readLine() reads a single byte per character.
Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.
If you construct a BufferedReader from an InputStreamReader wrapped around a FileInputStream, and keep the FileInputStream accessible to your code, you should be able to get the position of the next line by calling:
fileInputStream.getChannel().position();
after a call to bufferedReader.readLine(). (FileReader itself does not expose its underlying channel, which is why the FileInputStream is needed here.)
The BufferedReader could be constructed with an input buffer of size 1 if you're willing to trade performance gains for positional precision.
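A hedged sketch of that idea, assuming a UTF-8 file named data.txt; note that the InputStreamReader may still read ahead slightly, so the reported position is close to, but not guaranteed to be exactly, the start of the next line:
import java.io.*;
import java.nio.charset.StandardCharsets;

public class LinePositions {
    public static void main(String[] args) throws IOException {
        try (FileInputStream fis = new FileInputStream("data.txt");
             BufferedReader br = new BufferedReader(
                     new InputStreamReader(fis, StandardCharsets.UTF_8), 1)) { // 1-char buffer
            String line;
            while ((line = br.readLine()) != null) {
                // Channel position after the line (and its terminator) has been consumed.
                long nextLineStart = fis.getChannel().position();
                System.out.println(line + "  [next line starts near byte " + nextLineStart + "]");
            }
        }
    }
}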
Alternate Solution
What would be wrong with keeping track of the bytes yourself:
long startingPoint = 0; // or starting position if this file has been previously processed
String line;
while ((line = bufferedReader.readLine()) != null) {
    startingPoint += line.getBytes().length;
}
This would give you a byte count accurate to what you've already processed, regardless of underlying marking or buffering. You'd have to account for line endings in your tally, since they are stripped.
This partial workaround addresses only files encoded with 7-bit ASCII or UTF-8. An answer with a general solution is still desirable (as is criticism of this workaround).
In UTF-8:
All single-byte characters can be distinguished from all bytes in multi-byte characters. All the bytes in a multi-byte character have a '1' in the high-order position. In particular, the bytes representing LF and CR cannot be part of a multi-byte character.
All single-byte characters are in 7-bit ASCII. So we can decode a file containing only 7-bit ASCII characters with a UTF-8 decoder.
Taken together, those two points mean we can read a line with something that reads bytes, rather than characters, then decode the line.
To avoid problems with buffering, we can use RandomAccessFile. That class provides methods to read a line, and get/set the file position.
Here's a sketch of code to read the next line as UTF-8 using RandomAccessFile.
protected static String readNextLineAsUTF8(RandomAccessFile in) throws IOException {
    String rv = null;
    String lineBytes = in.readLine();
    if (null != lineBytes) {
        // readLine() maps each byte to the low 8 bits of a char, so recover the original
        // bytes with ISO-8859-1 before decoding them as UTF-8.
        rv = new String(lineBytes.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }
    return rv;
}
Then the file position can be obtained from the RandomAccessFile immediately before calling that method. Given a RandomAccessFile referenced by in:
long startPos = in.getFilePointer();
String line = readNextLineAsUTF8( in );
The case seems to be solved by VTD-XML, a library able to quickly parse big XML files:
The latest Java VTD-XML ximpleware implementation, currently 2.13 (http://sourceforge.net/projects/vtd-xml/files/vtd-xml/), provides some code maintaining a byte offset after each call to the getChar() method of its IReader implementations.
IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java.
IReader implementations are provided for the following encodings:
ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258
I would suggest java.io.LineNumberReader. You can set and get the line number and therefore continue at a certain line index.
Since it is a BufferedReader it is also capable of handling UTF-8.
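A minimal sketch of that suggestion, assuming a UTF-8 file named data.txt; note that a line number is not a byte offset, so returning to a line later still means re-reading and skipping the lines before it:
import java.io.*;
import java.nio.charset.StandardCharsets;

public class LineNumbers {
    public static void main(String[] args) throws IOException {
        try (LineNumberReader r = new LineNumberReader(
                new InputStreamReader(new FileInputStream("data.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                // getLineNumber() reports the number of lines read so far.
                System.out.println(r.getLineNumber() + ": " + line);
            }
        }
    }
}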
Solution A
Use RandomAccessFile.readChar() or RandomAccessFile.readByte() in a loop.
Check for your EOL characters, then process that line.
The problem with anything else is that you would have to absolutely make sure you never read past the EOL character.
readChar() returns a char not a byte. So you do not have to worry about character width.
Reads a character from this file. This method reads two bytes from the file, starting at the current file pointer.
[...]
This method blocks until the two bytes are read, the end of the stream is detected, or an exception is thrown.
By using a RandomAccessFile and not a Reader you are giving up Java's ability to decode the charset in the file for you. A BufferedReader would do so automatically.
There are several ways of overcoming this. One is to detect the encoding yourself and then use the correct read*() method. The other way would be to use a BoundedInputStream.
There is one in this question Java: reading strings from a random access file with buffered input
E.g. https://stackoverflow.com/a/4305478/16549
RandomAccessFile has a function:
seek(long pos)
Sets the file-pointer offset, measured from the beginning of this file, at which the next read or write occurs.
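Putting getFilePointer() and seek() together, a small sketch of saving a position and jumping back to it later (the file name is just an example):
import java.io.*;

public class ResumeAtPosition {
    public static void main(String[] args) throws IOException {
        long savedPos;
        try (RandomAccessFile in = new RandomAccessFile("data.txt", "r")) {
            in.readLine();                  // consume the first line
            savedPos = in.getFilePointer(); // position where the second line starts
        }

        // Later, even after re-opening the file: jump straight back.
        try (RandomAccessFile in = new RandomAccessFile("data.txt", "r")) {
            in.seek(savedPos);
            System.out.println(in.readLine()); // the second line, without re-reading the first
        }
    }
}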
Initially, I found the approach suggested by Andy Thomas (https://stackoverflow.com/a/30850145/556460) the most appropriate.
But unfortunately I couldn't succeed in converting the byte array (taken from RandomAccessFile.readLine) to a correct string in cases where the file line contains non-Latin characters.
So I reworked the approach by writing a function similar to RandomAccessFile.readLine itself that collects the line's data not into a string but directly into a byte array, and then constructs the desired String from that byte array.
So the following code completely satisfied my needs (in Kotlin).
After calling the function, file.channel.position() will return the exact position of the next line (if any):
fun RandomAccessFile.readEncodedLine(charset: Charset = Charsets.UTF_8): String? {
    val lineBytes = ByteArrayOutputStream()
    var c = -1
    var eol = false
    while (!eol) {
        c = read()
        when (c) {
            -1, 10 -> eol = true // \n
            13 -> { // \r
                eol = true
                val cur = filePointer
                if (read() != '\n'.toInt()) {
                    seek(cur)
                }
            }
            else -> lineBytes.write(c)
        }
    }
    return if (c == -1 && lineBytes.size() == 0)
        null
    else
        java.lang.String(lineBytes.toByteArray(), charset) as String
}

Writing unicode to rtf file

I'm trying to write strings in different languages to an RTF file. I have tried a few different things.
I use Japanese here as an example, but it's the same for the other languages I have tried.
public void writeToFile() {
    String strJapanese = "日本語";
    DataOutputStream outStream;
    File file = new File("C:\\file.rtf");
    try {
        outStream = new DataOutputStream(new FileOutputStream(file));
        outStream.writeBytes(strJapanese);
        outStream.close();
    } catch (Exception e) {
        System.out.println(e.toString());
    }
}
I have also tried:
byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);
Or more specific:
byte[] b = strJapanese.getBytes("Shift-JIS");
String output = new String(b);
The output stream also has the writeUTF method:
outStream.writeUTF(strJapanese);
You can use the byte[] directly in the output stream with the write method. All of the above gives me garbled characters for everything except West European languages. To see if it works, I have tried opening the resulting document in Notepad++ and setting the appropriate encoding. I have also used OpenOffice, where you get to choose the encoding and font when opening the document.
If it does work but my computer can't open it properly, is there a way to check that?
Strings in Java are Unicode (UTF-16 internally), but when you want to write them out you need to specify an encoding:
try {
    FileOutputStream fos = new FileOutputStream("test.txt");
    Writer out = new OutputStreamWriter(fos, "UTF8");
    out.write(str);
    out.close();
} catch (IOException e) {
    e.printStackTrace();
}
ref: http://download.oracle.com/javase/tutorial/i18n/text/stream.html
DataOutputStream outStream;
You probably don't want a DataOutputStream for writing an RTF file. DataOutputStream is for writing binary structures to a file, but RTF is text-based. Typically an OutputStreamWriter, with the appropriate charset set in the constructor, would be the way to write text files.
outStream.writeBytes(strJapanese);
In particular this fails because writeBytes really does write bytes, even though you pass it a String. A much more appropriate datatype would have been byte[], but that's just one of the places where Java's handling of bytes vs chars is confusing. The way it converts your string to bytes is simply by taking the lower eight bits of each UTF-16 code unit, and throwing the rest away. This results in ISO-8859-1 encoding with garbled nonsense for all the characters that don't exist in ISO-8859-1.
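To make the "lower eight bits" point concrete, a tiny sketch using the Japanese string from the question; it just prints what writeBytes would keep of each char:
public class WriteBytesDemo {
    public static void main(String[] args) {
        String s = "日本語";
        for (char c : s.toCharArray()) {
            // DataOutputStream.writeBytes keeps only the low 8 bits of each UTF-16 code unit.
            System.out.printf("U+%04X -> byte 0x%02X%n", (int) c, (int) c & 0xFF);
        }
        // Prints: U+65E5 -> byte 0xE5, U+672C -> byte 0x2C, U+8A9E -> byte 0x9E
    }
}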
byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);
This doesn't really do anything useful. You encode to UTF-8 bytes and then decode that back to a String using the default charset. It's almost always a mistake to touch the default charset, as it is unpredictable over different machines.
outStream.writeUTF(strJapanese);
This would be a better stab at writing UTF-8, but it's still not quite right as it uses Java's bogus “modified UTF-8” encoding, and more importantly RTF files don't actually support UTF-8, and shouldn't really directly include any non-ASCII characters at all.
Traditionally, non-ASCII characters from 128 upwards should be written as hex byte escapes like \'80, and the encoding for them is specified, if it is at all, in font \fcharset and \cpg escapes that are very, very annoying to deal with, and don't offer UTF-8 as one of the options.
In more modern RTF, you get \u1234x escapes as in Dabbler's answer (+1). Each escape encodes one UTF-16 code unit, which corresponds to a Java char, so it's not too difficult to regex-replace all non-ASCII characters with their escaped variants.
This is supported by Word 97 and later but some other tools may ignore the Unicode and fall back to the x replacement character.
RTF is not a very nice format.
You can write any Unicode character expressed as its decimal number by using the \u control word. E.g. \u1234? would represent the character whose Unicode code point is 1234, and ? is the replacement character for cases where the character cannot be adequadely represented (e.g. because the font doesn't contain it).
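As a hedged sketch of the regex-replace idea mentioned above, here is one way to escape every non-ASCII UTF-16 code unit as a signed-decimal \uN escape followed by a '?' fallback; it is illustrative only (a real RTF writer would also need to escape \, { and }):
public class RtfEscape {
    // Sketch: replace each non-ASCII UTF-16 code unit with an RTF \uN? escape.
    static String escapeForRtf(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                sb.append(c); // plain ASCII passes through
            } else {
                // \u takes a signed 16-bit decimal; '?' is the fallback for readers
                // that ignore the Unicode escape.
                sb.append("\\u").append((int) (short) c).append('?');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeForRtf("日本語")); // \u26085?\u26412?\u-30050?
    }
}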
