Reading an ASCII file with FileChannel and ByteArrays


I have the following code:
String inputFile = "somefile.txt";
FileInputStream in = new FileInputStream(inputFile);
FileChannel ch = in.getChannel();
ByteBuffer buf = ByteBuffer.allocateDirect(BUFSIZE); // BUFSIZE = 256
/* read the file into a buffer, 256 bytes at a time */
int rd;
while ( (rd = ch.read( buf )) != -1 ) {
buf.rewind();
for ( int i = 0; i < rd/2; i++ ) {
/* print each character */
System.out.print(buf.getChar());
}
buf.clear();
}
But the characters get displayed as ?'s. Does this have something to do with Java using Unicode characters? How do I correct this?

You have to know what the encoding of the file is, and then decode the ByteBuffer into a CharBuffer using that encoding. Assuming the file is ASCII:
import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.*;
public class Buffer
{
    private static final int BUFSIZE = 256;
    public static void main(String args[]) throws Exception
    {
        String inputFile = "somefile";
        FileInputStream in = new FileInputStream(inputFile);
        FileChannel ch = in.getChannel();
        ByteBuffer buf = ByteBuffer.allocateDirect(BUFSIZE);
        Charset cs = Charset.forName("US-ASCII"); // Or whatever encoding you want
        /* read the file into a buffer, 256 bytes at a time */
        while ( ch.read( buf ) != -1 ) {
            /* flip() sets the limit to the bytes just read; rewind() would leave stale bytes visible */
            buf.flip();
            CharBuffer chbuf = cs.decode(buf);
            /* print each character; get() advances the position, so test hasRemaining() */
            while ( chbuf.hasRemaining() ) {
                System.out.print(chbuf.get());
            }
            buf.clear();
        }
        in.close();
    }
}

buf.getChar() is expecting 2 bytes per character but you are only storing 1. Use:
System.out.print((char) buf.get());
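A small sketch of the difference, in case it helps: getChar() combines two consecutive bytes into one 16-bit char (big-endian by default), while get() returns a single byte that can be widened to a char. The class and method names here are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class GetCharDemo {
    // Reads two bytes as one big-endian 16-bit char, the way getChar() does.
    static char asTwoByteChar(String ascii) {
        ByteBuffer buf = ByteBuffer.wrap(ascii.getBytes(StandardCharsets.US_ASCII));
        return buf.getChar();
    }

    // Reads one byte and widens it to a char, which is correct for ASCII data.
    static char asOneByteChar(String ascii) {
        ByteBuffer buf = ByteBuffer.wrap(ascii.getBytes(StandardCharsets.US_ASCII));
        return (char) buf.get();
    }

    public static void main(String[] args) {
        // 'h' is 0x68 and 'e' is 0x65, so getChar() yields the single char U+6865
        System.out.println(Integer.toHexString(asTwoByteChar("he"))); // 6865
        System.out.println(asOneByteChar("he"));                      // h
    }
}
```

U+6865 is not an ASCII character, so a console that cannot display it shows a ? instead, and two input bytes are consumed for every character printed.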

Changing your print statement to:
System.out.print((char)buf.get());
Seems to help.

Depending on the encoding of somefile.txt, a character may not actually be composed of two bytes. This page gives more information about how to read streams with the proper encoding.
The bummer is, the file system doesn't tell you the encoding of the file, because it doesn't know. As far as it's concerned, it's just a bunch of bytes. You must either find some way to communicate the encoding to the program, detect it somehow, or (if possible) always ensure that the encoding is the same (such as UTF-8).

Is there a particular reason why you are reading the file in the way that you do?
If you're reading in an ASCII file you should really be using a Reader.
I would do it something like:
File inputFile = new File("somefile.txt");
BufferedReader reader = new BufferedReader(new FileReader(inputFile));
And then use either readLine or similar to actually read in the data!
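A fuller sketch of the Reader approach might look like the following. It assumes the file really is ASCII and names the charset explicitly instead of relying on the platform default; the class name ReadLines and the helper readAll are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadLines {
    // Reads the file line by line, decoding it as US-ASCII.
    // Line terminators are normalised to '\n' in the returned text.
    static String readAll(Path path) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.US_ASCII)) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(readAll(Paths.get("somefile.txt")));
    }
}
```

Note that Files.newBufferedReader throws a MalformedInputException if the file contains bytes that are not valid in the given charset, which is a useful early warning that the ASCII assumption is wrong.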

Yes, it is Unicode.
If you have 14 Chars in your File, you only get 7 '?'.
Solution pending. Still thinking.

Related

Java Nio ByteBuffer truncate unicode characters when buffer reaches its bound

I was writing a function in java that can read file and get its content to String:
public static String ReadFromFile(String fileLocation) {
StringBuilder result = new StringBuilder();
RandomAccessFile randomAccessFile = null;
FileChannel fileChannel = null;
try {
randomAccessFile = new RandomAccessFile(fileLocation, "r");
fileChannel = randomAccessFile.getChannel();
ByteBuffer byteBuffer = ByteBuffer.allocate(10);
CharBuffer charBuffer = null;
int bytesRead = fileChannel.read(byteBuffer);
while (bytesRead != -1) {
byteBuffer.flip();
charBuffer = StandardCharsets.UTF_8.decode(byteBuffer);
result.append(charBuffer.toString());
byteBuffer.clear();
bytesRead = fileChannel.read(byteBuffer);
}
} catch (IOException ignored) {
} finally {
try {
if (fileChannel != null)
fileChannel.close();
if (randomAccessFile != null)
randomAccessFile.close();
} catch (IOException ignored) {
}
}
return result.toString();
}
From the code above you can see that I deliberately allocate only 10 bytes with 'ByteBuffer.allocate' to make the problem easier to see.
Now I want to read a file named "test.txt" that contains Unicode characters in Chinese, like this:
乐正绫我爱你乐正绫我爱你
Below is my test code for it:
System.out.println(ReadFromFile("test.txt"));
Expected Output in Console
乐正绫我爱你乐正绫我爱你
Actual Output in Console
乐正绫���爱你��正绫我爱你
Possible Reason
ByteBuffer only allocated 10 bytes, thus unicode characters are truncated every 10 bytes.
Attempt To Solve
When I increased the ByteBuffer allocation to 20 bytes, I got the result below:
乐正绫我爱你��正绫我爱你
Not A Robust Solution
Allocating a very large ByteBuffer, such as 102400 bytes, is not practical for very large text files.
Question
How to solve this problem?

You can't decode each chunk on its own, since a multi-byte UTF-8 character may be split across two reads and you don't know in advance how many bytes each character uses. You really don't want to rewrite that logic yourself.
There's Files.readString() in Java 11; for lower versions you can use Files.readAllBytes(), e.g.
Path path = new File(fileLocation).toPath();
String contents = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
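If reading in small chunks is still wanted, one option is to let a CharsetDecoder carry incomplete byte sequences over to the next read. The following is only a sketch under some assumptions: the buffer size must be at least 4 bytes (the longest UTF-8 sequence), malformed input is replaced rather than reported, and the class name ChunkedUtf8Reader and the method signature are made up for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChunkedUtf8Reader {
    // Assumption: bufferSize >= 4 so that any single UTF-8 sequence fits in the buffer.
    static String readFromFile(Path path, int bufferSize) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bytes = ByteBuffer.allocate(bufferSize);
        CharBuffer chars = CharBuffer.allocate(bufferSize);
        StringBuilder result = new StringBuilder();
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            boolean eof = false;
            while (!eof) {
                eof = channel.read(bytes) == -1;
                bytes.flip();
                CoderResult cr;
                do {
                    // With endOfInput == false, a trailing partial character is
                    // left unread in 'bytes' instead of being decoded as garbage.
                    cr = decoder.decode(bytes, chars, eof);
                    chars.flip();
                    result.append(chars);
                    chars.clear();
                } while (cr.isOverflow());
                // Move any unread (partial) bytes to the front for the next read.
                bytes.compact();
            }
            decoder.flush(chars);
            chars.flip();
            result.append(chars);
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readFromFile(Paths.get("test.txt"), 10));
    }
}
```

The important difference from the code in the question is compact() in place of clear(): bytes that belong to a character still waiting for its remaining bytes are kept rather than discarded.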

Character missing when use the InputStreamReader class in Java

I wrote some code to read a text file char by char and print it to the screen, but the result confused me. Here it is:
This is the code that I wrote:
import java.io.*;
import java.nio.charset.StandardCharsets;
public class learnIO
{
public static void main(String[] args) throws IOException{
var in = new InputStreamReader(new FileInputStream("test1.txt"), StandardCharsets.UTF_8);
while(in.read() != -1){
System.out.println((char)in.read());
}
}
}
the content and encoding scheme of the file:
file test1.txt
test1.txt: ASCII text
cat test1.txt
hello, world!
the result is:
e
l
,
w
r
d
Some chars are missing. Why did this happen?

The return type of InputStreamReader's read method is int, which takes 4 bytes,
and the char type is 2 bytes, so casting the int to char discards the upper 2 bytes.
refer to https://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html
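Looking at the loop in the question, read() is called twice per iteration: once in the while condition, where the result is thrown away, and once inside println. That alone drops every other character. A minimal sketch that stores the result of a single read() call and reuses it (the class name ReadEveryChar is hypothetical, and a StringReader stands in for the file):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadEveryChar {
    // Collects every character from the reader, calling read() once per iteration.
    static String readAll(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c; // kept as int so that -1 (end of stream) is distinguishable from a real char
        while ((c = in.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new StringReader("hello, world!"))); // hello, world!
    }
}
```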

You should wrap the InputStreamReader in a BufferedReader. The official Oracle documentation says:
An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or maybe given explicitly, or the platform's default charset may be accepted.
Each invocation of one of an InputStreamReader's read() methods may cause one or more bytes to be read from the underlying byte-input stream.
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
For top efficiency, consider wrapping an InputStreamReader within a BufferedReader. For example:
BufferedReader in
= new BufferedReader(new InputStreamReader(System.in));
So your problem can be solved using the following code:
try {
    // Open the file and decode its bytes as UTF-8
    FileInputStream fstream = new FileInputStream("hello.txt");
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream, StandardCharsets.UTF_8));
    // Read the file character by character. Keep the result as an int
    // so that -1 (end of stream) is not confused with a real character.
    int c;
    while ((c = br.read()) != -1) {
        // Print the character on the console
        System.out.println((char) c);
    }
    // Close the reader, which also closes the underlying stream
    br.close();
} catch (Exception e) { // Catch exception if any
    System.err.println("Error: " + e.getMessage());
}

How to convert a character into an integer in Java?

I am a beginner at Java, trying to figure out how to convert characters from a text file into integers. In the process, I wrote a program which generates a text file showing what characters are generated by what integers.
package numberchars;
import java.io.FileWriter;
import java.io.IOException;
import java.io.FileReader;
import java.lang.Character;
public class Numberchars {
public static void main(String[] args) throws IOException {
FileWriter outputStream = new FileWriter("NumberChars.txt");
//Write to the output file the char corresponding to the decimal
// from 1 to 255
int counter = 1;
while (counter <256)
{
outputStream.write(counter);
outputStream.flush();
counter++;
}
outputStream.close();
This generated NumberChars.txt, which had all the numbers, all the letters both upper and lower case, surrounded at each end by other symbols and glyphs.
Then I tried to read this file and convert its characters back into integers:
FileReader inputStream = new FileReader("NumberChars.txt");
FileWriter outputStream2 = new FileWriter ("CharNumbers.txt");
int c;
while ((c = inputStream.read()) != -1)
{
outputStream2.write(Character.getNumericValue(c));
outputStream2.flush();
}
}
}
The resulting file, CharNumbers.txt, began with the same glyphs as NumberChars.txt but then was blank. Opening the files in MS Word, I found NumberChars had 248 characters (including 5 spaces) and CharNumbers had 173 (including 8 spaces).
So why didn't the Character.getNumericValue(c) result in an integer written to CharNumbers.txt? And given that it didn't, why at least didn't it write an exact copy of NumberChars.txt? Any help much appreciated.

Character.getNumericValue doesn't do what you think it does. If you read the Javadoc:
Returns the int value that the specified character (Unicode code point) represents. For example, the character '\u216C' (the Roman numeral fifty) will return an int with a value of 50.
If the character has no numeric value it returns -1 (which looks like 0xFF_FF_FF_FF in two's complement).
Most characters don't have such a "numeric value," so you write the ints out, each padded to 2 bytes (more on that later), read them back in the same way, and then start writing a whole lot of 0xFFFF (-1 truncated to 2 bytes) courtesy of a misplaced Character.getNumericValue. I'm not sure what MS Word is doing, but it's probably getting confused what the encoding of your file is and glomming all those bytes into 0xFF_FF_FF_FF (because the high bits of each byte are set) and treating that as one character. (Use a text editor more suited to this kind of stuff like Notepad++, btw.) If you were to measure your file's size on disk in bytes it will probably still be 256 chars * 2 bytes/chars = 512 bytes.
I'm not sure what you meant to do here, so I'll note that InputStreamReader and OutputStreamWriter work on a (Unicode) character basis, with an encoder that defaults to the system one. That's why your ints are padded/truncated to 2 bytes. If you wanted pure byte IO, use FileInputStream/FileOutputStream. If you wanted to read and write the ints as Strings, you need to use FileWriter/FileReader, but not like you did.
// Just bytes
// This is a try-with-resources. It executes the code with the decls in it
// but is also like an implicit finally block that calls `close()` on each resource.
try(FileOutputStream fos = new FileOutputStream("bytes.bin")) {
for(int b = 0; b < 256; b++) { // Bytes are signed so we use int.
// This takes an int and truncates it for the lowest byte
fos.write(b);
// Can also fill a byte[] and dump it all at once with overloaded write.
}
}
byte[] bytes = new byte[256];
try(FileInputStream fis = new FileInputStream("bytes.bin")) {
// Reads up to bytes.length bytes into bytes
fis.read(bytes);
}
// Foreach loop. If you don't know what this does, I think you can figure out from the name.
for(byte b : bytes) {
System.out.println(b);
}
// As Strings
try(FileWriter fw = new FileWriter("strings.txt")) {
for(int i = 0; i < 256; i++) {
// You need a delimiter lest you not be able to tell 12 from 1,2 when you read
// Uses system default encoding
fw.write(Integer.toString(i) + "\n");
}
}
byte[] bytes = new byte[256];
try(
FileReader fr = new FileReader("strings.txt");
// FileReaders can't do stuff like "read one line to String" so we wrap it
BufferedReader br = new BufferedReader(fr);
) {
for(int i = 0; i < 256; i++) {
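// Byte.valueOf would throw NumberFormatException for 128..255, so parse as int and narrow
bytes[i] = (byte) Integer.parseInt(br.readLine());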
}
}
for(byte b : bytes) {
System.out.println(b);
}
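To make the difference between Character.getNumericValue and a plain conversion concrete, here is a small sketch. The class name CharToInt and the helper codeOf are made up; the Character methods are standard.

```java
public class CharToInt {
    // A char widens to int implicitly; the result is its UTF-16 code unit.
    static int codeOf(char c) {
        return c;
    }

    public static void main(String[] args) {
        System.out.println(codeOf('a'));                      // 97
        // getNumericValue treats letters as digits in bases up to 36
        System.out.println(Character.getNumericValue('a'));   // 10
        System.out.println(Character.getNumericValue('7'));   // 7
        // digit() with an explicit radix is the usual way to parse a decimal digit
        System.out.println(Character.digit('7', 10));         // 7
        // characters without a numeric value give -1
        System.out.println(Character.getNumericValue('$'));   // -1
    }
}
```

So for "what integer produced this character", a cast or implicit widening is what is needed; getNumericValue answers a different question.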

public class MyCLAss {
public static void main(String[] args)
{
char x='b';
System.out.println(+x); // just by writing a plus sign before the variable you get its ASCII value; this prints 98.
}
}

Sending string as byte array from C# to Java via socket

I am trying the following:
C# Client:
string stringToSend = "Hello man";
BinaryWriter writer = new BinaryWriter(mClientSocket.GetStream(),Encoding.UTF8);
//write number of bytes:
byte[] headerBytes = BitConverter.GetBytes(stringToSend.Length);
mClientSocket.GetStream().Write(headerBytes, 0, headerBytes.Length);
//write text:
byte[] textBytes = System.Text.Encoding.UTF8.GetBytes(stringToSend);
writer.Write(textBytes, 0, textBytes.Length);
Java Server:
Charset utf8 = Charset.forName("UTF-8");
BufferedReader in = new BufferedReader(new InputStreamReader(clientSocket.getInputStream(), utf8));
while (true) {
//we read header first
int headerSize = in.read();
int bytesRead = 0;
char[] input = new char[headerSize];
while (bytesRead < headerSize)
{
bytesRead += in.read(input, bytesRead, headerSize - bytesRead);
}
String resString = new String(input);
System.out.println(resString);
if (resString.equals("!$$$")) {
break;
}
}
The string size equals 9. That's correct on both sides. But when I read the string itself on the Java side, the data looks wrong. The char buffer (the 'input' variable) content looks like this:
",",",'H','e','l','l','o',''
I tried to change endianness by reversing the byte array. I also tried changing the string encoding between ASCII and UTF-8. I still feel like it relates to an endianness problem, but I cannot figure out how to solve it. I know I can use other types of writers to write text data to the stream, but I am trying to use raw byte arrays for the sake of learning.

These
byte[] headerBytes = BitConverter.GetBytes(stringToSend.Length);
are 4 bytes. And they aren't character data so it makes no sense to read them with a BufferedReader. Just read the bytes directly.
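InputStream in = clientSocket.getInputStream(); // raw bytes, not a Reader
byte[] headerBytes = new byte[4];
// shortcut: read() may return fewer than 4 bytes, so check the count in real code
in.read(headerBytes);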
Now extract your text's length and allocate enough space for it
int length = ByteBuffer.wrap(headerBytes).getInt();
byte[] textBytes = new byte[length];
Then read the text
int remaining = length;
int offset = 0;
while (remaining > 0) {
int count = in.read(textBytes, offset, remaining);
if (-1 == count) {
// deal with it
break;
}
remaining -= count;
offset += count;
}
Now decode it as UTF-8
String text = new String(textBytes, StandardCharsets.UTF_8);
and you are done.
Endianness will have to match for those first 4 bytes. One way of ensuring that is to use "network order" (big-endian). So:
C# Client
byte[] headerBytes = BitConverter.GetBytes(IPAddress.HostToNetworkOrder(stringToSend.Length));
Java Server
int length = ByteBuffer.wrap(headerBytes).order(ByteOrder.BIG_ENDIAN).getInt();
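The same steps can be sketched more compactly with DataInputStream, whose readInt() reads a big-endian int and whose readFully() keeps reading until the array is filled. This assumes the sender writes the length in network order as shown above, and that the length is the number of UTF-8 bytes (textBytes.Length), which differs from stringToSend.Length for non-ASCII text. The class name LengthPrefixedReader is hypothetical, and a ByteArrayInputStream stands in for the socket stream.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedReader {
    // Reads one message: a 4-byte big-endian length followed by that many UTF-8 bytes.
    static String readMessage(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        int length = in.readInt();          // big-endian, i.e. network order
        byte[] textBytes = new byte[length];
        in.readFully(textBytes);            // throws EOFException if the stream ends early
        return new String(textBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Simulated wire data: length 9, then the bytes of "Hello man"
        byte[] wire = {0, 0, 0, 9, 'H', 'e', 'l', 'l', 'o', ' ', 'm', 'a', 'n'};
        System.out.println(readMessage(new ByteArrayInputStream(wire))); // Hello man
    }
}
```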

At first glance it appears you have a problem with your indexes.
Your C# code is sending an integer converted to 4 bytes.
But your Java code is only reading a single byte as the length of the string.
The next 3 bytes sent from C# are the three zero bytes of your string length.
Your Java code is reading those 3 zero bytes and converting them to empty characters, which become the first 3 empty characters of your input[] array.
C# Client:
string stringToSend = "Hello man";
BinaryWriter writer = new BinaryWriter(mClientSocket.GetStream(),Encoding.UTF8);
//write number of bytes: the original code sent stringToSend.Length as 4 bytes. If your string is longer than 255 bytes, you'll need to send a wider data type, perhaps an integer converted to 4 bytes.
byte[] textBytes = System.Text.Encoding.UTF8.GetBytes(stringToSend);
mClientSocket.GetStream().WriteByte((byte)textBytes.Length);
//write the entire text buffer
writer.Write(textBytes, 0, textBytes.Length);
Java Server:
Charset utf8 = Charset.forName("UTF-8");
BufferedReader in = new BufferedReader(new InputStreamReader(clientSocket.getInputStream(), utf8));
while (true) {
//we read header first
// the original code was sending an integer as 4 bytes but was only reading a single char here.
int headerSize = in.read(); // reads one char; this matches the length byte only for values below 128
int bytesRead = 0;
char[] input = new char[headerSize];
// no need for a while statement here:
bytesRead = in.read(input, 0, headerSize);
// if you are going to use a while statement, process the input inside each
// iteration, because it will get overwritten by the next read.
String resString = new String(input, 0, bytesRead);
System.out.println(resString);
if (resString.equals("!$$$")) {
break;
}
}

Read file in java

I have a file on my computer with a .file extension, and I want to read it 9 characters at a time. I know that I can read a file with this code, but what should I do when my file is not .txt? Does Java support reading .file files with this code?
InputStream is = null;
InputStreamReader isr = null;
BufferedReader br = null;
is = new FileInputStream("c:/test.txt");
// create new input stream reader
isr = new InputStreamReader(is);
// create new buffered reader
br = new BufferedReader(isr);
// creates buffer
char[] cbuf = new char[is.available()];
for (int i = 0; i < 90000000; i += 9) {
// reads characters to buffer, offset i, len 9
br.read(cbuf, i, 9);}

The extension of a file is totally irrelevant. Extensions like .txt are mere conventions to help your operating system choose the right program when you open it.
So you can store text in any file (.txt, .file, .foobar if you are so inclined...), provided you know what kind of data it contains, and read it accordingly from your program.
So yes, Java can read .file files, and your code will work fine if that file contains text.
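As a sketch of reading text 9 characters at a time regardless of the extension: the helper below takes any Reader and a chunk size, so the file name only matters where the file is opened. The class name NineAtATime, the path c:/test.file and the UTF-8 charset are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class NineAtATime {
    // Splits the reader's content into strings of chunkSize chars; the last one may be shorter.
    static List<String> readChunks(Reader source, int chunkSize) throws IOException {
        List<String> chunks = new ArrayList<>();
        char[] cbuf = new char[chunkSize];
        try (BufferedReader br = new BufferedReader(source)) {
            int filled = 0;
            int n;
            // read() may return fewer chars than requested, so keep filling the buffer
            while ((n = br.read(cbuf, filled, chunkSize - filled)) != -1) {
                filled += n;
                if (filled == chunkSize) {
                    chunks.add(new String(cbuf));
                    filled = 0;
                }
            }
            if (filled > 0) {
                chunks.add(new String(cbuf, 0, filled));
            }
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical path and charset; any extension works the same way.
        Reader reader = Files.newBufferedReader(Paths.get("c:/test.file"), StandardCharsets.UTF_8);
        for (String chunk : readChunks(reader, 9)) {
            System.out.println(chunk);
        }
    }
}
```

Reusing one 9-char buffer also avoids sizing an array from is.available(), which reports bytes rather than characters and is not guaranteed to be the file length.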

does java support to read .file s with this code?
No, since c:/test.txt is hard coded. If it weren't, then yes, it would support it.

Yes it's possible if you write is = new FileInputStream("c:/test.file");

Yes, it reads any file you give it the same way. You can pass any file path with any extension to the FileInputStream constructor.

You can read any file you want, since a file is just a sequence of bytes. The extension is a hint about the format in which the bytes should be read, so when we have a .txt file we expect a file with sequences of characters.
When you have a file format called .file we know that it should be (according to you) a 9x9 set of characters. This way we know what to read and do that.
Since the .file format is characters I would say yes, you can read that with your code, for instance with this:
public String[] readFileFormat (final File file) throws IOException {
if (file.exists()) {
final String[] lines = new String[9];
final BufferedReader reader = new BufferedReader ( new FileReader( file ) );
for ( int i = 0; i < lines.length; i++ ) {
lines[i] = reader.readLine();
if (lines[i] == null || lines[i].isEmpty() || lines[i].length() < 9)
throw new RuntimeException ("Line is empty when it should be filled!");
else if (lines[i].length() > 9)
throw new RuntimeException ("Line does not have exactly 9 characters!");
}
reader.close();
return lines;
}
return null;
}

The extension is totally irrelevant, so it can be .file, .txt or whatever you want it to be.
Here is an example that uses BufferedInputStream to read a file of type .file. This is part of a larger guide that discusses 15 ways to read files in Java.
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
public class ReadFile_BufferedInputStream_Read {
public static void main(String [] pArgs) throws FileNotFoundException, IOException {
String fileName = "c:\\temp\\sample-10KB.file";
File file = new File(fileName);
FileInputStream fileInputStream = new FileInputStream(file);
try (BufferedInputStream bufferedInputStream = new BufferedInputStream(fileInputStream)) {
int singleCharInt;
char singleChar;
while((singleCharInt = bufferedInputStream.read()) != -1) {
singleChar = (char) singleCharInt;
System.out.print(singleChar);
}
}
}
}
