When I read from a file using readChar() in the RandomAccessFile class, I get unexpected output.
Instead of the desired character, ? is displayed.
package tesr;

import java.io.RandomAccessFile;
import java.io.IOException;

public class Test {
    public static void main(String[] args) {
        try{
            RandomAccessFile f=new RandomAccessFile("c:\\ankit\\1.txt","rw");
            f.seek(0);
            System.out.println(f.readChar());
        }
        catch(IOException e){
            System.out.println("dkndknf");
        }
        // TODO Auto-generated method stub
    }
}
You probably intended readByte. readChar() reads two bytes and combines them, high byte first, into one Java char, which is a UTF-16 code unit. On arbitrary data the resulting value is very often not a character you can display: it may be unassigned, or it may be half of a "surrogate" pair, i.e. one of the two chars that together form a single Unicode code point. When such a char cannot be converted to the console's encoding, Java prints a question mark.
If you know in what encoding the file is in, then for a single byte encoding it is simple:
byte b = in.readByte();
byte[] bs = new byte[] { b };
String s = new String(bs, "Cp1252"); // Some single byte encoding
For the variable multi-byte UTF-8 it is also simple to identify a sequence of bytes:
single byte when high bit = 0
otherwise a continuation byte when high bits 10
otherwise a starting byte (with some special cases) telling the number of bytes by its high bits.
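The three rules above can be sketched as a small classifier. This is only an illustration: the names Utf8Bytes, kindOf and sequenceLength are made up for this example, and the "special cases" (overlong forms, bytes that never occur in UTF-8, and so on) are deliberately not validated.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {
    enum Kind { SINGLE, CONTINUATION, LEAD }

    // Classify one byte by its high bits.
    static Kind kindOf(byte b) {
        int v = b & 0xFF;                                   // view the signed byte as 0..255
        if ((v & 0x80) == 0x00) return Kind.SINGLE;         // 0xxxxxxx
        if ((v & 0xC0) == 0x80) return Kind.CONTINUATION;   // 10xxxxxx
        return Kind.LEAD;                                   // 11xxxxxx
    }

    // Length of the sequence announced by a single or lead byte; -1 if the byte cannot start one.
    static int sequenceLength(byte b) {
        int v = b & 0xFF;
        if ((v & 0x80) == 0x00) return 1;   // 0xxxxxxx
        if ((v & 0xE0) == 0xC0) return 2;   // 110xxxxx
        if ((v & 0xF0) == 0xE0) return 3;   // 1110xxxx
        if ((v & 0xF8) == 0xF0) return 4;   // 11110xxx
        return -1;                          // continuation byte or invalid lead
    }

    public static void main(String[] args) {
        byte[] euro = "\u20AC".getBytes(StandardCharsets.UTF_8);   // E2 82 AC
        for (byte b : euro) {
            System.out.println(Integer.toHexString(b & 0xFF) + " " + kindOf(b));
        }
    }
}
```

After seeking to an arbitrary offset, one could skip forward while kindOf returns CONTINUATION to reach the start of the next character.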
For UTF-16LE and UTF-16BE the file position must be a multiple of 2, and each code unit is 2 bytes long.
byte[] bs = new byte[2];
in.readFully(bs); // read(bs) may return before both bytes have been read
String s = new String(bs, StandardCharsets.UTF_16LE);
You almost certainly have a character encoding problem. It is not possible to simply read characters from a file. What must be done is that an appropriate sequence of bytes is read, then those bytes are interpreted according to a character encoding scheme to translate them to a character. When you want to read a file as text, Java must be told, perhaps implicitly, which character encoding to use.
If you tell Java the wrong encoding you will get gibberish. If you pick an arbitrary point in a file and start reading, and that location is not the start of the encoding of a character, you will get gibberish. One or both of those has happened in your case.
I'm trying to write code that picks a word from a file according to an index entered by the user, but the readChar() method of the RandomAccessFile class is returning Japanese characters. I must admit that it's not the first time I've seen this on my Lenovo laptop: in some installation wizards I see normal characters mixed with Japanese characters. Do you think it comes from the laptop or rather from the code?
This is the code:
package com.project;

import java.io.*;
import java.util.StringTokenizer;

public class Main {
    public static void main(String[] args) throws IOException {
        int N, i=0;
        char C;
        char[] charArray = new char[100];
        String fileLocation = "file.txt";
        BufferedReader buffer = new BufferedReader(new InputStreamReader(System.in));
        do {
            System.out.println("enter the index of the word");
            N = Integer.parseInt(buffer.readLine());
            if (N!=0) {
                RandomAccessFile word = new RandomAccessFile(new File(fileLocation), "r");
                do {
                    word.seek((2*(N-1))+i);
                    C = word.readChar();
                    charArray[i] = C;
                    i++;
                }while(charArray[i-1] != ' ');
                System.out.println("the word of index " + N + " is: " );
                for (char carTemp : charArray )
                    System.out.print(carTemp);
                System.out.print("\n");
            }
        }while(N!=0);
        buffer.close();
    }
}
I get this output:
瑯潕啰灰灥敲牃䍡慳獥攨⠩⤍ഊੴ瑯潌䱯潷睥敲牃䍡慳獥攨⠩⤍ഊ捯潭浣捡慴琨⡓却瑲物楮湧朩⤍ഊ捨桡慲牁䅴琨⡩楮湴琩⤍ഊੳ獵畢扳獴瑲物楮湧木⠠獴瑡慲牴琠楮湤摥數砬Ⱐ敮湤搠楮湤摥數砩⤍ഊੴ瑲物業洨⠩Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 100 out of bounds for length 100
at Main.main(Main.java:21)
There are many things wrong, all of which have to do with fundamental misconceptions.
First off: A file on your disk - never mind the File interface in Java, or any other programming language; the file itself - does not and cannot store text. Ever. It stores bytes. That is, raw data, as (on every machine that's been relevant for decades, but historically there have been other ways to do it) quantified in bits, which are organized into groups of 8 that are called bytes.
Text is an abstraction; an interpretation of some particular sequence of byte values. It depends - fundamentally and unavoidably - on an encoding. Because this isn't a blog, I'll spare you the history lesson here, but suffice to say that Java's char type does not simply store a character of text. It stores an unsigned two-byte value, which may represent a character of text. Because there are more characters of text in Unicode than two bytes can represent, sometimes two adjacent chars in an array are required to represent a character of text. (And, of course, there is probably code out there that abuses the char type simply because someone wanted an unsigned equivalent of short. I may even have written some myself. That era is a blur for me.)
Anyway, the point is: using .readChar() is going to read two bytes from your file, and store them into a char within your char[], and the corresponding numeric value is not going to be anything like the one you wanted - unless your file happens to be encoded using the same encoding that Java uses natively, called UTF-16 (and, since readChar() takes the high byte first, in big-endian byte order).
You cannot properly read and interpret the file without knowing the file encoding. Full stop. You can at best delude yourself into believing that you can read it. You also cannot have "random access" to a text file - i.e., indexing according to a number of characters of text - unless the encoding in question is constant width. (Otherwise, of course, you can't just calculate the distance-in-bytes into the file where a given character of text is; it depends on how many bytes the previous characters took up, which depends on which characters they are.) Many text encodings are not constant width. One of the most popular, which frankly is the sane default recommendation for most tasks these days, is not. In which case you are simply out of luck for the problem you describe.
At any rate, once you know the encoding of your file, the expected way to retrieve a character of text from a file in Java is to use one of the Reader classes, such as InputStreamReader:
An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.
(Here, charset simply means an instance of the class that Java uses to represent text encodings.)
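As a rough sketch of that approach, the following picks the n-th whitespace-separated word by decoding the bytes through an InputStreamReader with an explicitly named charset. The class and method names (WordPicker, nthWord) are invented for this example, and a ByteArrayInputStream stands in for the real file so the snippet is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class WordPicker {
    // Returns the n-th whitespace-separated word (1-based), or null if there are fewer words.
    // The caller owns and closes the stream.
    static String nthWord(InputStream in, Charset charset, int n) {
        // The reader turns bytes into chars according to the given charset.
        Scanner words = new Scanner(new InputStreamReader(in, charset));
        String word = null;
        for (int i = 0; i < n; i++) {
            if (!words.hasNext()) return null;
            word = words.next();
        }
        return word;
    }

    public static void main(String[] args) {
        byte[] fileBytes = "toUpperCase() toLowerCase() trim()".getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(fileBytes);   // stand-in for a FileInputStream
        System.out.println(nthWord(in, StandardCharsets.UTF_8, 2));
    }
}
```

This reads sequentially from the start rather than seeking, which is the price of not assuming a constant-width encoding.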
You may be able to fudge your problem description a little bit: seek to a byte offset, and then grab the text characters starting at that offset. However, there is no guarantee that the "text characters starting at that offset" make any sense, or in fact can be decoded at all. If the offset happens to be in the middle of a multi-byte encoding for a character, the remaining part isn't necessarily valid encoded text.
char is 16 bits, i.e. 2 bytes.
seek seeks to a byte boundary.
If the file contains chars then they are at even offsets: 0, 2, 4...
The expression (2*(N-1))+i is even iff i is even; if i is odd, you are sure to land in the middle of a char, and thus read garbage.
i starts at zero, but you increment by 1, i.e., half a character.
Your seek argument should probably be (2*(N-1+i)).
Alternative explanation: your file does not contain chars at all; for example, you created an ASCII file in which a character is a single byte.
In that case, the error is attempting to read ASCII (an obsolete character encoding) with a readChar function.
But if the file contains ASCII, the purpose of multiplying by 2 in the seek argument is obscure. It apparently serves no useful purpose.
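To illustrate the second explanation, here is a small sketch of what happens when single-byte ASCII text is read two bytes at a time while the offset advances by one. ByteBuffer.getChar is used here as a stand-in for seek plus readChar (both combine two bytes with the high byte first); the class name MisreadDemo and the sample text "toUp" are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class MisreadDemo {
    // Combine two bytes the way readChar() does: the first byte becomes the high half.
    static char charAt(byte[] data, int byteOffset) {
        return ByteBuffer.wrap(data).getChar(byteOffset);   // ByteBuffer is big-endian by default
    }

    public static void main(String[] args) {
        byte[] ascii = "toUp".getBytes(StandardCharsets.US_ASCII);   // 74 6F 55 70
        for (int offset = 0; offset + 1 < ascii.length; offset++) {
            char c = charAt(ascii, offset);
            System.out.printf("offset %d -> U+%04X%n", offset, (int) c);
        }
    }
}
```

The values produced are U+746F, U+6F55 and U+5570, all of which fall in the CJK ideograph range. That would explain why plain ASCII text shows up as East Asian characters, with every letter appearing in two consecutive chars.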
I changed the encoding of the file to UTF-16 and modified the program to display the right indexes, those that represent the beginning of each word. Now it works fine. Thank you, guys.
import java.io.*;

public class Main {
    public static void main(String[] args) throws IOException {
        int N, i=0, j=0, k=0;
        char C;
        char[] charArray = new char[100];
        String fileLocation = "file.txt";
        BufferedReader buffer = new BufferedReader(new InputStreamReader(System.in));
        DataInputStream in = new DataInputStream(new FileInputStream(fileLocation));
        boolean EOF=false;
        do {
            try {
                j++;
                C = in.readChar();
                if((C==' ')||(C=='\n')){
                    System.out.print(j+1+"\t");
                }
            }catch (IOException e){
                EOF=true;
            }
        }while (EOF!=true);
        System.out.println("\n");
        do {
            System.out.println("enter the index of the word");
            N = Integer.parseInt(buffer.readLine());
            if (N!=0) {
                RandomAccessFile word = new RandomAccessFile(new File(fileLocation), "r");
                do {
                    word.seek((2*(N-1+i)));
                    C = word.readChar();
                    charArray[i] = C;
                    i++;
                }while(charArray[i-1] != ' ' && charArray[i-1] != '\n');
                System.out.print("the word of index " + N + " is: " );
                for (char carTemp : charArray )
                    System.out.print(carTemp);
                System.out.print("\n");
                i=0;
                charArray = new char[100];
            }
        }while(N!=0);
        buffer.close();
    }
}
I'm developing a JPEG decoder (I'm in the Huffman phase) and I want to write binary strings into a file.
For example, let's say we've this:
String huff = "00010010100010101000100100";
I've tried to convert it to integers by splitting it into groups of 8 and saving each group's integer representation, as I can't write individual bits:
for (String str : huff.split("(?<=\\G.{8})")) {
    int val = Integer.parseInt(str, 2);
    out.write(val); // writes to a FileOutputStream
}
The problem is that, in my example, if I try to save "00010010" it converts it to 18 (10010), and I need the 0's.
And finally, when I read:
int enter;
String code = "";
while((enter =in.read())!=-1) {
    code+=Integer.toBinaryString(enter);
}
I got:
Code = 10010
instead of:
Code = 00010010
I've also tried to convert it to a BitSet and then to a byte[], but I have the same problem.
Your example is that you have the string "10010" and you want the string "00010010". That is, you need to left-pad this string with zeroes. Note that since you're joining the results of many calls to Integer.toBinaryString in a loop, you need to left-pad these strings inside the loop, before concatenating them.
while((enter = in.read()) != -1) {
    String binary = Integer.toBinaryString(enter);
    // left-pad to length 8
    binary = ("00000000" + binary).substring(binary.length());
    code += binary;
}
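Padding fixes the leading zeros, but the example string has 26 bits, which is not a whole number of bytes, so the last group needs care too. A possible sketch, assuming the bit count is stored somewhere alongside the bytes (for instance in a header you define yourself): pack the bits most significant first, pad the final byte with zeros on the right, and use the stored count when reading back. BitStringCodec, pack and unpack are names invented for this example.

```java
class BitStringCodec {
    // Pack '0'/'1' characters into bytes, most significant bit first.
    // The last byte is padded with zero bits on the right if needed.
    static byte[] pack(String bits) {
        byte[] out = new byte[(bits.length() + 7) / 8];
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                out[i / 8] |= (byte) (0x80 >>> (i % 8));
            }
        }
        return out;
    }

    // Rebuild the bit string. bitCount says how many bits are real,
    // so the padding is dropped and leading zeros are kept.
    static String unpack(byte[] packed, int bitCount) {
        StringBuilder sb = new StringBuilder(bitCount);
        for (int i = 0; i < bitCount; i++) {
            int bit = (packed[i / 8] >> (7 - i % 8)) & 1;
            sb.append(bit == 1 ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String huff = "00010010100010101000100100";
        byte[] packed = pack(huff);
        System.out.println(packed.length);                                // 4 bytes hold the 26 bits
        System.out.println(unpack(packed, huff.length()).equals(huff));   // round trip succeeds
    }
}
```

The packed array can be written with a plain FileOutputStream; only the bit count has to travel with it.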
You might want to look at the UTF-8 algorithm, since it does exactly what you want. It stores massive amounts of data while discarding zeros, keeping relevant data, and encoding it to take up less disk space.
Works with: Java version 7+
import java.nio.charset.StandardCharsets;
import java.util.Formatter;

public class UTF8EncodeDecode {

    public static byte[] utf8encode(int codepoint) {
        return new String(new int[]{codepoint}, 0, 1).getBytes(StandardCharsets.UTF_8);
    }

    public static int utf8decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.printf("%-7s %-43s %7s\t%s\t%7s%n",
                "Char", "Name", "Unicode", "UTF-8 encoded", "Decoded");

        for (int codepoint : new int[]{0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E}) {
            byte[] encoded = utf8encode(codepoint);
            Formatter formatter = new Formatter();
            for (byte b : encoded) {
                formatter.format("%02X ", b);
            }
            String encodedHex = formatter.toString();
            int decoded = utf8decode(encoded);
            System.out.printf("%-7c %-43s U+%04X\t%-12s\tU+%04X%n",
                    codepoint, Character.getName(codepoint), codepoint, encodedHex, decoded);
        }
    }
}
https://rosettacode.org/wiki/UTF-8_encode_and_decode#Java
UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes. The encoding is defined by the Unicode Standard, and was originally designed by Ken Thompson and Rob Pike. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.
It was designed for backward compatibility with ASCII. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as "/" (slash) in filenames, "\" (backslash) in escape sequences, and "%" in printf.
https://en.wikipedia.org/wiki/UTF-8
Binary 11110000 10010000 10001101 10001000 becomes F0 90 8D 88 in UTF-8. Since you are storing it as text, you go from having to store 32 characters to storing 8. And because it's a well known and well designed encoding, you can reverse it easily. All the math is done for you.
Your example of 00010010100010101000100100 (or rather 00000001 0010100 0101010 00100100) converts to *$ (two unprintable characters on my machine). That's the UTF-8 encoding of the binary. I had mistakenly used a different site that was using the data I put in as decimal instead of binary.
https://onlineutf8tools.com/convert-binary-to-utf8
For a really good explanation of UTF-8 and how it can apply to the answer:
https://hackaday.com/2013/09/27/utf-8-the-most-elegant-hack/
Edit:
I took this question as a way to reduce the amount of characters needed to store values, which is a type of encoding. UTF-8 is a type of encoding. Used in a "non-standard" way, the OP can use UTF-8 to encode their strings of 0's & 1's in a much shorter format. That's how this answer is relevant.
If you concatenate the characters, you can go from 4x 8 bits (32 bits) to 8x 8 bits (64 bits) easily and encode a value as large as 9,223,372,036,854,775,807.
I use a RandomAccessFile object to read a UTF-8 French file. I use the readLine method.
My Groovy code below:
while ((line = randomAccess.readLine())) {
    def utfLine = new String(line.getBytes('UTF-8'), 'UTF-8')
    ++count
    long nextRecordPos = randomAccess.getFilePointer()
    compareNextRecords(utfLine, randomAccess)
    randomAccess.seek(nextRecordPos)
}
My problem is that utfLine and line are the same: the accented characters stay like Ã© instead of é. No conversion is done.
First of all, this line of code does absolutely nothing. The data is the same. Remove it:
def utfLine = new String(line.getBytes('UTF-8'), 'UTF-8')
According to the Javadoc, RandomAccessFile.readLine() is not aware of character encodings. It reads bytes until it encounters "\r" or "\n" or "\r\n". ASCII byte values are put into the returned string in the normal way. But byte values between 128 and 255 are put into the string literally, without interpreting them as being in a character encoding (or you could say this is the raw/verbatim encoding).
There is no method or constructor to set the character encoding in a RandomAccessFile. But it's still valuable to use readLine() because it takes care of parsing for a newline sequence and allocating memory.
The easiest solution in your situation is to manually convert the fake "line" into bytes by reversing what readLine() did, then decode the bytes into a real string with awareness of character encoding. I don't know how to write code in Groovy, so I'll give the answer in Java:
String fakeLine = randomAccess.readLine();
byte[] bytes = new byte[fakeLine.length()];
for (int i = 0; i < fakeLine.length(); i++)
    bytes[i] = (byte)fakeLine.charAt(i);
String realLine = new String(bytes, "UTF-8");
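The same reversal can also be written without the explicit loop, because copying each char's low 8 bits into a byte is what encoding to ISO-8859-1 does for chars in the range 0 to 255. A self-contained sketch, with a hard-coded string standing in for what readLine() would return (ReadLineFix and redecode are made-up names):

```java
import java.nio.charset.StandardCharsets;

class ReadLineFix {
    // Undo readLine()'s byte-to-char widening, then decode the bytes as UTF-8.
    static String redecode(String fakeLine) {
        byte[] bytes = fakeLine.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "é" in UTF-8 is the bytes C3 A9; readLine() would hand back the chars U+00C3 U+00A9.
        String fakeLine = "caf\u00C3\u00A9";
        String realLine = redecode(fakeLine);
        System.out.println(realLine.length());            // 4 chars instead of 5
        System.out.println(realLine.equals("caf\u00E9")); // the last char is now U+00E9
    }
}
```

This only works on strings that really came from readLine(), where every char is at most 0xFF.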
I'm writing a web application in Google App Engine. It basically allows people to edit HTML code that gets stored as an .html file in the blobstore.
I'm using fetchData to return a byte[] of all the characters in the file. I'm trying to print it to an HTML page so that the user can edit the HTML code. Everything works great!
Here's my only problem now:
The byte array is having some issues when converting back to a string. Smart quotes and a couple of other characters are coming out looking funky (?'s, Japanese symbols, etc.). Specifically, I'm seeing several bytes with negative values, and those are causing the problem.
The smart quotes are coming back as -108 and -109 in the byte array. Why is this and how can I decode the negative bytes to show the correct character encoding?
The byte array contains characters in a special encoding (that you should know). The way to convert it to a String is:
String decoded = new String(bytes, "UTF-8"); // example for one encoding type
By the way, the raw bytes may appear as negative decimals just because the Java datatype byte is signed; it covers the range from -128 to 127.
-109 = 0x93: Control Code "Set Transmit State"
The value (-109) is a non-printable control character in Unicode. So UTF-8 is not the correct encoding for that character stream.
0x93 in "Windows-1252" is the "smart quote" that you're looking for; the Java name of that encoding is "Cp1252". The next line provides test code:
System.out.println(new String(new byte[]{-109}, "Cp1252"));
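A slightly longer sketch covering both bytes from the question, printing code points rather than glyphs so the result does not depend on the console's encoding. It assumes the windows-1252 charset is available, which is the case on the usual desktop and server JDKs but is not guaranteed by the specification; SmartQuotes and decodeCp1252 are names made up for the example.

```java
import java.nio.charset.Charset;

class SmartQuotes {
    static String decodeCp1252(byte[] bytes) {
        // Assumption: this JRE ships the windows-1252 charset.
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        byte[] quotes = { -109, -108 };
        for (byte b : quotes) {
            // Masking with 0xFF shows the unsigned value behind the negative byte.
            System.out.printf("%d -> 0x%02X%n", b, b & 0xFF);
        }
        String s = decodeCp1252(quotes);
        System.out.printf("U+%04X U+%04X%n", (int) s.charAt(0), (int) s.charAt(1));
    }
}
```

The bytes come out as 0x93 and 0x94, which decode to U+201C and U+201D, the left and right double quotation marks.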
Java 7 and above
You can also pass your desired encoding to the String constructor as a Charset constant from StandardCharsets. This may be safer than passing the encoding as a String, as suggested in the other answers.
For example, for UTF-8 encoding
String bytesAsString = new String(bytes, StandardCharsets.UTF_8);
You can try this.
String s = new String(bytearray);
public class Main {

    /**
     * Example method for converting a byte to a String.
     */
    public void convertByteToString() {
        byte b = 65;
        //Using the static toString method of the Byte class
        System.out.println(Byte.toString(b));
        //Using simple concatenation with an empty String
        System.out.println(b + "");
        //Creating a byte array and passing it to the String constructor
        System.out.println(new String(new byte[] {b}));
    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        new Main().convertByteToString();
    }
}
Output
65
65
A
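public static String readFile(String fn) throws IOException
{
    File f = new File(fn);
    byte[] buffer = new byte[(int)f.length()];
    // try-with-resources closes the stream even if a read fails
    try (FileInputStream is = new FileInputStream(f)) {
        int off = 0;
        // a single read() call may return fewer bytes than requested, so loop until full
        while (off < buffer.length) {
            int n = is.read(buffer, off, buffer.length - off);
            if (n < 0) break; // file ended early
            off += n;
        }
    }
    return new String(buffer, "UTF-8"); // use desired encoding
}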
I suggest Arrays.toString(byte_array);
It depends on your purpose. For example, I wanted to save a byte array in exactly the format you see in the debugger, something like [1, 2, 3]. If you want to save those exact values without converting the bytes to characters, Arrays.toString(byte_array) does this. But if you want to save characters instead of bytes, you should use String s = new String(byte_array). In this case, s holds the characters that the bytes [1, 2, 3] decode to.
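A short side-by-side sketch of the two options, using bytes that decode to visible letters. The names BytesAsText, asNumbers and asText are invented for the example, and the charset is passed explicitly so the result does not depend on the platform default.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BytesAsText {
    // The numeric values, formatted the way a debugger shows them.
    static String asNumbers(byte[] data) {
        return Arrays.toString(data);
    }

    // The characters the bytes stand for in the given encoding (US-ASCII here).
    static String asText(byte[] data) {
        return new String(data, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] data = { 72, 105 };
        System.out.println(asNumbers(data));   // [72, 105]
        System.out.println(asText(data));      // Hi
    }
}
```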
The previous answer from Andreas_D is good. I'm just going to add that wherever you are displaying the output there will be a font and a character encoding and it may not support some characters.
To work out whether it is Java or your display that is a problem, do this:
for(int i=0;i<str.length();i++) {
    char ch = str.charAt(i);
    System.out.println(i+" : "+ch+" "+Integer.toHexString(ch)+((ch=='\ufffd') ? " Unknown character" : ""));
}
Java will have mapped any bytes it cannot decode to U+FFFD (0xfffd), the official replacement character for unknown characters. If you see a '?' in the output but the character is not 0xfffd, it is your display font or encoding that is the problem, not Java.
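Both failure modes can be reproduced in a few lines. This is a sketch with made-up names (ReplacementDemo, hasReplacementChar): decoding a byte that is not valid UTF-8 yields U+FFFD, while encoding a character that the target charset lacks yields the byte for '?'.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    static boolean hasReplacementChar(String s) {
        return s.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // Decoding problem: 0x93 on its own is not a valid UTF-8 sequence.
        String decoded = new String(new byte[] { (byte) 0x93 }, StandardCharsets.UTF_8);
        System.out.println(hasReplacementChar(decoded));   // true

        // Display-side problem: U+201C has no US-ASCII byte, so the encoder substitutes '?'.
        byte[] encoded = "\u201C".getBytes(StandardCharsets.US_ASCII);
        System.out.println((char) encoded[0]);             // ?
    }
}
```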
I have an InputStream and I want to read each char until I find a comma "," from a socket.
Here's my code:
private static Packet readPacket(InputStream is) throws Exception
{
    int ch;
    Packet p = new Packet();
    String type = "";
    while((ch = is.read()) != 44) //44 is the "," in ISO-8859-1 codification
    {
        if(ch == -1)
            throw new IOException("EOF");
        type += new String(ch, "ISO-8859-1"); //<----DOES NOT COMPILE
    }
    ...
}
The String constructor does not accept an int, only an array of bytes. I read the documentation and it says:
read():
Reads the next byte of data from the input stream.
How can I convert this int to a byte, then? Does it use only the least significant 8 bits of the int's 32 bits?
Since I'm working with Java, I want to keep it fully platform compatible (little endian vs. big endian, etc.). What's the best approach here, and why?
PS: I don't want to use any ready-to-use classes like DataInputStream, etc.
The String constructor takes a byte[] (an array) together with the name of the encoding:
type += new String(new byte[] { (byte) ch }, "ISO-8859-1");
Btw. it would be more elegant to use a StringBuilder for type and make use of its append methods. It's faster and also shows the intent better:
private static Packet readPacket(InputStream is) throws Exception {
    int ch;
    Packet p = new Packet();
    StringBuilder type = new StringBuilder();
    while((ch = is.read()) != 44) {
        if(ch == -1)
            throw new IOException("EOF");
        // NOTE: conversion from byte to char here is iffy, this works for ISO8859-1/US-ASCII
        // but fails horribly for UTF etc.
        type.append((char) ch);
    }
    String data = type.toString();
    ...
}
Also, to make it more flexible (e.g. work with other character encodings), your method would better take an InputStreamReader that handles the conversion from bytes to characters for you (take a look at the InputStreamReader(InputStream, Charset) constructor's javadoc).
For this you can use an InputStreamReader, which can read encoded character data from a raw byte stream:
InputStreamReader reader = new InputStreamReader(is, "ISO-8859-1");
You may now use reader.read(), which will consume the correct number of bytes from is, decode them as ISO-8859-1, and return a UTF-16 code unit (or -1 at end of stream) that can be correctly cast to a char.
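Put together, reading one comma-terminated field through a Reader might look like the sketch below. FieldReader and readField are names invented for this example, and a ByteArrayInputStream stands in for the socket's input stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class FieldReader {
    // Reads decoded characters up to (not including) the next comma.
    static String readField(Reader reader) throws IOException {
        StringBuilder field = new StringBuilder();
        int ch;
        while ((ch = reader.read()) != ',') {
            if (ch == -1) throw new IOException("EOF before ','");
            field.append((char) ch);   // safe: read() already returned a decoded char value
        }
        return field.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = "LOGIN,payload".getBytes(StandardCharsets.ISO_8859_1);
        InputStream is = new ByteArrayInputStream(wire);   // stand-in for socket.getInputStream()
        Reader reader = new InputStreamReader(is, StandardCharsets.ISO_8859_1);
        System.out.println(readField(reader));             // LOGIN
    }
}
```

One caveat with this design: InputStreamReader may read ahead and buffer more bytes than it has returned, so the rest of the packet should be read through the same reader rather than directly from is.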
Edit: Responding to comment about not using any "ready-to-use" classes:
I don't know if InputStreamReader counts. If it does, check out Durandal's answer, which is sufficient for certain single byte encodings (like US-ASCII, arguably, or ISO-8859-1).
For multibyte encodings, if you do not want to use any other classes, you would first buffer all data into a byte[] array, then construct a String from that.
Edit: Responding to a related question in the comments on Abhishek's answer.
Q:
Abhishek wrote: Can you please enlighten me a little more? i have tried casting integer ASCII to character..it has worked..can you kindly tell where did i go wrong?
A:
You didn't go "wrong", per se. The reason ASCII works is the same reason that Brian pointed out that ISO-8859-1 works. US-ASCII is a single byte encoding, and bytes 0x00-0x7f have the same value as their corresponding Unicode code points. So a cast to char is conceptually incorrect, but in practice, since the values are the same, it works. Same with ISO-8859-1; bytes 0x00-0xff have the same value as their corresponding code points in that encoding. A cast to char would not work in e.g. IBM01141 (a single byte encoding but with different values).
And, of course, a single byte to char cast would not work for multibyte encodings like UTF-16, as more than one input byte must be read (a variable number, in fact) to determine the correct value of a corresponding char.
type += new String(String.valueOf((char) ch).getBytes("ISO-8859-1"), "ISO-8859-1"); // cast first: String.valueOf(int) would give the decimal digits
Partial answer: Try replacing :
type += new String(ch, "ISO-8859-1");
by
type+=(char)ch;
This works if the value you receive is the ASCII code of the character; the cast converts that code into a char.
It's better to avoid lengthy code, and this works just fine. The read() method comes in several forms:
One is: int b = inpstr.read();
Another is: inpstr.read(byte[])
So it's up to you which one to use; they serve different purposes.