RandomAccessFile and UTF-8 line - java

I use a RandomAccessFile object to read a UTF-8 French file, using the readLine method.
My Groovy code below:
while ((line = randomAccess.readLine())) {
def utfLine = new String(line.getBytes('UTF-8'), 'UTF-8')
++count
long nextRecordPos = randomAccess.getFilePointer()
compareNextRecords(utfLine, randomAccess)
randomAccess.seek(nextRecordPos)
}
My problem is that utfLine and line are the same: the accented characters stay as Ã© instead of é. No conversion is done.

First of all, this line of code does absolutely nothing. The data is the same. Remove it:
def utfLine = new String(line.getBytes('UTF-8'), 'UTF-8')
According to the Javadoc, RandomAccessFile.readLine() is not aware of character encodings. It reads bytes until it encounters "\r" or "\n" or "\r\n". ASCII byte values are put into the returned string in the normal way. But byte values between 128 and 255 are put into the string literally without interpreting it as being in a character encoding (or you could say this is the raw/verbatim encoding).
There is no method or constructor to set the character encoding in a RandomAccessFile. But it's still valuable to use readLine() because it takes care of parsing for a newline sequence and allocating memory.
The easiest solution in your situation is to manually convert the fake "line" into bytes by reversing what readLine() did, then decode the bytes into a real string with awareness of character encoding. I don't know how to write code in Groovy, so I'll give the answer in Java:
String fakeLine = randomAccess.readLine();
byte[] bytes = new byte[fakeLine.length()];
for (int i = 0; i < fakeLine.length(); i++)
bytes[i] = (byte)fakeLine.charAt(i);
String realLine = new String(bytes, "UTF-8");
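The same repair can be written without the explicit loop, because ISO-8859-1 maps chars 0 to 255 onto the identical byte values. The following is a minimal sketch, assuming the file really is UTF-8; the class name ReadLineUtf8 and the helper fixLine are hypothetical names used only for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class ReadLineUtf8 {
    // Hypothetical helper: undo readLine()'s byte-to-char widening, then decode as UTF-8.
    static String fixLine(String fakeLine) {
        byte[] raw = fakeLine.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 file containing "café" so the demo is self-contained.
        File tmp = File.createTempFile("utf8-demo", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "caf\u00e9\n".getBytes(StandardCharsets.UTF_8));

        try (RandomAccessFile raf = new RandomAccessFile(tmp, "r")) {
            String fakeLine = raf.readLine();     // one char per byte: 5 chars
            String realLine = fixLine(fakeLine);  // properly decoded: 4 chars
            System.out.println(fakeLine.length() + " -> " + realLine.length()); // prints 5 -> 4
        }
    }
}
```

Because the file pointer is still counted in bytes, getFilePointer() and seek() keep working as before; only the text of each line is re-decoded.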

Related

Java NIO server receives random string [duplicate]

I'm writing a web application in Google app Engine. It allows people to basically edit html code that gets stored as an .html file in the blobstore.
I'm using fetchData to return a byte[] of all the characters in the file. I'm trying to print to an html in order for the user to edit the html code. Everything works great!
Here's my only problem now:
The byte array is having some issues when converting back to a string. Smart quotes and a couple of other characters are coming out looking funky (?'s, Japanese symbols, etc.). Specifically, I'm seeing several bytes with negative values, and those are causing the problem.
The smart quotes are coming back as -108 and -109 in the byte array. Why is this and how can I decode the negative bytes to show the correct character encoding?
The byte array contains characters in a special encoding (that you should know). The way to convert it to a String is:
String decoded = new String(bytes, "UTF-8"); // example for one encoding type
By the way, the raw bytes may appear as negative decimals just because the Java datatype byte is signed; it covers the range from -128 to 127.
-109 = 0x93: Control Code "Set Transmit State"
The value (-109) corresponds to a non-printable control character (U+0093) in Unicode, and a lone 0x93 byte is not a valid UTF-8 sequence either. So UTF-8 is not the correct encoding for that character stream.
0x93 in "Windows-1252" is the "smart quote" that you're looking for, and the Java name of that encoding is "Cp1252". The next line is a quick test:
System.out.println(new String(new byte[]{-109}, "Cp1252"));
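To make the signed-byte point concrete, here is a small sketch that converts the two negative values from the question to their unsigned form and decodes them. It assumes the runtime provides the windows-1252 charset (it is not one of the charsets every JVM is required to support); the class and method names are made up for the example.

```java
import java.nio.charset.Charset;

public class SignedByteDemo {
    // Mask off the sign extension to get the unsigned value 0..255.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    // Assumption: "windows-1252" is available on this JVM.
    static String decodeCp1252(byte[] bytes) {
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(toUnsigned((byte) -109))); // prints 93
        System.out.println(Integer.toHexString(toUnsigned((byte) -108))); // prints 94

        String quotes = decodeCp1252(new byte[] {-109, -108});
        // 0x93 and 0x94 decode to the left and right double quotation marks.
        System.out.println(quotes.charAt(0) == '\u201C'); // prints true
        System.out.println(quotes.charAt(1) == '\u201D'); // prints true
    }
}
```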
Java 7 and above
You can also pass your desired encoding to the String constructor as a Charset constant from StandardCharsets. This may be safer than passing the encoding as a String, as suggested in the other answers.
For example, for UTF-8 encoding
String bytesAsString = new String(bytes, StandardCharsets.UTF_8);
You can try this.
String s = new String(bytearray);
public class Main {
/**
* Example method for converting a byte to a String.
*/
public void convertByteToString() {
byte b = 65;
//Using the static toString method of the Byte class
System.out.println(Byte.toString(b));
//Using simple concatenation with an empty String
System.out.println(b + "");
//Creating a byte array and passing it to the String constructor
System.out.println(new String(new byte[] {b}));
}
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
new Main().convertByteToString();
}
}
Output
65
65
A
public static String readFile(String fn) throws IOException
{
    // Requires java.nio.file.Files and java.nio.file.Paths.
    // readAllBytes reads the whole file, whereas a single InputStream.read(buffer) call may stop early.
    byte[] buffer = Files.readAllBytes(Paths.get(fn));
    return new String(buffer, "UTF-8"); // use desired encoding
}
I suggest Arrays.toString(byte_array);
It depends on your purpose. For example, I wanted to save a byte array exactly in the format you see while debugging, something like [1, 2, 3]. If you want to save those values without converting the bytes to characters, Arrays.toString(byte_array) does this. But if you want to save characters instead of bytes, you should use String s = new String(byte_array). In that case, s contains the characters whose byte values are 1, 2 and 3.
The previous answer from Andreas_D is good. I'm just going to add that wherever you are displaying the output there will be a font and a character encoding and it may not support some characters.
To work out whether it is Java or your display that is a problem, do this:
for(int i=0;i<str.length();i++) {
char ch = str.charAt(i);
System.out.println(i+" : "+ch+" "+Integer.toHexString(ch)+((ch=='\ufffd') ? " Unknown character" : ""));
}
Java will have mapped any characters it cannot understand to 0xfffd the official character for unknown characters. If you see a '?' in the output, but it is not mapped to 0xfffd, it is your display font or encoding that is the problem, not Java.
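The 0xfffd mapping is easy to reproduce. The sketch below decodes the single byte 0x93 first with the wrong charset (UTF-8) and then with ISO-8859-1; the class and helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // True if decoding produced U+FFFD, the Unicode replacement character.
    static boolean hasReplacementChar(String s) {
        return s.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0x93 };

        // A lone 0x93 is malformed UTF-8, so the String constructor substitutes U+FFFD.
        String wrong = new String(data, StandardCharsets.UTF_8);
        System.out.println(hasReplacementChar(wrong)); // prints true

        // ISO-8859-1 maps every byte to a character, so nothing is replaced.
        String mapped = new String(data, StandardCharsets.ISO_8859_1);
        System.out.println(hasReplacementChar(mapped)); // prints false
    }
}
```

If hasReplacementChar returns true, the damage happened while decoding in Java; if it returns false and the output still looks wrong, the display side is the more likely culprit.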

Why aren't these two strings equal?

I am sending a packet through UDP and for some reason I can't compare the string I extract from the packet and the string I create even though the values are the same when I print them (no trailing white spaces).
byte[] incoming = new byte[1000];
DatagramPacket request = new DatagramPacket(incoming, incoming.length);
serverSocket.receive(request);
String str = new String(request.getData());
String str2 = new String("message received");
if(str.equals(str2))
{
System.out.println("equal");
}
Is there any reason for this?
This occurs because new String(request.getData()) does not return "message received".
The problem is [likely] due to the fact that new String(byte[]) attempts to use all (1000 of) the bytes supplied, in the default encoding, so the result ends with a bunch of NUL ('\0') characters appended to the actual string content, making it not equal to the literal. This can easily be seen in a debugger, although such NUL characters are often "lost" when displaying as normal text, as with println.
Trivially: "hello".equals("hello\0") is false.
Several solutions include:
1. Frame the string, such as prefixing the sent data with the number of bytes that make up the string, and then using a String constructor that takes a limit/length, or;
2. Prevent any trailing 0 from being processed, again by specifying the limit to decode, or;
3. Remove any NUL characters after decoding the data.
Since option #3 is easy [1] (until it can be fixed to use #1/#2), consider:
String str = new String(request.getData(), "UTF-8"); // Specify an encoding!
int nul = str.indexOf('\0');
if (nul > -1) {
str = str.substring(0, nul);
}
[1] While trimming is the easiest, it is not generally appropriate. The biggest problem with #3 over #2 is that it first decodes all the bytes and then filters the characters. Under different encodings (although ASCII and UTF-8 should be "safe"), this may result in non-NUL garbage after the actual string content, depending upon what exists in the buffer.
Also, specify an encoding manually to new String(byte[] ..) or String.getBytes(..). Otherwise the "default encoding" will be used, which can cause problems if the different systems are using a different default.
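Option #2 can be sketched with the length that receive() stores in the packet. The example below avoids real networking: it fills a 1000-byte buffer by hand and calls setLength to imitate what receive() would do. The class name and the decode helper are hypothetical, and UTF-8 is an assumed wire encoding.

```java
import java.net.DatagramPacket;
import java.nio.charset.StandardCharsets;

public class PacketDecodeDemo {
    // Decode only the bytes that belong to the received datagram.
    static String decode(DatagramPacket packet) {
        return new String(packet.getData(), packet.getOffset(), packet.getLength(),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] incoming = new byte[1000];
        byte[] payload = "message received".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(payload, 0, incoming, 0, payload.length);

        DatagramPacket request = new DatagramPacket(incoming, incoming.length);
        request.setLength(payload.length); // stands in for what receive() sets

        // Decoding the whole buffer keeps the trailing NUL characters.
        String padded = new String(request.getData(), StandardCharsets.UTF_8);
        System.out.println(padded.equals("message received"));          // prints false
        System.out.println(decode(request).equals("message received")); // prints true
    }
}
```

When reusing one DatagramPacket for several receive() calls, the length should be reset to the buffer size before each call, since receive() shrinks it to the size of the last message.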

Get bytes from the int returned from socket InputStream read()

I have an InputStream and I want to read each char until I find a comma "," from a socket.
Here's my code
private static Packet readPacket(InputStream is) throws Exception
{
int ch;
Packet p = new Packet();
String type = "";
while((ch = is.read()) != 44) //44 is the "," in ISO-8859-1 codification
{
if(ch == -1)
throw new IOException("EOF");
type += new String(ch, "ISO-8859-1"); //<----DOES NOT COMPILE
}
...
}
The String constructor does not accept an int, only an array of bytes. I read the documentation and it says
read():
Reads the next byte of data from the input stream.
How can I convert this int to a byte, then? Does it use only the least significant 8 bits of the int's 32 bits?
Since I'm working with Java, I want to keep it fully platform compatible (little endian vs big endian, etc.). What's the best approach here, and why?
PS: I don't want to use any ready-to-use classes like DataInputStream, etc.
The String constructor takes a byte[] (an array), not a single value:
type += new String(new byte[] { (byte) ch }, "ISO-8859-1");
By the way, it would be more elegant to use a StringBuilder for type and make use of its append methods. It's faster and also shows the intent better:
private static Packet readPacket(InputStream is) throws Exception {
int ch;
Packet p = new Packet();
StringBuilder type = new StringBuilder();
while((ch = is.read()) != 44) {
if(ch == -1)
throw new IOException("EOF");
// NOTE: conversion from byte to char here is iffy, this works for ISO8859-1/US-ASCII
// but fails horribly for UTF etc.
type.append((char) ch);
}
String data = type.toString();
...
}
Also, to make it more flexible (e.g. work with other character encodings), your method would better take an InputStreamReader that handles the conversion from bytes to characters for you (take a look at the javadoc of the InputStreamReader(InputStream, Charset) constructor).
For this you can use an InputStreamReader, which can read encoded character data from a raw byte stream:
InputStreamReader reader = new InputStreamReader(is, "ISO-8859-1");
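You may now use reader.read(), which decodes the bytes from is as ISO-8859-1 and returns each character as an int (a UTF-16 code unit, or -1 at end of stream) that can be correctly cast to a char. Note that the reader may read more bytes ahead from is than it has returned as characters, so keep reading through the reader rather than going back to is.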
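Put together, reading up to the comma through a Reader might look like the sketch below. It uses a ByteArrayInputStream in place of the socket stream and UTF-8 as an example charset; the class name ReadUntilComma and the method readField are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadUntilComma {
    // Read decoded characters until the first comma; the comma itself is consumed.
    static String readField(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int ch;
        while ((ch = reader.read()) != ',') {
            if (ch == -1) {
                throw new IOException("EOF");
            }
            sb.append((char) ch); // safe: read() returns an already decoded char
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the socket's InputStream; "é" takes two bytes in UTF-8.
        InputStream is = new ByteArrayInputStream(
                "h\u00e9llo,rest".getBytes(StandardCharsets.UTF_8));
        Reader reader = new InputStreamReader(is, StandardCharsets.UTF_8);

        String type = readField(reader);
        System.out.println(type.length());     // prints 5
        System.out.println(readField(new InputStreamReader(
                new ByteArrayInputStream("a,b".getBytes(StandardCharsets.UTF_8)),
                StandardCharsets.UTF_8)));     // prints a
    }
}
```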
Edit: Responding to comment about not using any "ready-to-use" classes:
I don't know if InputStreamReader counts. If it does, check out Durandal's answer, which is sufficient for certain single byte encodings (like US-ASCII, arguably, or ISO-8859-1).
For multibyte encodings, if you do not want to use any other classes, you would first buffer all data into a byte[] array, then construct a String from that.
Edit: Responding to a related question in the comments on Abhishek's answer.
Q:
Abhishek wrote: Can you please enlighten me a little more? i have tried casting integer ASCII to character..it has worked..can you kindly tell where did i go wrong?
A:
You didn't go "wrong", per se. The reason ASCII works is the same reason that Brian pointed out that ISO-8859-1 works. US-ASCII is a single byte encoding, and bytes 0x00-0x7f have the same value as their corresponding Unicode code points. So a cast to char is conceptually incorrect, but in practice, since the values are the same, it works. Same with ISO-8859-1; bytes 0x00-0xff have the same value as their corresponding code points in that encoding. A cast to char would not work in e.g. IBM01141 (a single byte encoding but with different values).
And, of course, a single byte to char cast would not work for multibyte encodings like UTF-16, as more than one input byte must be read (a variable number, in fact) to determine the correct value of a corresponding char.
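A short sketch of the difference between casting and decoding. IBM01141 may not be installed on every JVM, so this example swaps in windows-1252 as the "single byte but different values" charset; its availability is likewise an assumption, and the class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CastVersusDecode {
    // What (char) ch does with a value returned by InputStream.read().
    static char castByte(int b) {
        return (char) (b & 0xFF);
    }

    // What a real decoder does with the same single byte.
    static char decodeByte(int b, Charset charset) {
        return new String(new byte[] {(byte) b}, charset).charAt(0);
    }

    public static void main(String[] args) {
        // ISO-8859-1: byte value and code point coincide, so the cast happens to work.
        System.out.println(castByte(0xE9) == decodeByte(0xE9, StandardCharsets.ISO_8859_1)); // prints true

        // windows-1252 (assumed available): 0x80 is the euro sign U+20AC, not U+0080.
        Charset cp1252 = Charset.forName("windows-1252");
        System.out.println(castByte(0x80) == decodeByte(0x80, cp1252)); // prints false
        System.out.println(decodeByte(0x80, cp1252) == '\u20AC');       // prints true
    }
}
```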
type += new String(String.valueOf(ch).getBytes("ISO-8859-1"));
Partial answer: Try replacing :
type += new String(ch, "ISO-8859-1");
by
type+=(char)ch;
This can be done if the value you receive is the ASCII code of the character. The cast converts the ASCII code into a char.
It's better to avoid lengthy code, and this would work just fine. The read() method comes in several forms:
One is: int b = inpstr.read();
Another is: inpstr.read(byte[])
So it's up to you which one you want to use; they serve different purposes.

Decode bytes to chars one at a time

I have an arbitrary chunk of bytes that represent chars, encoded in an arbitrary scheme (may be ASCII, UTF-8, UTF-16). I know the encoding.
What I'm trying to do is find the location of the last new line (\n) in the array of bytes. I want to know how many bytes are left over after reading the last encoded \n.
I can't find anything in the JDK or any other library that will let me convert a byte array to chars one by one. InputStreamReader reads the stream in chunks, not giving me any indication how many bytes are getting read to produce a char.
Am I going to have to do something as horrible as re-encoding each char to figure out its byte length?
You can try something like this
CharsetDecoder cd = Charset.forName("UTF-8").newDecoder();
ByteBuffer in = ByteBuffer.wrap(bytes);
CharBuffer out = CharBuffer.allocate(1);
int p = 0;
while (in.hasRemaining()) {
cd.decode(in, out, true);
char c = out.array()[0];
int nBytes = in.position() - p;
p = in.position();
out.position(0);
}
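Building on the same idea, here is a sketch that answers the original question directly: how many bytes follow the last encoded '\n'. It decodes one character at a time and falls back to a two-char buffer when a character needs a surrogate pair, which a one-char buffer cannot hold. The class and method names are hypothetical, malformed input is reported as an exception, and if there is no newline at all the whole array counts as left over.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class LastNewline {
    static int bytesAfterLastNewline(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        CharsetDecoder decoder = charset.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer one = CharBuffer.allocate(1);
        CharBuffer two = CharBuffer.allocate(2); // room for a surrogate pair
        int lastNewlineEnd = 0;                  // byte offset just past the last '\n'

        while (in.hasRemaining()) {
            int start = in.position();
            one.clear();
            CharBuffer out = one;
            CoderResult result = decoder.decode(in, out, true);
            if (result.isOverflow() && out.position() == 0) {
                // Nothing fit into one char: retry with space for two.
                two.clear();
                out = two;
                result = decoder.decode(in, out, true);
            }
            if (result.isError()) {
                result.throwException();
            }
            if (in.position() == start) {
                break; // no progress; avoid looping forever
            }
            if (out.position() > 0 && out.get(0) == '\n') {
                lastNewlineEnd = in.position();
            }
        }
        return bytes.length - lastNewlineEnd;
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] utf8 = "\u00e9\nab".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytesAfterLastNewline(utf8, StandardCharsets.UTF_8)); // prints 2
    }
}
```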

How to read fixed length character using BufferedReader or other method?

If I use BufferedReader to read a line, I can get the line as a string.
The code is this :
FileInputStream fs = new FileInputStream("E:\\tmp\\aaa.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(fs));
String line = null;
while ((line = br.readLine()) != null) {
System.out.println(line.length() + " " + line.substring(0, 2));
}
The contents of aaa.txt is :
一二三四1234
so. the result of running the code is :
8 一二
From the result, I know the length of a Chinese character in a String is one, not two.
So if I use line.substring(0,2), I get two Chinese characters, "一二".
But I hope that the result of line.substring(0,2) is "一".
I mean that, in my eyes, the length of "一二三四1234" is 12, not 8, so that I can use substring(0,2) to extract a fixed-length piece.
Thanks in advance.
From the result , I know the length of a chinese character in String is one, not two.
That's right: every sign is one char, so the length of the string "一二三四1234" is 8.
So why 12?
I mean that, in my eye the length of "一二三四1234" is 12, not 8.I can use substring(0,2) to extract fixed length character.
if you know the index of the char you want you can use the code below instead:
String s = "一二三四1234";
char c = s.charAt(0);
That is because the method substring(0, 2) creates a new String from index 0 up to, but not including, index 2.
Java uses Unicode as its internal charset, so every char is a Unicode value, and java.lang.String consists of chars.
When you get a string from a reader, the bytes of the file have already been translated to chars based on the encoding of the file.
line.substring(0, 2) returns a new string with the first two chars of the line, and that's what you already got!
I guess that "in my eye the length" means what you see in a text editor like UltraEdit; maybe the editor just shows the position of bytes in the file.
If I use line.substring(0,2), I get two chinese character "一二".
So you got two characters. That's what you asked for. The two characters at index 0 and 1.
But I hope that, the result of line.substring(0,2) is "一".
If you only want one character, ask for one character. The character at index 0. line.substring(0,1) for example.
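First you need to decode the file using a Chinese encoding, such as GBK, GB2312, etc.
Read the line into a byte array and then convert that byte array into a string using the Chinese encoding.
// Requires java.io.File, java.io.FileInputStream, java.util.Arrays and java.nio.charset.Charset.
FileInputStream fileStream = new FileInputStream(new File("sometext.txt"));
byte[] buf = new byte[12];
int bytesRead = fileStream.read(buf); // number of bytes actually read into buf
fileStream.close();
// The first two bytes hold one Chinese character when the file is GBK-encoded.
byte[] byteRange = Arrays.copyOfRange(buf, 0, 2);
String chineseString = new String(byteRange, Charset.forName("GBK"));
This way you will only get 1 Chinese character. There is only one conversion step, from the GBK bytes to a Java string.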
Oh, yeah! An improvement from the previous method.
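The "12 versus 8" confusion comes from counting bytes rather than chars. If a fixed byte width is really what is wanted, one option is to cut the string by encoded size. The following is a rough sketch, with a hypothetical helper prefixByBytes, that never splits a character; it is demonstrated with UTF-8 (three bytes per Chinese character here) because every JVM supports it, and it assumes a charset that encodes each character independently, without a byte order mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BytePrefixDemo {
    // Longest prefix of s whose encoding in cs fits into maxBytes.
    static String prefixByBytes(String s, int maxBytes, Charset cs) {
        int used = 0;
        int end = 0;
        while (end < s.length()) {
            int charCount = Character.charCount(s.codePointAt(end));
            int size = s.substring(end, end + charCount).getBytes(cs).length;
            if (used + size > maxBytes) {
                break; // the next character would not fit
            }
            used += size;
            end += charCount;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // "一二三四1234" written with escapes so the source file encoding does not matter.
        String line = "\u4e00\u4e8c\u4e09\u56db1234";
        System.out.println(line.length());                                  // prints 8
        System.out.println(line.getBytes(StandardCharsets.UTF_8).length);   // prints 16
        System.out.println(prefixByBytes(line, 3, StandardCharsets.UTF_8).length()); // prints 1
    }
}
```

With a two-byte-per-character charset such as GBK (where the runtime provides it), the same call with maxBytes set to 2 would yield the single character the question asks for.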
