How to read fixed-length characters using BufferedReader or another method? - java

If I use BufferedReader to read a line, I get the line as a String.
The code is this:
FileInputStream fs = new FileInputStream("E:\\tmp\\aaa.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(fs));
String line = null;
while ((line = br.readLine()) != null) {
System.out.println(line.length() + " " + line.substring(0, 2));
}
The content of aaa.txt is:
一二三四1234
So the result of running the code is:
8 一二
From the result, I know the length of a Chinese character in a String is one, not two.
So if I use line.substring(0,2), I get two Chinese characters, "一二".
But I hope that the result of line.substring(0,2) is "一".
I mean that, in my eyes, the length of "一二三四1234" is 12, not 8, so that I can use substring(0,2) to
extract fixed-length characters.
Thanks in advance.

From the result, I know the length of a Chinese character in a String is one, not two.
That's right: every character here is one char, so the length of the string "一二三四1234" is 8.
So why 12?
I mean that, in my eyes, the length of "一二三四1234" is 12, not 8, so that I can use substring(0,2) to extract fixed-length characters.
If you know the index of the char you want, you can use the code below instead:
String s = "一二三四1234";
char c = s.charAt(0);
This is because substring(0, 2) creates a new String from index 0 up to, but not including, index 2.

Java uses Unicode as its internal character set, so every char is a Unicode (UTF-16) code unit, and a java.lang.String consists of chars.
When you get a string from a reader, the bytes of the file have already been translated to chars based on the encoding the reader uses.
line.substring(0, 2) returns a new string with the first two chars of the line, and that's what you already got!
I guess that by "in my eyes the length" you mean what you see in a text editor like UltraEdit; maybe
the editor just shows the position in bytes within the file.
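The difference between the two counts can be checked directly. The sketch below assumes the file was saved as GBK (the question does not say, but 12 bytes for this line matches a two-byte Chinese encoding); the class and method names are made up for illustration, and the Chinese text is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;

class CharVersusByteLength {
    // Number of bytes the text occupies once encoded with the named charset.
    static int encodedLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // "一二三四1234" written with Unicode escapes
        String line = "\u4e00\u4e8c\u4e09\u56db1234";

        System.out.println(line.length());                 // 8 chars in the String
        System.out.println(encodedLength(line, "GBK"));    // 12 bytes: 4*2 + 4*1
        System.out.println(encodedLength(line, "UTF-8"));  // 16 bytes: 4*3 + 4*1
    }
}
```

So 8 is the number of chars in the String, while 12 is a byte count that only holds for one particular encoding.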

If I use line.substring(0,2), I get two Chinese characters, "一二".
So you got two characters. That's what you asked for: the two characters at index 0 and 1.
But I hope that the result of line.substring(0,2) is "一".
If you only want one character, ask for one character: the character at index 0. For example, line.substring(0,1).

First you need to decode the file using a Chinese encoding, such as GBK or GB2312.
Read the line into a byte array and then convert that byte array into a string using that encoding.
// requires java.io.File, java.io.FileInputStream, java.nio.charset.Charset, java.util.Arrays
FileInputStream fileStream = new FileInputStream(new File("sometext.txt"));
byte[] buf = new byte[12];
int bytesRead = fileStream.read(buf); // read() fills buf and returns the number of bytes read
byte[] byteRange = Arrays.copyOfRange(buf, 0, 2); // the first two bytes are one GBK-encoded Chinese character
String chineseString = new String(byteRange, Charset.forName("GBK"));
This way you will get only 1 Chinese character. There is only one conversion step, from GBK bytes to Java's internal Unicode representation.
Oh, yeah! An improvement from the previous method.
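Cutting the raw bytes at a fixed position only works while the cut happens to fall on a character boundary. A possible alternative, sketched here under the assumption that the data is GBK, is to walk the String and keep characters while their encoded size still fits the byte budget. The helper name prefixByBytes is hypothetical and not part of any library.

```java
import java.nio.charset.Charset;

class BytePrefix {
    // Longest prefix of text whose encoded form fits in maxBytes,
    // without ever cutting a character in half.
    static String prefixByBytes(String text, int maxBytes, Charset charset) {
        int used = 0; // bytes consumed so far
        int end = 0;  // index in text where the prefix ends
        while (end < text.length()) {
            int width = Character.charCount(text.codePointAt(end));
            int size = text.substring(end, end + width).getBytes(charset).length;
            if (used + size > maxBytes) {
                break;
            }
            used += size;
            end += width;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String line = "\u4e00\u4e8c\u4e09\u56db1234"; // "一二三四1234"
        System.out.println(prefixByBytes(line, 2, gbk)); // one Chinese character
        System.out.println(prefixByBytes(line, 9, gbk)); // four Chinese characters and "1"
    }
}
```

With a budget of 3 bytes the result is still a single Chinese character, because the second one would not fit completely.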

Related

The readChar() method displays japanese character

I'm trying to write code that picks up a word from a file according to an index entered by the user, but the problem is that the readChar() method of the RandomAccessFile class is returning Japanese characters. I must admit that it's not the first time I've seen this on my Lenovo laptop; sometimes in installation wizards I see normal characters mixed with Japanese characters. Do you think it comes from the laptop or rather from the code?
This is the code:
package com.project;
import java.io.*;
import java.util.StringTokenizer;
public class Main {
public static void main(String[] args) throws IOException {
int N, i=0;
char C;
char[] charArray = new char[100];
String fileLocation = "file.txt";
BufferedReader buffer = new BufferedReader(new InputStreamReader(System.in));
do {
System.out.println("enter the index of the word");
N = Integer.parseInt(buffer.readLine());
if (N!=0) {
RandomAccessFile word = new RandomAccessFile(new File(fileLocation), "r");
do {
word.seek((2*(N-1))+i);
C = word.readChar();
charArray[i] = C;
i++;
}while(charArray[i-1] != ' ');
System.out.println("the word of index " + N + " is: " );
for (char carTemp : charArray )
System.out.print(carTemp);
System.out.print("\n");
}
}while(N!=0);
buffer.close();
}
}
I get this output:
瑯潕啰灰灥敲牃䍡慳獥攨⠩⤍ഊੴ瑯潌䱯潷睥敲牃䍡慳獥攨⠩⤍ഊ੣捯潭浣捡慴琨⡓却瑲物楮湧朩⤍ഊ੣捨桡慲牁䅴琨⡩楮湴琩⤍ഊੳ獵畢扳獴瑲物楮湧木⠠⁳獴瑡慲牴琠⁩楮湤摥數砬Ⱐ⁥敮湤搠⁩楮湤摥數砩⤍ഊੴ瑲物業洨⠩Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 100 out of bounds for length 100
at Main.main(Main.java:21)
There are many things wrong, all of which have to do with fundamental misconceptions.
First off: A file on your disk - never mind the File interface in Java, or any other programming language; the file itself - does not and cannot store text. Ever. It stores bytes. That is, raw data, as (on every machine that's been relevant for decades, but historically there have been other ways to do it) quantified in bits, which are organized into groups of 8 that are called bytes.
Text is an abstraction; an interpretation of some particular sequence of byte values. It depends - fundamentally and unavoidably - on an encoding. Because this isn't a blog, I'll spare you the history lesson here, but suffice to say that Java's char type does not simply store a character of text. It stores an unsigned two-byte value, which may represent a character of text. Because there are more characters of text in Unicode than two bytes can represent, sometimes two adjacent chars in an array are required to represent a character of text. (And, of course, there is probably code out there that abuses the char type simply because someone wanted an unsigned equivalent of short. I may even have written some myself. That era is a blur for me.)
Anyway, the point is: using .readChar() is going to read two bytes from your file, and store them into a char within your char[], and the corresponding numeric value is not going to be anything like the one you wanted - unless your file happens to be encoded using the same encoding that Java uses natively, called UTF-16.
You cannot properly read and interpret the file without knowing the file encoding. Full stop. You can at best delude yourself into believing that you can read it. You also cannot have "random access" to a text file - i.e., indexing according to a number of characters of text - unless the encoding in question is constant width. (Otherwise, of course, you can't just calculate the distance-in-bytes into the file where a given character of text is; it depends on how many bytes the previous characters took up, which depends on which characters they are.) Many text encodings are not constant width. One of the most popular, which frankly is the sane default recommendation for most tasks these days, is not. In which case you are simply out of luck for the problem you describe.
At any rate, once you know the encoding of your file, the expected way to retrieve a character of text from a file in Java is to use one of the Reader classes, such as InputStreamReader:
An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.
(Here, charset simply means an instance of the class that Java uses to represent text encodings.)
You may be able to fudge your problem description a little bit: seek to a byte offset, and then grab the text characters starting at that offset. However, there is no guarantee that the "text characters starting at that offset" make any sense, or in fact can be decoded at all. If the offset happens to be in the middle of a multi-byte encoding for a character, the remaining part isn't necessarily valid encoded text.
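As a rough illustration of the Reader approach, the sketch below decodes a byte stream with an explicitly chosen charset. It reads from an in-memory stream so that it is self-contained; with a real file you would wrap a FileInputStream instead. The class and method names are invented for the example, and the choice of UTF-8 is an assumption about the data.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeWithReader {
    // Decodes the whole stream into a String using the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            int c;
            while ((c = reader.read()) != -1) { // read() returns one decoded char, or -1 at end
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] fileBytes = "caf\u00e9 au lait".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8);
        System.out.println(text.length());     // 11 chars
        System.out.println(fileBytes.length);  // 12 bytes, because the accented letter takes two
    }
}
```

Once the text is in a String, indexing by character is straightforward; it is indexing the undecoded file by character that is the problem.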
char is 16 bits, i.e. 2 bytes.
seek seeks to a byte boundary.
If the file contains chars then they are at even offsets: 0, 2, 4...
The expression (2*(N-1))+i is even iff i is even; if odd, you are sure to land in the middle of a char, and thus read garbage.
i starts at zero, but you increment by 1, i.e., half a character.
Your seek argument should probably be (2*(N-1+i)).
Alternative explanation: your file does not contain chars at all; for example, you created an ASCII file in which a character is a single byte.
In that case, the error is attempting to read ASCII (an obsolete character encoding) with a readChar function.
But if the file contains ASCII, the purpose of multiplying by 2 in the seek argument is obscure. It apparently serves no useful purpose.
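The single-byte explanation can be tried out in isolation. This sketch feeds two ASCII bytes to readChar() through a DataInputStream, which follows the same DataInput contract as RandomAccessFile; the class and method names are only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadCharOnAscii {
    // Encodes the text as single-byte ASCII, then reads the first two bytes as one char.
    static char firstCharViaReadChar(String asciiText) {
        byte[] bytes = asciiText.getBytes(StandardCharsets.US_ASCII);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readChar(); // combines two bytes: (first << 8) | second
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        char c = firstCharViaReadChar("to");
        // 't' is 0x74 and 'o' is 0x6f, so the combined value is 0x746f
        System.out.println(Integer.toHexString(c)); // 746f
    }
}
```

A value such as 0x746f falls in the CJK range of Unicode, which would explain why pairs of ASCII letters show up as East Asian characters.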
I changed the encoding of the file to UTF-16 and modified the program to display the right indexes, those that represent the beginning of each word. Now it works fine. Thank you guys.
import java.io.*;
public class Main {
public static void main(String[] args) throws IOException {
int N, i=0, j=0, k=0;
char C;
char[] charArray = new char[100];
String fileLocation = "file.txt";
BufferedReader buffer = new BufferedReader(new InputStreamReader(System.in));
DataInputStream in = new DataInputStream(new FileInputStream(fileLocation));
boolean EOF=false;
do {
try {
j++;
C = in.readChar();
if((C==' ')||(C=='\n')){
System.out.print(j+1+"\t");
}
}catch (IOException e){
EOF=true;
}
}while (EOF!=true);
System.out.println("\n");
do {
System.out.println("enter the index of the word");
N = Integer.parseInt(buffer.readLine());
if (N!=0) {
RandomAccessFile word = new RandomAccessFile(new File(fileLocation), "r");
do {
word.seek((2*(N-1+i)));
C = word.readChar();
charArray[i] = C;
i++;
}while(charArray[i-1] != ' ' && charArray[i-1] != '\n');
System.out.print("the word of index " + N + " is: " );
for (char carTemp : charArray )
System.out.print(carTemp);
System.out.print("\n");
i=0;
charArray = new char[100];
}
}while(N!=0);
buffer.close();
}
}

RandomAccessFile and UTF-8 line

I use a RandomAccessFile object to read a UTF-8 French file. I use the readLine method.
My Groovy code below:
while ((line = randomAccess.readLine())) {
def utfLine = new String(line.getBytes('UTF-8'), 'UTF-8')
++count
long nextRecordPos = randomAccess.getFilePointer()
compareNextRecords(utfLine, randomAccess)
randomAccess.seek(nextRecordPos)
}
My problem is that utfLine and line are the same: the accented characters stay like é instead of é. No conversion is done.
First of all, this line of code does absolutely nothing. The data is the same. Remove it:
def utfLine = new String(line.getBytes('UTF-8'), 'UTF-8')
According to the Javadoc, RandomAccessFile.readLine() is not aware of character encodings. It reads bytes until it encounters "\r" or "\n" or "\r\n". ASCII byte values are put into the returned string in the normal way. But byte values between 128 and 255 are put into the string literally without interpreting it as being in a character encoding (or you could say this is the raw/verbatim encoding).
There is no method or constructor to set the character encoding in a RandomAccessFile. But it's still valuable to use readLine() because it takes care of parsing for a newline sequence and allocating memory.
The easiest solution in your situation is to manually convert the fake "line" into bytes by reversing what readLine() did, then decode the bytes into a real string with awareness of character encoding. I don't know how to write code in Groovy, so I'll give the answer in Java:
String fakeLine = randomAccess.readLine();
byte[] bytes = new byte[fakeLine.length()];
for (int i = 0; i < fakeLine.length(); i++)
bytes[i] = (byte)fakeLine.charAt(i);
String realLine = new String(bytes, "UTF-8");
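A shorter variant of the same idea, offered as a sketch: because readLine() puts each byte into the low eight bits of a char, encoding the fake line as ISO-8859-1 gives back the original bytes, which can then be decoded as UTF-8. The class name, method name and temporary file are made up for the example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadLineUtf8 {
    // Reverses readLine()'s byte-to-char widening, then decodes the bytes as UTF-8.
    static String redecode(String rawLine) {
        byte[] originalBytes = rawLine.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("readline-demo", ".txt");
        try {
            // First line is the French word "été", written as UTF-8
            Files.write(tmp, "\u00e9t\u00e9\nfin\n".getBytes(StandardCharsets.UTF_8));
            try (RandomAccessFile file = new RandomAccessFile(tmp.toFile(), "r")) {
                String raw = file.readLine();
                System.out.println(raw.length());            // 5, one char per byte
                System.out.println(redecode(raw).length());  // 3, the real characters
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The file pointer logic from the question is unaffected, since only the returned String is converted.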

Java NIO server receives random string [duplicate]

I'm writing a web application in Google app Engine. It allows people to basically edit html code that gets stored as an .html file in the blobstore.
I'm using fetchData to return a byte[] of all the characters in the file. I'm trying to print to an html in order for the user to edit the html code. Everything works great!
Here's my only problem now:
The byte array is having some issues when converting back to a string. Smart quotes and a couple of characters are coming out looking funky. (?'s or japanese symbols etc.) Specifically it's several bytes I'm seeing that have negative values which are causing the problem.
The smart quotes are coming back as -108 and -109 in the byte array. Why is this and how can I decode the negative bytes to show the correct character encoding?
The byte array contains characters in a special encoding (that you should know). The way to convert it to a String is:
String decoded = new String(bytes, "UTF-8"); // example for one encoding type
By the way, the raw bytes may appear as negative decimals just because the Java datatype byte is signed; it covers the range from -128 to 127.
-109 = 0x93: Control Code "Set Transmit State"
The value -109 (0x93) is a non-printable control character in Unicode, and a lone 0x93 byte is not valid UTF-8. So UTF-8 is not the correct encoding for that byte stream.
0x93 in "Windows-1252" is the "smart quote" that you're looking for; the Java name of that encoding is "Cp1252". The next line provides test code:
System.out.println(new String(new byte[]{-109}, "Cp1252"));
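To make the signed-byte point concrete, here is a small sketch that prints the two bytes from the question as unsigned hex and decodes them as Windows-1252. The class and method names are illustrative only, and Windows-1252 is the assumption suggested by the byte values, not something the question confirms.

```java
import java.nio.charset.Charset;

class SmartQuoteBytes {
    // Reinterprets a signed Java byte as its unsigned value, 0 to 255.
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    // Decodes the bytes as Windows-1252 (also known to Java as Cp1252).
    static String decodeCp1252(byte... bytes) {
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(unsigned((byte) -109))); // 93
        System.out.println(Integer.toHexString(unsigned((byte) -108))); // 94

        String quotes = decodeCp1252((byte) -109, (byte) -108);
        // Print the code points rather than the glyphs, so the console encoding does not matter
        for (char c : quotes.toCharArray()) {
            System.out.println("U+" + Integer.toHexString(c).toUpperCase()); // U+201C, then U+201D
        }
    }
}
```

U+201C and U+201D are the left and right double quotation marks.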
Java 7 and above
You can also pass your desired encoding to the String constructor as a Charset constant from StandardCharsets. This may be safer than passing the encoding as a String, as suggested in the other answers.
For example, for UTF-8 encoding
String bytesAsString = new String(bytes, StandardCharsets.UTF_8);
You can try this.
String s = new String(bytearray);
public class Main {
/**
* Example method for converting a byte to a String.
*/
public void convertByteToString() {
byte b = 65;
//Using the static toString method of the Byte class
System.out.println(Byte.toString(b));
//Using simple concatenation with an empty String
System.out.println(b + "");
//Creating a byte array and passing it to the String constructor
System.out.println(new String(new byte[] {b}));
}
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
new Main().convertByteToString();
}
}
Output
65
65
A
public static String readFile(String fn) throws IOException
{
File f = new File(fn);
byte[] buffer = new byte[(int)f.length()];
FileInputStream is = new FileInputStream(fn);
is.read(buffer);
is.close();
return new String(buffer, "UTF-8"); // use desired encoding
}
I suggest Arrays.toString(byte_array);
It depends on your purpose. For example, I wanted to save a byte array in exactly the format you see while debugging, something like [1, 2, 3]. If you want to save the values exactly as they are, without converting the bytes to characters, Arrays.toString(byte_array) does this. But if you want to save characters instead of bytes, you should use String s = new String(byte_array). In this case, s contains the characters that the bytes [1, 2, 3] decode to.
The previous answer from Andreas_D is good. I'm just going to add that wherever you are displaying the output there will be a font and a character encoding and it may not support some characters.
To work out whether it is Java or your display that is a problem, do this:
for(int i=0;i<str.length();i++) {
char ch = str.charAt(i);
System.out.println(i+" : "+ch+" "+Integer.toHexString(ch)+((ch=='\ufffd') ? " Unknown character" : ""));
}
Java will have mapped any characters it cannot understand to 0xfffd the official character for unknown characters. If you see a '?' in the output, but it is not mapped to 0xfffd, it is your display font or encoding that is the problem, not Java.
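A quick way to see the 0xfffd mapping in action, as a sketch: decode a byte that is not valid in the chosen encoding and look for the replacement character. The class and helper names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // True if decoding left at least one U+FFFD replacement character in the string.
    static boolean hasReplacementChar(String s) {
        return s.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        byte[] smartQuote = { (byte) 0x93 }; // a Windows-1252 smart quote, invalid on its own in UTF-8

        String wrong = new String(smartQuote, StandardCharsets.UTF_8);
        System.out.println(hasReplacementChar(wrong));   // true: the decoder gave up on this byte

        String latin1 = new String(smartQuote, StandardCharsets.ISO_8859_1);
        System.out.println(hasReplacementChar(latin1));  // false: every byte maps to some char in ISO-8859-1
    }
}
```

Note that the absence of U+FFFD does not prove the encoding was right; ISO-8859-1 accepts every byte but here yields a control character rather than a quote.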

How to convert the byte offset from sqlite FTS query into a character offset in java

I have a problem where I am searching my FTS tables in android and I get returned a byte offset for the result :
col termno byteoffset size
1 0 111 4
However, the problem is that cursor.getString(colNo) gives me a UTF-16 string, after which I am unable to work out which character of the text is the start/end of the match.
It's a similar problem to: Detect character position in an UTF NSString from a byte offset (was SQLite offsets() and encoding problem)
However, I cannot fathom the solution in that question. So how can I accurately know the character offsets in my string (for highlighting) once I know the byte offsets?
Encode your string back to the same encoding that SQLite was using, then extract the pieces you want in byte form and convert them back to strings:
String chars = cursor.getString(colNo);
byte[] bytes = chars.getBytes("UTF-8");
String prefix = new String(bytes, 0, byteOffset, "UTF-8");
String match = new String(bytes, byteOffset, size, "UTF-8");
int charOffset = prefix.length();
int charSize = match.length();
(Assuming that your data is encoded as UTF-8 bytes, which is probable.)
It is unfortunate that you have to do all this redundant encoding and decoding. It might perhaps be worth adding optimisations to short-cut the pure-ASCII common case.
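One way the pure-ASCII shortcut might look, as a sketch assuming the database text is UTF-8: if every char in the string is below 0x80, byte offsets and char offsets coincide and no re-encoding is needed. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class ByteToCharOffset {
    // Converts an offset into the UTF-8 bytes of text into an offset into its chars.
    static int charOffset(String text, int byteOffset) {
        boolean ascii = true;
        for (int i = 0; i < text.length() && ascii; i++) {
            ascii = text.charAt(i) < 0x80;
        }
        if (ascii) {
            return byteOffset; // one byte per char, so the offsets are identical
        }
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // Decode only the bytes before the offset and count the resulting chars
        return new String(bytes, 0, byteOffset, StandardCharsets.UTF_8).length();
    }

    public static void main(String[] args) {
        // "一二 abc": each Chinese character is 3 bytes in UTF-8, so "abc" starts at byte 7
        System.out.println(charOffset("\u4e00\u4e8c abc", 7)); // 3
        System.out.println(charOffset("abc def", 4));          // 4
    }
}
```

The end of the match can be found the same way by converting byteOffset + size.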

How to convert 1s and 0s to String?

Please have a look at the following machine code
0111001101110100011100100110010101110011011100110110010101100100
This means something. I need to convert this to a string. When I use Integer.parseInt() with the above as the string and 2 as the radix (to convert it to bytes), it throws a NumberFormatException.
And I believe I have to separate this into groups of 8 bits (like 01110011, 10111010, etc.). Am I correct?
Please help me to convert this correctly to string.
Thanks
final String s =
"0111001101110100011100100110010101110011011100110110010101100100";
final StringBuilder b = new StringBuilder();
for (int i = 0; i < s.length(); i+=8)
b.append((char)Integer.parseInt(s.substring(i,i+8),2));
System.out.println(b);
prints "stressed"
A shorter way of reading large integers is to use BigInteger:
// requires java.math.BigInteger and java.nio.charset.StandardCharsets
final String s = "0111001101110100011100100110010101110011011100110110010101100100";
System.out.println(new String(new BigInteger('0'+s, 2).toByteArray(), StandardCharsets.US_ASCII));
prints
stressed
It depends on the encoding of the String.
An ASCII-encoded string uses 1 byte for each character, while Unicode encodings vary: UTF-16 uses 2 or 4 bytes per character and UTF-8 uses 1 to 4. There are many other encodings, and the binary layout differs for each.
So you need to find the encoding that was used to write this string to binary format.
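To make the encoding an explicit choice rather than an accident of casting to char, the bits can first be turned into bytes and then decoded with a named charset. This is a sketch; the class and method names are invented, and US-ASCII is an assumption that happens to fit the sample data.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BitsToText {
    // Packs a string of '0' and '1' into bytes (8 bits each), then decodes them.
    static String decodeBits(String bits, Charset charset) {
        if (bits.length() % 8 != 0) {
            throw new IllegalArgumentException("bit count must be a multiple of 8");
        }
        byte[] bytes = new byte[bits.length() / 8];
        for (int i = 0; i < bytes.length; i++) {
            String group = bits.substring(i * 8, i * 8 + 8);
            bytes[i] = (byte) Integer.parseInt(group, 2); // 0..255, narrowed to a signed byte
        }
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String bits = "0111001101110100011100100110010101110011011100110110010101100100";
        System.out.println(decodeBits(bits, StandardCharsets.US_ASCII)); // stressed
    }
}
```

Passing a different charset, for example StandardCharsets.UTF_8, is then a one-argument change if the data turns out to contain multi-byte characters.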
