Encoding-aware RandomAccessReader implementation?


The default implementation of RandomAccessFile is 'broken', in the sense that you can't specify which encoding your file is in.
I'm looking for an alternative which matches the following criteria:
Encoding-aware
Random access! (dealing with very big files, need to be able to position the cursor using a byte offset without streaming the whole thing).
I had a poke around in Commons IO, but there's nothing there. I'd rather not have to implement this myself, because there are entirely too many places it could go wrong.

RandomAccessFile is intended for accessing binary data. It is not possible to efficiently create a random access encoded file which is appropriate in all situations.
Even if you find such a solution I would check it carefully to ensure it suits your needs.
If you were to write it, I would suggest addressing positions by row and column rather than by character offset from the start of the file.
This has the advantage that you only have to remember where each line starts, and you can then scan the line to reach your character. If you instead indexed the position of every character, the index could take 4 bytes per character (assuming the file is < 4 GB).
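The row-and-column idea can be sketched roughly as follows. This is a minimal illustration, not a tested library: the LineOffsetIndex class and its method names are made up for this example, and it assumes an encoding in which the byte 0x0A only ever means a line feed (true for UTF-8 and ISO-8859-1, not for UTF-16).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: remembers the byte offset at which each line starts. */
final class LineOffsetIndex {
    private final List<Long> lineStarts = new ArrayList<>();

    /** Scans the file once, recording the offset that follows every '\n'. */
    LineOffsetIndex(RandomAccessFile file) throws IOException {
        lineStarts.add(0L);
        file.seek(0);
        byte[] buffer = new byte[8192];
        long position = 0;
        int read;
        while ((read = file.read(buffer)) != -1) {
            for (int i = 0; i < read; i++) {
                if (buffer[i] == '\n') {
                    lineStarts.add(position + i + 1);
                }
            }
            position += read;
        }
        // An offset equal to the file length starts no line (empty file or trailing '\n').
        if (lineStarts.get(lineStarts.size() - 1) == position) {
            lineStarts.remove(lineStarts.size() - 1);
        }
    }

    int lineCount() {
        return lineStarts.size();
    }

    /** Seeks straight to the requested 0-based line and decodes only that line. */
    String readLine(RandomAccessFile file, int lineNumber, Charset charset) throws IOException {
        long start = lineStarts.get(lineNumber);
        long end = lineNumber + 1 < lineStarts.size() ? lineStarts.get(lineNumber + 1) : file.length();
        byte[] bytes = new byte[(int) (end - start)]; // assumes a single line fits in an array
        file.seek(start);
        file.readFully(bytes);
        int length = bytes.length;
        // Drop the line terminator ("\n" or "\r\n") before decoding.
        if (length > 0 && bytes[length - 1] == '\n') length--;
        if (length > 0 && bytes[length - 1] == '\r') length--;
        return new String(bytes, 0, length, charset);
    }
}
```

The index costs one long per line rather than several bytes per character, and finding a column is then a matter of decoding one line and indexing into the resulting String.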

The answer turned out to be less painful than I assumed:
// This gives me access to buffering and charset magic
new BufferedReader(new InputStreamReader(Channels.newInputStream(randomAccessFile.getChannel()), encoding));
....
I can then implement a readLine() method which reads character by character. Using String.getBytes(encoding) I can keep track of the offset in the file. Calling seek() on the underlying RandomAccessFile allows me to reposition the cursor at will. There are probably some bugs lurking in there, but the basic tests seem to work.
public String readLine() throws IOException {
    eol = "";
    lastLineByteCount = 0;
    StringBuilder builder = new StringBuilder();
    char[] characters = new char[1];
    int status = reader.read(characters, 0, 1);
    if (status == -1) {
        return null; // end of file, nothing left to read
    }
    char c = characters[0];
    while (status != -1) {
        if (c == '\n') {
            eol += c; // end of line reached
            break;
        }
        if (c == '\r') {
            eol += c; // kept out of the line but counted towards its byte length
        } else {
            builder.append(c);
        }
        status = reader.read(characters, 0, 1);
        c = characters[0];
    }
    String line = builder.toString();
    // Re-encode the line and its terminator to learn how many bytes they occupied in the file
    lastLineByteCount = line.getBytes(encoding).length + eol.getBytes(encoding).length;
    return line;
}

Related

Java: number of lines in a file without processing it

I need to know the number of lines in a file before processing it, or in the worst-case scenario read it twice. I wrote the code below, but it does not work, so maybe it is just not possible?
InputStream inputStream2 = getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(getInputStream()));
String line;
int numLines = 0;
while ((line = reader.readLine()) != null) {
    numLines++;
}
TextFileDataCollection dataCollection = new TextFileDataCollection(numLines, 50);
BufferedReader reader2 = new BufferedReader(new InputStreamReader(inputStream2));
while ((line = reader2.readLine()) != null) {
    StringTokenizer st = new StringTokenizer(reader2.readLine(), ",");
    while (st.hasMoreElements()) {
        System.out.println(st.nextElement());
    }
}

Here's a similar question with java code, although it's a bit older:
Number of lines in a file in Java
public static int countLines(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}
EDIT:
Here's a reference related to inputstreams specifically:
From Total number of rows in an InputStream (or CsvMapper) in Java
"Unless you know the row count ahead of time, it is not possible without looping. You have to read that file in its entirety to know how many lines are in it, and neither InputStream nor CsvMapper have a means of reading ahead and abstracting that for you (they are both stream oriented interfaces).
None of the interfaces that ObjectReader can operate on support querying the underlying file size (if it's a file) or number of bytes read so far.
One possible option is to create your own custom InputStream that also provides methods for grabbing the total size and number of bytes read so far, e.g. if it is reading from a file, it can expose the underlying File.length() and also track the number of bytes read. This may not be entirely accurate, especially if Jackson buffers far ahead, but it could get you something at least."
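A custom stream of the kind described in that quote could look something like this. It is only a sketch: the CountingInputStream name and its getBytesRead() method are invented here, and the count becomes misleading if the caller uses mark()/reset() or if a reader buffers far ahead of what has actually been processed.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical wrapper that counts how many bytes have been pulled from the wrapped stream. */
final class CountingInputStream extends FilterInputStream {
    private long bytesRead;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            bytesRead++;
        }
        return b;
    }

    @Override
    public int read(byte[] buffer, int offset, int length) throws IOException {
        int n = super.read(buffer, offset, length);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        bytesRead += skipped;
        return skipped;
    }

    /** Number of bytes consumed so far, including skipped ones. */
    long getBytesRead() {
        return bytesRead;
    }
}
```

When the source is a file, comparing getBytesRead() with File.length() gives a rough progress figure, but it still does not reveal the number of lines ahead of time.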

You write
I need to know the number of lines of a file before processing it
but you don't present any file in your code; rather, you present only an InputStream. This makes a difference, because indeed no, you cannot know the number of lines in the input without examining the input to count them.
If you had a file name, File object, or similar mechanism by which you could access the data more than once, then that would be straightforward, but a stream is not guaranteed to be associated with any persistent file -- it might convey data piped from another process or communicated over a network connection, for example. Therefore, each byte provided by a generic InputStream can be read only once.
InputStream does provide an API for marking (mark()) a position and later returning to it (reset()), but stream implementations are not required to support it, and many do not. Those that do support it typically impose a limit on how far past the mark you can read before invalidating it. Readers support such a facility as well, with similar limitations.
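For streams that do support it, the mark()/reset() approach might look like the sketch below. The class and method names are hypothetical, and readLimit is a guess the caller has to make: if more than readLimit bytes are read before reset(), the stream is allowed to discard the mark and reset() may throw.

```java
import java.io.IOException;
import java.io.InputStream;

final class MarkResetLineCount {
    /**
     * Counts '\n' bytes, then rewinds the stream to where it started.
     * Only works if the stream supports mark/reset and the input fits within readLimit.
     */
    static int countLinesThenRewind(InputStream in, int readLimit) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("mark/reset not supported by " + in.getClass().getName());
        }
        in.mark(readLimit);
        int lines = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                lines++;
            }
        }
        in.reset(); // may throw IOException if the mark was invalidated
        return lines;
    }
}
```

Wrapping an arbitrary stream in a BufferedInputStream is one way to get mark support, at the cost of keeping up to readLimit bytes in memory.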
Overall, if your only access to the data is via an InputStream, then your best bet is to process it without relying on advance knowledge of the contents. But if you want to be able to read the data twice, to count lines first, for example, then you need to make your own arrangements to stash the data somewhere in order to ensure your ability to do so. For example, you might copy it to a temporary file, or if you're prepared to rely on the input not being too large for it then you might store the contents in memory as a List of byte, byte[], char, or String.
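As a rough sketch of the in-memory variant, assuming the input is small enough to hold in a byte[] (the TwoPassInput class and its method names are made up for illustration):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

final class TwoPassInput {
    /** Copies the whole stream into memory so it can be read as often as needed. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    /** First pass: counts lines the same way BufferedReader.readLine() sees them. */
    static int countLines(byte[] data, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            int lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }
}
```

The second pass then simply opens another ByteArrayInputStream over the same array. For large inputs, copying to a temporary file and opening it twice follows the same pattern.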

How is the method read() in FileReader moving through a file?

So I just wrote a program that reads a specific file and returns the frequency of each character used. This was done by using a singly linked list(not java LinkedList, but very similar). What I want to know is why this:
while (txtFile.read() != -1) {
    Character letter = (char) txtFile.read();
    freqBag.add(Character.toLowerCase(letter));
}
doesn't work (it doesn't return the correct frequency of the given character), and why this:
int c;
while ((c = txtFile.read()) != -1) {
    Character letter = (char) c;
    freqBag.add(Character.toLowerCase(letter));
}
works. I wrote the first one, and a friend helped me fix it.

It doesn't work because you're discarding characters. Each read() call returns the next character (as an int, or -1 at the end of the file), so your code drops every other character: the ones at even positions (0, 2, 4...) are consumed by the loop condition and never stored.
while (txtFile.read() != -1) { // Read and discard a character
    Character letter = (char) txtFile.read(); // Read a character into letter
    freqBag.add(Character.toLowerCase(letter)); // Store this letter
}
Your friend's version would not work either if it called read() a second time inside the loop, like this:
int c; // variable outside the loop
while ((c = txtFile.read()) != -1) { // Read a character into c, compare to -1
    Character letter = (char) txtFile.read(); // Read another character
    freqBag.add(Character.toLowerCase(letter)); // Store this letter
}
The correct method would be to read just once:
int c;
while ((c = txtFile.read()) != -1) {
    freqBag.add(Character.toLowerCase((char) c));
}
I suspect either you have a typo, or you used a different file and didn't realize that letters were still being dropped.
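To see the read-once pattern in a form that can be run on its own, here is a small sketch. It swaps the custom linked-list bag for a TreeMap purely for illustration (the CharFrequency class is hypothetical), and it accepts any Reader so it can be tried with a StringReader as well as a FileReader.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Map;
import java.util.TreeMap;

final class CharFrequency {
    /** Counts how often each character occurs, ignoring case. */
    static Map<Character, Integer> count(Reader reader) throws IOException {
        Map<Character, Integer> frequencies = new TreeMap<>();
        int c;
        // read() is called exactly once per iteration; its result is both tested and used.
        while ((c = reader.read()) != -1) {
            frequencies.merge(Character.toLowerCase((char) c), 1, Integer::sum);
        }
        return frequencies;
    }
}
```

Calling read() a second time inside the loop body would silently halve the counts, which is the symptom described in the question.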

First of all, keep in mind that every call to read() consumes one character from the file, so calling it in the while condition without storing the result loses that character.
Second, to me (considering operator precedence) these two pieces of code do exactly the same thing, so the problem might be in another part of the code.

In java, when reading in a file one character at a time, how do I determine EOF?

I have to read in a file, use an algorithm to encode each letter, and then print the letters to another file. I know that generally, to find the end of a file, you would use readLine and check whether it returns null. I am using a BufferedReader. Is there any way to check whether there is another character to read? Basically, how do I know that I just read the last character of the file?
I guess I could use readLine and see if there was another line, if I knew how to determine when I was at the end of my current line.
I found that the File class has a method called size() that supposedly returns the length of the file in bytes. Would that tell me how many characters are in the file? Could I do while(charCount<length)?

I don't exactly understand what you want to do. I guess you may want to read a file character by character. If so, you can do:
FileInputStream fileInput = new FileInputStream("file.txt");
int r;
while ((r = fileInput.read()) != -1) {
    char c = (char) r;
    // do something with the character c
}
fileInput.close();
FileInputStream.read() returns -1 when there are no more bytes to read. It returns an int rather than a char, so a cast is required.
Please note that this won't work if your file is in UTF-8 format and contains multi-byte characters. In that case you have to wrap the FileInputStream in an InputStreamReader and specify the appropriate charset. I'm omitting it here for the sake of simplicity.
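For completeness, the charset-aware variant could be sketched like this. The DecodedRead class and readAll method are hypothetical names, and the caller is assumed to know which charset the file was written in.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

final class DecodedRead {
    /** Reads decoded characters one at a time until read() reports end of stream with -1. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            int r;
            while ((r = reader.read()) != -1) {
                text.append((char) r); // r is a decoded char here, not a raw byte
            }
        }
        return text.toString();
    }
}
```

With a file, the call would be along the lines of DecodedRead.readAll(new FileInputStream("file.txt"), StandardCharsets.UTF_8). The end-of-file test is the same -1 check as before; only the unit being read changes from bytes to characters.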

From my understanding, a buffered stream's read() returns -1 when there are no bytes left. So you could write:
BufferedInputStream in = new BufferedInputStream(new FileInputStream("filename"));
int currentChar;
// The parentheses matter: without them, != would be evaluated before the assignment
while ((currentChar = in.read()) != -1) {
    // do something
}
in.close();

How to detect end of string in byte array to string conversion?

I receive a string from a socket in a byte array that looks like:
[128,5,6,3,45,0,0,0,0,0]
The size given by the network protocol is the total length of the string (including zeros), so in my example 10.
If I simply do:
String myString = new String(myBuffer);
I get 5 incorrect characters at the end of the string. The conversion doesn't seem to detect the end-of-string character (0).
To get the correct size and the correct string I do this:
int sizeLabelTmp = 0;
// Iterate over the 10 bytes to get the real size of the string
for (int j = 0; j < sizeLabel; j++) {
    byte charac = datasRec[j];
    if (charac == 0)
        break;
    sizeLabelTmp++;
}
// Create a temp byte array to make a correct conversion
byte[] label = new byte[sizeLabelTmp];
for (int j = 0; j < sizeLabelTmp; j++) {
    label[j] = datasRec[j];
}
String myString = new String(label);
Is there a better way to handle the problem ?
Thanks

Maybe it's too late, but it may help others. The simplest thing you can do is new String(myBuffer).trim(), which strips the trailing zero characters (along with any leading or trailing whitespace).

0 isn't an "end of string character". It's just a byte. Whether or not it only comes at the end of the string depends on what encoding you're using (and what the text can be). For example, if you used UTF-16, every other byte would be 0 for ASCII characters.
If you're sure that the first 0 indicates the end of the string, you can use something like the code you've given, but I'd rewrite it as:
int size = 0;
while (size < data.length)
{
    if (data[size] == 0)
    {
        break;
    }
    size++;
}
// Specify the appropriate encoding as the last argument
String myString = new String(data, 0, size, "UTF-8");
I strongly recommend that you don't just use the platform default encoding - it's not portable, and may well not allow for all Unicode characters. However, you can't just decide arbitrarily - you need to make sure that everything producing and consuming this data agrees on the encoding.
If you're in control of the protocol, it would be much better if you could introduce a length prefix before the string, to indicate how many bytes are in the encoded form. That way you'd be able to read exactly the right amount of data (without "over-reading") and you'd be able to tell if the data was truncated for some reason.
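A length-prefixed exchange might be sketched as follows. The 4-byte big-endian prefix and the UTF-8 encoding are choices made for this example only (the class name is hypothetical); whatever is chosen, both ends of the protocol have to agree on it.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class LengthPrefixedString {
    /** Writes the encoded byte count first, then the encoded bytes. */
    static void write(DataOutputStream out, String text) throws IOException {
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        out.writeInt(encoded.length);
        out.write(encoded);
    }

    /** Reads exactly as many bytes as the prefix announces, no more and no less. */
    static String read(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("Negative string length: " + length);
        }
        byte[] encoded = new byte[length];
        in.readFully(encoded); // throws EOFException if the data was truncated
        return new String(encoded, StandardCharsets.UTF_8);
    }
}
```

Because the reader knows the exact byte count up front, no padding or terminator byte is needed, and a short read shows up as an exception rather than as a silently shortened string.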

You can always start at the end of the byte array and go backwards until you hit the first non-zero byte. Then just copy that part into a new byte array and build the String from it. Hope this helps:
byte[] foo = {28,6,3,45,0,0,0,0};
int i = foo.length - 1;
// The i >= 0 check keeps the loop in bounds when the array is all zeros
while (i >= 0 && foo[i] == 0)
{
    i--;
}
byte[] bar = Arrays.copyOf(foo, i+1);
String myString = new String(bar, "UTF-8");
System.out.println(myString.length());
Will give you a result of 4.

Strings in Java aren't ended with a 0, like in some other languages. 0 will get turned into the so-called null character, which is allowed to appear in a String. I suggest you use some trimming scheme that either detects the first index of the array that's a 0 and uses a sub-array to construct the String (assuming all the rest will be 0 after that), or just construct the String and call trim(). That'll remove leading and trailing whitespace, which is any character with ASCII code 32 or lower.
The latter won't work if you have leading whitespace you must preserve. Using a StringBuilder and deleting characters at the end as long as they're the null character would work better in that case.
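The StringBuilder variant could look roughly like this. The TrailingNulTrimmer name is made up, and the sketch assumes an encoding in which a zero byte decodes to the null character (true for ASCII-compatible encodings such as UTF-8 or ISO-8859-1).

```java
import java.nio.charset.Charset;

final class TrailingNulTrimmer {
    /** Decodes the bytes, then removes only trailing null characters, keeping all whitespace. */
    static String decode(byte[] data, Charset charset) {
        StringBuilder text = new StringBuilder(new String(data, charset));
        while (text.length() > 0 && text.charAt(text.length() - 1) == '\0') {
            text.setLength(text.length() - 1);
        }
        return text.toString();
    }
}
```

Unlike trim(), this leaves leading and trailing spaces untouched and never removes anything from the front of the string.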

It appears to me that you are ignoring the read count returned by the read() method. The trailing null bytes probably weren't sent; they are probably left over from the initial state of the buffer.
int count = in.read(buffer);
if (count < 0) {
    // EOS: close the socket etc
} else {
    String s = new String(buffer, 0, count);
}

Without diving into the protocol considerations that the OP mentioned, how about this for removing the zeroes? Note that it drops every zero byte, not only the trailing ones, and it treats each byte as one character.
public static String bytesToString(byte[] data) {
    String dataOut = "";
    for (int i = 0; i < data.length; i++) {
        if (data[i] != 0x00)
            dataOut += (char) data[i];
    }
    return dataOut;
}

Reading characters from a file written with .net

I'm trying to use Java to read a string from a file that was written with a .NET BinaryWriter.
I think the problem is that the .NET BinaryWriter uses some 7-bit format for its strings. While researching online, I came across this code that is supposed to function like the BinaryReader's ReadString() method. This is in my CSDataInputStream class that extends DataInputStream.
public String readStringCS() throws IOException {
    int stringLength = 0;
    boolean stringLengthParsed = false;
    int step = 0;
    while (!stringLengthParsed) {
        byte part = readByte();
        stringLengthParsed = (((int) part >> 7) == 0);
        int partCutter = part & 127;
        part = (byte) partCutter;
        int toAdd = (int) part << (step * 7);
        stringLength += toAdd;
        step++;
    }
    char[] chars = new char[stringLength];
    for (int i = 0; i < stringLength; i++) {
        chars[i] = readChar();
    }
    return new String(chars);
}
The first part seems to be working as it is returning the correct amount of characters (7). But when it reads the characters they are all Chinese! I'm pretty sure the problem is with DataInputStream.readChar() but I have no idea why it isn't working... I have even tried using
Character.reverseBytes(readChar());
to read the char to see if that would work, but it would just return different Chinese characters.
Maybe I need to emulate .net's way of reading chars? How would I go about doing that?
Is there something else I'm missing?
Thanks.

Okay, so you've parsed the length correctly by the sounds of it - but you're then treating it as the length in characters. As far as I can tell from the documentation it's the length in bytes.
So you should read the data into a byte[] of the right length, and then use:
return new String(bytes, encoding);
where encoding is the appropriate encoding based on whatever was written from .NET... it will default to UTF-8, but it can be specified as something else.
As an aside, I personally wouldn't extend DataInputStream - I would compose it instead, i.e. make your type or method take a DataInputStream (or perhaps just take InputStream and wrap that in a DataInputStream). In general, if you favour composition over inheritance it can make code clearer and easier to maintain, in my experience.
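Putting both suggestions together, a composed reader might be sketched like this. The DotNetStringReader class is hypothetical; it assumes the prefix is a 7-bit encoded byte count (7 payload bits per byte, high bit set when more bytes follow) and leaves the charset to the caller, with UTF-8 being the likely choice if the .NET side used the default.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.Charset;

final class DotNetStringReader {
    /** Reads the variable-length prefix, least significant 7 bits first. */
    static int read7BitEncodedInt(DataInputStream in) throws IOException {
        int value = 0;
        int shift = 0;
        while (true) {
            int b = in.readUnsignedByte();
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
            if (shift > 28) { // a 32-bit value needs at most five prefix bytes
                throw new IOException("Malformed 7-bit encoded length");
            }
        }
    }

    /** Treats the prefix as a length in bytes, then decodes those bytes in one go. */
    static String readString(DataInputStream in, Charset charset) throws IOException {
        int byteLength = read7BitEncodedInt(in);
        if (byteLength < 0) {
            throw new IOException("Negative string length: " + byteLength);
        }
        byte[] bytes = new byte[byteLength];
        in.readFully(bytes);
        return new String(bytes, charset);
    }
}
```

Taking the DataInputStream as a parameter rather than extending it keeps the helper usable with any stream the caller already has. Calling readChar() instead would consume two bytes per character as UTF-16, which would explain the unexpected Chinese characters.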
