Read lines of characters and get file position - java

I'm reading sequential lines of characters from a text file. The encoding of the characters in the file might not be single-byte.
At certain points, I'd like to get the file position at which the next line starts, so that I can re-open the file later and return to that position quickly.
Questions
Is there an easy way to do both, preferably using standard Java libraries?
If not, what is a reasonable workaround?
Attributes of an ideal solution
An ideal solution would handle multiple character encodings. This includes UTF-8, in which different characters may be represented by different numbers of bytes.
An ideal solution would rely mostly on a trusted, well-supported library. Most ideal would be the standard Java library; second best would be an Apache or Google library.
The solution must be scalable. Reading the entire file into memory is not a solution, and returning to a position should not require reading all prior characters in linear time.
Details
For the first requirement, BufferedReader.readLine() is attractive. But buffering clearly interferes with getting a meaningful file position.
Less obviously, InputStreamReader can also read ahead, interfering with getting the file position. From the InputStreamReader documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
The method RandomAccessFile.readLine() reads a single byte per character. From its documentation:
Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.

If you construct a BufferedReader from an InputStreamReader wrapping a FileInputStream, and keep the FileInputStream accessible to your code, you should be able to get the position of the next line by calling:
fileInputStream.getChannel().position();
after a call to bufferedReader.readLine(). (Note that FileReader does not expose its channel, so you need the FileInputStream itself.)
The BufferedReader could be constructed with an input buffer of size 1 if you're willing to trade performance gains for positional precision.
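A minimal self-contained sketch of that idea (it uses an explicit FileInputStream so the channel is accessible; the temp file and its contents are illustrative). Note that the decoder inside InputStreamReader may still read ahead, so the reported position can overshoot the true start of the next line:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class NextLinePosition {
    public static void main(String[] args) throws IOException {
        // Illustrative temp file; any text file works.
        Path tmp = Files.createTempFile("pos", ".txt");
        Files.write(tmp, "first\nsecond\n".getBytes(StandardCharsets.UTF_8));

        try (FileInputStream fin = new FileInputStream(tmp.toFile());
             BufferedReader br = new BufferedReader(
                     new InputStreamReader(fin, StandardCharsets.UTF_8), 1)) {
            String line = br.readLine();
            // The decoder inside InputStreamReader may still read ahead,
            // so this can overshoot the true start of the next line.
            long pos = fin.getChannel().position();
            System.out.println(line + " | channel position: " + pos);
        }
        Files.delete(tmp);
    }
}
```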
Alternate Solution
What would be wrong with keeping track of the bytes yourself:
long startingPoint = 0; // or the saved position if this file was previously processed
String line;
while ((line = bufferedReader.readLine()) != null) {
    startingPoint += line.getBytes(StandardCharsets.UTF_8).length;
    startingPoint += 1; // the stripped line terminator: 1 for \n, 2 for \r\n
}
This would give you a byte count accurate to what you've already processed, regardless of any underlying marking or buffering. You'd have to account for line terminators in your tally, since readLine() strips them, and you'd have to encode with the file's actual charset rather than the platform default.

This partial workaround addresses only files encoded with 7-bit ASCII or UTF-8. An answer with a general solution is still desirable (as is criticism of this workaround).
In UTF-8:
All single-byte characters can be distinguished from all bytes in multi-byte characters. All the bytes in a multi-byte character have a '1' in the high-order position. In particular, the bytes representing LF and CR cannot be part of a multi-byte character.
All single-byte characters are in 7-bit ASCII. So we can decode a file containing only 7-bit ASCII characters with a UTF-8 decoder.
Taken together, those two points mean we can read a line with something that reads bytes, rather than characters, then decode the line.
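As a quick sanity check of those two points (a sketch; the sample characters are arbitrary): every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte equal to LF (0x0A) or CR (0x0D) can only ever be a real line terminator.

```java
import java.nio.charset.StandardCharsets;

public class Utf8HighBit {
    public static void main(String[] args) {
        // "\u00e9" ("é") encodes to 2 bytes, "\u6f22" ("漢") to 3;
        // all 5 bytes have bit 7 set.
        for (byte b : "\u00e9\u6f22".getBytes(StandardCharsets.UTF_8)) {
            System.out.println((b & 0x80) != 0);
        }
    }
}
```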
To avoid problems with buffering, we can use RandomAccessFile. That class provides methods to read a line, and get/set the file position.
Here's a sketch of code to read the next line as UTF-8 using RandomAccessFile.
protected static String readNextLineAsUTF8( RandomAccessFile in ) throws IOException {
    String rv = null;
    String lineBytes = in.readLine(); // one char per raw byte
    if ( null != lineBytes ) {
        // ISO-8859-1 maps each char back to its original byte value;
        // the platform default charset would corrupt non-ASCII bytes.
        rv = new String( lineBytes.getBytes( StandardCharsets.ISO_8859_1 ),
                         StandardCharsets.UTF_8 );
    }
    return rv;
}
Then the file position can be obtained from the RandomAccessFile immediately before calling that method. Given a RandomAccessFile referenced by in:
long startPos = in.getFilePointer();
String line = readNextLineAsUTF8( in );

The case seems to be solved by VTD-XML, a library able to quickly parse big XML files:
The latest Java VTD-XML (XimpleWare) implementation, currently 2.13 (http://sourceforge.net/projects/vtd-xml/files/vtd-xml/), provides code maintaining a byte offset after each call to the getChar() method of its IReader implementations.
IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java.
IReader implementations are provided for the following encodings
ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258

I would suggest java.io.LineNumberReader. You can set and get the line number and therefore continue at a certain line index.
Since it extends BufferedReader, it works with any underlying Reader, including one that decodes UTF-8. Note, though, that setLineNumber(int) only resets the counter; it does not reposition the stream, so returning to a line still means reading through all the lines before it.

Solution A
Use RandomAccessFile.readChar() or RandomAccessFile.readByte() in a loop.
Check for your EOL characters, then process that line.
The problem with anything else is that you would have to absolutely make sure you never read past the EOL character.
readChar() returns a char, not a byte, so you do not have to worry about character width. From the documentation:
Reads a character from this file. This method reads two bytes from the file, starting at the current file pointer.
[...]
This method blocks until the two bytes are read, the end of the stream is detected, or an exception is thrown.
By using a RandomAccessFile and not a Reader, you are giving up Java's ability to decode the charset in the file for you. A BufferedReader would do so automatically.
There are several ways of overcoming this. One is to detect the encoding yourself and then use the correct read*() method. The other way would be to use a BoundedInputStream.
There is one in this question: Java: reading strings from a random access file with buffered input
E.g. https://stackoverflow.com/a/4305478/16549

RandomAccessFile has a function:
seek(long pos)
Sets the file-pointer offset, measured from the beginning of this file, at which the next read or write occurs.
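A hedged sketch of combining seek() with getFilePointer() to save a position and resume later (the temp file and its contents are illustrative; this relies on readLine()'s byte-per-char behavior, so it is only safe when line content is ASCII):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SeekResume {
    public static void main(String[] args) throws IOException {
        // Illustrative temp file with three ASCII lines.
        Path tmp = Files.createTempFile("seek", ".txt");
        Files.write(tmp, "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.US_ASCII));

        long resumeAt;
        try (RandomAccessFile in = new RandomAccessFile(tmp.toFile(), "r")) {
            in.readLine();                  // consume "alpha"
            resumeAt = in.getFilePointer(); // byte offset where "beta" starts
        }
        // Later -- possibly in another process -- jump straight back.
        try (RandomAccessFile in = new RandomAccessFile(tmp.toFile(), "r")) {
            in.seek(resumeAt);
            System.out.println(in.readLine()); // prints "beta"
        }
        Files.delete(tmp);
    }
}
```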

Initially, I found the approach suggested by Andy Thomas (https://stackoverflow.com/a/30850145/556460) the most appropriate.
But unfortunately I couldn't succeed in converting the byte array (taken from RandomAccessFile.readLine) to a correct string in cases where the file line contains non-Latin characters.
So I reworked the approach by writing a function similar to RandomAccessFile.readLine itself that collects the line's data directly into a byte array rather than a String, and then constructs the desired String from that byte array.
The following code (in Kotlin) completely satisfied my needs.
After calling the function, file.channel.position() will return the exact position of the next line (if any):
fun RandomAccessFile.readEncodedLine(charset: Charset = Charsets.UTF_8): String? {
    val lineBytes = ByteArrayOutputStream()
    var c = -1
    var eol = false
    while (!eol) {
        c = read()
        when (c) {
            -1, 10 -> eol = true // \n or end of file
            13 -> { // \r: swallow an immediately following \n, if any
                eol = true
                val cur = filePointer
                if (read() != '\n'.code) {
                    seek(cur)
                }
            }
            else -> lineBytes.write(c)
        }
    }
    return if (c == -1 && lineBytes.size() == 0)
        null
    else
        String(lineBytes.toByteArray(), charset)
}

Related

Write to a file with a specific encoding in Java

This might be related to my previous question (on how to convert "för" to "för")
So I have a file that I create in my code. Right now I create it by the following code:
FileWriter fwOne = new FileWriter(wordIndexPath);
BufferedWriter wordIndex = new BufferedWriter(fwOne);
followed by a few
wordIndex.write(wordBuilder.toString()); //that's a StringBuilder
ending (after a while-loop) with a
wordIndex.close();
Now the problem is later on this file is huge and I want (need) to jump in it without going through the entire file. The seek(long pos) method of RandomAccessFile lets me do this.
Here's my problem: The characters in the file I've created seem to be encoded with UTF-8, and the only info I have when I seek is the character position I want to jump to. seek(long pos), on the other hand, jumps in bytes, so I don't end up in the right place, since a UTF-8 character can be more than one byte.
Here's my question: Can I, when I write the file, write it in ISO-8859-15 instead (where a character is a byte)? That way the seek(long pos) will get me in the right position. Or should I instead try to use an alternative to RandomAccessFile (is there an alternative where you can jump to a character-position?)
Now first the worrisome: FileWriter and FileReader are old utility classes that use the platform's default encoding. Run elsewhere, that code will produce a different file, and may be unable to read a file written on another machine.
ISO-8859-15 is a single-byte encoding. But Java holds text in Unicode, so it can combine all scripts, and char is UTF-16. In general a char index will not be a byte index, but in your case it probably works. Note, though, that a line break might be one char/byte (\n) or two (\r\n), platform-dependently.
Personally I think UTF-8 is well established, and it is easier to use:
byte[] bytes = string.getBytes(StandardCharsets.UTF_8);
string = new String(bytes, StandardCharsets.UTF_8);
That way all special quotes, euro, and so on will always be available.
At least specify the encoding:
Files.newBufferedWriter(file.toPath(), Charset.forName("ISO-8859-15"));

Reading all content of a Java BufferedReader including the line termination characters

I'm writing a TCP client that receives some binary data and sends it to a device. The problem arises when I use BufferedReader to read what it has received.
I'm extremely puzzled by finding out that there is no method available to read all the data. The readLine() method that everybody is using, detects both \n and \r characters as line termination characters, so I can't get the data and concat the lines, because I don't know which char was the line terminator. I also can't use read(buf, offset, num), because it doesn't return the number of bytes it has read. If I read it byte by byte using read() method, it would become terribly slow. Please someone tell me what is the solution, this API seems quite stupid to me!
Well, first of all thanks to everyone. I think the main problem was because I had read tutorialspoint instead of Java documentation. But pardon me for it, as I live in Iran, and Oracle doesn't let us access the documentation for whatever reason it is. Thanks anyway for the patient and helpful responses.
This is more than likely an XY problem.
The beginning of your question reads:
I'm writing a TCP client that receives some binary data and sends it to a device. The problem arises when I use BufferedReader to read what it has received.
This is binary data; do not use a Reader to start with! A Reader wraps an InputStream using a Charset and yields a stream of chars, not bytes. See, among other sources, here for more details.
Next:
I'm extremely puzzled by finding out that there is no method available to read all the data
With reason. There is no telling how large the data may be, and as a result such a method would be fraught with problems if the data you receive is too large.
So, now that using a Reader is out of the way, what you really need to do is this:
read some binary data from a Socket;
copy this data to another source.
The solutions to do that are many; here is one solution which requires nothing but the standard JDK (7+):
final byte[] buf = new byte[8192]; // or other
try (
final InputStream in = theSocket.getInputStream();
final OutputStream out = whatever();
) {
int nrBytes;
while ((nrBytes = in.read(buf)) != -1)
out.write(buf, 0, nrBytes);
}
Wrap this code in a method or whatever etc.
I'm extremely puzzled by finding out that there is no method available to read all the data.
There are three.
The readLine() method that everybody is using, detects both \n and \r characters as line termination characters, so I can't get the data and concat the lines, because I don't know which char was the line terminator.
Correct. It is documented to suppress the line terminator.
I also can't use read(buf, offset, num), because it doesn't return the number of bytes it has read.
It returns the number of chars read.
If I read it byte by byte using read() method, it would become terribly slow.
That reads it char by char, not byte by byte, but you're wrong about the performance. It's buffered.
Please someone tell me what is the solution
You shouldn't be using a Reader for binary data in the first place. I can only suggest you re-read the Javadoc for:
BufferedInputStream.read() throws IOException;
BufferedInputStream.read(byte[]) throws IOException;
BufferedInputStream.read(byte[], int, int) throws IOException;
The last two both return the number of bytes read, or -1 at end of stream.
this API seems quite stupid to me!
No comment.
In the first place, everyone who reads data has to plan for \n, \r, and \r\n as possible sequences, except when parsing HTTP headers, which must be separated with \r\n. You could easily read line by line and output whatever line separator you like.
Secondly, the read method returns the number of characters it has read into a char[], so it works exactly correctly if you want to read a chunk of chars and do your own line parsing and outputting.
The best thing I can recommend is that you use BufferedReader.read() and iterate over every character yourself, handling \n, \r, and \r\n explicitly. Something like this:
String filename = ...;
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(filename), StandardCharsets.UTF_8))) {
    StringBuilder line = new StringBuilder();
    int c;
    while ((c = br.read()) != -1) {
        if (c == '\n') {
            // line is complete, terminated by \n -- handle it here
            line.setLength(0);
        } else if (c == '\r') {
            // line is complete, terminated by \r; peek for a following \n
            br.mark(1);
            if (br.read() != '\n') {
                br.reset(); // lone \r -- push the peeked character back
            }
            // handle the line here, knowing the terminator was \r or \r\n
            line.setLength(0);
        } else {
            line.append((char) c);
        }
    }
    if (line.length() > 0) {
        // handle a final line with no terminator
    }
}
Previously answered by arrdem (https://stackoverflow.com/users/615234/arrdem).

Best delimiter to safely parse byte arrays from a stream

I have a byte stream that returns a sequence of byte arrays, each of which represents a single record.
I would like to parse the stream into a list of individual byte[]s. Currently, I have hacked in a three-byte delimiter so that I can identify the end of each record, but I have concerns.
I see that there is a standard Ascii record separator character.
30 036 1E 00011110 RS  Record Separator
Is it safe to use a byte[] derived from this character a delimiter if the byte arrays (which were UTF-8 encoded) have been compressed and/or encrypted? My concern is that the encryption/compression output might produce the record separator for some other purpose. Please note the individual byte[] records are compressed/encrypted, rather than the entire stream.
I am working in Java 8 and using Snappy for compression. I haven't picked an encryption library yet, but it would certainly be one of the stronger, standard, private key approaches.
You can't simply declare a byte as a delimiter if you're working with random unstructured data (which compressed/encrypted data resembles quite closely), because the delimiter can always appear as a regular data byte in such data.
If the size of the data is already known when you start writing, just write the size first and then the data. When reading back, you then know you need to read the size first (e.g. 4 bytes for an int), and then as many bytes as the size indicates.
This will obviously not work if you can't tell the size while writing. In that case, you can use an escaping mechanism: select a rarely appearing byte as the escape character, escape all occurrences of that byte in the data, and use a different byte as the end indicator.
e.g.
final static byte ESCAPE = (byte) 0xBC;
final static byte EOF = (byte) 0x00;

OutputStream out = ...;
for (byte b : source) {
    if (b == ESCAPE) {
        // escape data bytes that have the value of ESCAPE
        out.write(ESCAPE);
        out.write(ESCAPE);
    } else {
        out.write(b);
    }
}
// write EOF marker: ESCAPE, EOF
out.write(ESCAPE);
out.write(EOF);
Now when reading, if you encounter the ESCAPE byte, you read the next byte and check for EOF. If it's not EOF, it's an escaped ESCAPE that represents a data byte.
InputStream in = ...;
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
int b;
while ((b = in.read()) != -1) {
    if (b == ESCAPE) {
        b = in.read();
        if (b == EOF)
            break;
        buffer.write(b);
    } else {
        buffer.write(b);
    }
}
If the bytes to be written are perfectly randomly distributed, this will increase the stream length by 1/256. For data domains that are not completely random, you can select the byte that appears least frequently (by static data analysis or just an educated guess).
Edit: you can reduce the escaping overhead with more elaborate logic. E.g. the example can only ever produce ESCAPE + ESCAPE or ESCAPE + EOF; the other 254 bytes can never follow an ESCAPE, so those combinations could be exploited to encode legal data.
It is completely unsafe, you never know what might turn up in your data. Perhaps you should consider something like protobuf, or a scheme like 'first write the record length, then write the record, then rinse, lather, repeat'?
If you have a length, you don't need a delimiter. Your reading side reads the length, then knows how much to read for the first record, and then knows to read the next length -- all assuming that the lengths themselves are fixed-length.
See the developers' suggestions for streaming a sequence of protobufs.
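The length-then-record scheme can be sketched with DataOutputStream/DataInputStream (a minimal sketch; the record contents here are arbitrary, and a real implementation would cap the length before allocating):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class LengthPrefixed {
    // Write each record as a 4-byte big-endian length followed by the payload.
    static void writeRecord(DataOutputStream out, byte[] record) throws IOException {
        out.writeInt(record.length);
        out.write(record);
    }

    static byte[] readRecord(DataInputStream in) throws IOException {
        int len = in.readInt();     // throws EOFException at end of stream
        byte[] record = new byte[len];
        in.readFully(record);       // reads exactly len bytes
        return record;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writeRecord(out, new byte[]{1, 2, 3});
            writeRecord(out, new byte[]{42});
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            System.out.println(readRecord(in).length); // 3
            System.out.println(readRecord(in).length); // 1
        }
    }
}
```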

Reading chars from a stream of ByteArrays where boundary alignment may be imperfect

I'm working with asynchronous IO on the JVM, wherein I'm occasionally handed a byte array from an incoming socket. Concatenated, these byte arrays form a stream which my overall goal is to split into strings by instance of a given character, be it newline, NUL, or something more esoteric.
I do not have any guarantee that the boundaries of these consecutive byte arrays do not fall partway through a multi-byte character.
Reading through the documentation for java.nio.CharBuffer, I don't see any explicit semantics given as to how trailing partial multibyte characters are handled.
Given a series of ByteBuffers, what's the best way to get (complete) characters out of them, understanding that a character may span the gap between two sequential ByteBuffers?
Use a CharsetDecoder:
final Charset charset = ...
final CharsetDecoder decoder = charset.newDecoder()
.onUnmappableCharacter(CodingErrorAction.REPORT)
.onMalformedInput(CodingErrorAction.REPORT);
I do have this problem in one of my projects, and here is how I deal with it.
Note line 258: if the result is a malformed input sequence then it may be an incomplete read; in that case, I set the last good offset to the last decoded byte, and start again from that offset.
If, on the next read, I fail to read again and the byte offset is the same, then this is a permanent failure (line 215).
Your case is a little different however since you cannot "backtrack"; you'd need to fill a new ByteBuffer with the rest of the previous buffer and the new one and start from there (allocate for oldBuf.remaining() + bufsize and .put() from oldBuf into the new buffer). In my case, my backend is a file, so I can .map() from wherever I want.
So, basically:
if you have an unmappable character, this is a permanent failure (your encoding just cannot handle your byte sequence);
if you have read the full byte sequence successfully, your CharBuffer will have buf.position() characters in it;
if you have a malformed input, it may mean that you have an incomplete byte sequence (for instance, using UTF-8, you have one byte out of a three byte sequence), but you need to confirm that with the next iteration.
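The incomplete-sequence case can be demonstrated with a small self-contained sketch (the split point and sample characters are arbitrary): feed two chunks through one CharsetDecoder, passing endOfInput=false for all but the last chunk, and carry the unconsumed tail bytes forward into the next buffer.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeAcrossChunks {
    public static void main(String[] args) throws Exception {
        // "\u00e9" ("é") is 0xC3 0xA9 in UTF-8; split it across two chunks.
        byte[] chunk1 = { 'a', (byte) 0xC3 };
        byte[] chunk2 = { (byte) 0xA9, 'b' };

        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = CharBuffer.allocate(16);

        ByteBuffer first = ByteBuffer.wrap(chunk1);
        // endOfInput=false: the dangling lead byte is left unconsumed
        // (UNDERFLOW), not reported as malformed input.
        decoder.decode(first, out, false);

        // Carry the unconsumed tail into a new buffer along with the next
        // chunk: allocate remaining() + the next chunk's size, as above.
        ByteBuffer carry = ByteBuffer.allocate(first.remaining() + chunk2.length);
        carry.put(first).put(chunk2).flip();
        decoder.decode(carry, out, true); // endOfInput=true on the last chunk
        decoder.flush(out);

        out.flip();
        System.out.println(out); // prints "aéb"
    }
}
```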
Feel free to salvage any code you deem necessary! It's free ;)
FINAL NOTE, since I believe this is important: String's .getBytes(*) methods and constructors from byte arrays have a default CodingErrorAction of REPLACE!

XMLStreamReader: get character offset : XML from file

The XMLStreamReader->Location has a method called getCharacterOffset().
Unfortunately, the Javadocs indicate this is an ambiguously named method: it can also return a byte offset (and this appears to be true in practice); unhelpfully, this seems to occur when reading from files (for instance):
The Javadoc states :
Return the byte or character offset into the input source this
location is pointing to. If the input source is a file or a byte
stream then this is the byte offset into that stream, but if the input
source is a character media then the offset is the character offset. (emphasis added)
I really need the character offset; and I'm pretty sure I'm being given the byte offset instead.
The (UTF-8 encoded) XML is contained in a (partially corrupt 1G) file. [Hence the need to use a lower-level API which doesn't complain about the lack of well-formedness until it really has no choice but to].
Question
What does the Javadoc mean when it says '...input source is a character media...' : how can I force it to think of my input file as 'character media' - so that I get an accurate (Character) offset rather than a byte offset?
Extra blah blah:
[I'm pretty sure this is what is going on: when I strip the file apart (using certain known high-level tags), I get a few characters missing or extra, in a non-accumulating way. I'm putting the difference down to a few multi-byte characters throwing off the counter. Also, when I copy (using 'head'/'tail', for instance, in PowerShell), the tool appears to correctly recognize (or assume) UTF-8 and does a good conversion to UTF-16 as far as I can see.]
The offset is in units of the underlying Source.
The XMLStreamReader only knows how many units it has read from the Source so the offset is calculated in those units.
A Stream works in units of byte and therefore you end up with a byte offset.
A Reader works in units of char and therefore you end up with an offset in char.
The docs for StreamSource are more explicit about what the term "character media" means.
Maybe try something like
final Source source = new StreamSource(new InputStreamReader(new FileInputStream(new File("my.xml")), "UTF-8"));
final XMLStreamReader xmlReader = XMLInputFactory.newFactory().createXMLStreamReader(source);
XMLInputFactory.createXMLStreamReader(java.io.InputStream) is a byte stream
XMLInputFactory.createXMLStreamReader(java.io.Reader) is a character stream
