Reading all content of a Java BufferedReader including the line termination characters - java

I'm writing a TCP client that receives some binary data and sends it to a device. The problem arises when I use BufferedReader to read what it has received.
I'm extremely puzzled by finding out that there is no method available to read all the data. The readLine() method that everybody is using, detects both \n and \r characters as line termination characters, so I can't get the data and concat the lines, because I don't know which char was the line terminator. I also can't use read(buf, offset, num), because it doesn't return the number of bytes it has read. If I read it byte by byte using read() method, it would become terribly slow. Please someone tell me what is the solution, this API seems quite stupid to me!
Well, first of all, thanks to everyone. I think the main problem was that I had read tutorialspoint instead of the Java documentation. But pardon me for that; I live in Iran, and Oracle doesn't let us access the documentation, for whatever reason. Thanks anyway for the patient and helpful responses.

This is more than likely an XY problem.
The beginning of your question reads:
I'm writing a TCP client that receives some binary data and sends it to a device. The problem arises when I use BufferedReader to read what it has received.
This is binary data; do not use a Reader to start with! A Reader wraps an InputStream using a Charset and yields a stream of chars, not bytes. See, among other sources, here for more details.
Next:
I'm extremely puzzled by finding out that there is no method available to read all the data
With reason. There is no telling how large the data may be, and as a result such a method would be fraught with problems if the data you receive is too large.
So, now that using a Reader is out of the way, what you really need to do is this:
read some binary data from a Socket;
copy this data to another source.
The solutions to do that are many; here is one solution which requires nothing but the standard JDK (7+):
final byte[] buf = new byte[8192]; // or other
try (
    final InputStream in = theSocket.getInputStream();
    final OutputStream out = whatever();
) {
    int nrBytes;
    while ((nrBytes = in.read(buf)) != -1)
        out.write(buf, 0, nrBytes);
}
Wrap this code in a method or whatever etc.
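For instance, a minimal sketch of such a wrapper (the method name copyStream is made up):
public static void copyStream(final InputStream in, final OutputStream out)
        throws IOException {
    final byte[] buf = new byte[8192];
    int nrBytes;
    while ((nrBytes = in.read(buf)) != -1) {
        out.write(buf, 0, nrBytes); // write exactly the bytes read this pass
    }
}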

I'm extremely puzzled by finding out that there is no method available to read all the data.
There are three.
The readLine() method that everybody is using, detects both \n and \r characters as line termination characters, so I can't get the data and concat the lines, because I don't know which char was the line terminator.
Correct. It is documented to suppress the line terminator.
I also can't use read(buf, offset, num), because it doesn't return the number of bytes it has read.
It returns the number of chars read.
If I read it byte by byte using read() method, it would become terribly slow.
That reads it char by char, not byte by byte, but you're wrong about the performance. It's buffered.
Please someone tell me what is the solution
You shouldn't be using a Reader for binary data in the first place. I can only suggest you re-read the Javadoc for:
BufferedInputStream.read() throws IOException;
BufferedInputStream.read(byte[]) throws IOException;
BufferedInputStream.read(byte[], int, int) throws IOException;
The last two both return the number of bytes read, or -1 at end of stream.
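For illustration, a sketch of the usual pattern with the array overload, reusing theSocket from the earlier answer; ByteArrayOutputStream here is just one possible sink:
try (BufferedInputStream in = new BufferedInputStream(theSocket.getInputStream());
     ByteArrayOutputStream collected = new ByteArrayOutputStream()) {
    byte[] chunk = new byte[4096];
    int n;
    while ((n = in.read(chunk, 0, chunk.length)) != -1) { // n = bytes actually read
        collected.write(chunk, 0, n);
    }
    byte[] allData = collected.toByteArray(); // every byte, terminators included
}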
this API seems quite stupid to me!
No comment.

In the first place, everyone who reads data has to plan for \n, \r and \r\n as possible line-terminator sequences, except when parsing HTTP headers, which must be separated with \r\n. You could easily read line by line and output whatever line separator you like.
Secondly, the read method returns the number of characters it has read into the char[], so it works exactly as required if you want to read a chunk of chars and do your own line parsing and output.
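As a sketch of that, assuming an already-opened Reader named reader:
char[] buf = new char[8192];
StringBuilder all = new StringBuilder();
int n;
while ((n = reader.read(buf, 0, buf.length)) != -1) {
    all.append(buf, 0, n); // n chars were read; \n, \r and \r\n survive intact
}
// 'all' now contains the full text, line terminators included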

The best thing I can recommend is that you use BufferedReader.read() and iterate over every character in the file. Something like this:
String filename = ...
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(filename), StandardCharsets.UTF_8))) {
    StringBuilder line = new StringBuilder();
    int c;
    while ((c = br.read()) != -1) {
        if (c == '\n') {
            // do stuff with line; you know the terminator was \n
            line.setLength(0); // start the next endl-free line
        } else if (c == '\r') {
            br.mark(1); // peek one char ahead to tell \r apart from \r\n
            if (br.read() == '\n') {
                // do extra stuff since you know that you've got a \r\n
            } else {
                br.reset(); // lone \r terminator; give the char back
            }
            // do stuff with line, then start the next endl-free line
            line.setLength(0);
        } else {
            line.append((char) c);
        }
    }
    // handle a final line that ends at EOF without a terminator, if any
}
previously answered by https://stackoverflow.com/users/615234/arrdem

Related

A code I don't understand about the write method in I/O

I just started learning IO and there's something about this code I don't understand:
public static void main(String[] args) throws IOException {
    FileReader reader = new FileReader("C://blablablal.txt");
    FileWriter writer = new FileWriter("C://blabla.txt");
    int c;
    while ((c = reader.read()) != -1) {
        writer.write(c);
    }
    reader.close();
    writer.close();
}
I'll be happy for an explanation of how the write method, which writes "c" (an int) in the while loop, actually writes it as a character or a string in the txt file.
Thanks
Firstly, the FileReader will be ready to read the content of the file at the given file path, provided the path is correct.
Secondly, the FileWriter will be ready to write something to its file whenever it has something to write.
Next, the reader reads the contents of the first file until the end of the file (-1) and writes them to the other file using the writer.
At last, the reader and writer are closed.
Comment here if you have any doubts.
The write(int c) method casts the int (32 bits) to a char (16 bits). The char is then converted to appropriate bytes depending on the encoding. In this case the encoding used is the platform default encoding, but you should always specify the encoding used to make sure that it will work properly in any environment.
The reason it doesn't take char c as a parameter is apparently a design oversight which is too late to correct now.
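For illustration, here is the same copy loop with the encodings made explicit (paths taken from the question; UTF-8 is an assumption):
try (Reader reader = new InputStreamReader(
         new FileInputStream("C://blablablal.txt"), StandardCharsets.UTF_8);
     Writer writer = new OutputStreamWriter(
         new FileOutputStream("C://blabla.txt"), StandardCharsets.UTF_8)) {
    int c;
    while ((c = reader.read()) != -1) {
        writer.write(c); // the int is narrowed to a char, then encoded as UTF-8
    }
}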
reader.read() is reading one character from the file at a time. This seems a little confusing because you are reading a character in as an integer value, but all this really means is that the integer value equates to a character. When there are no more characters to read in the file, a -1 is returned, and this is how you know you've reached the end of the file.
writer.write behaves the same way. It writes the integer value of a character to a file.
Here is a link to a good tutorial on Java IO FileReader.
And here is one for Java IO FileWriter.

Best delimiter to safely parse byte arrays from a stream

I have a byte stream that returns a sequence of byte arrays, each of which represents a single record.
I would like to parse the stream into a list of individual byte[]s. Currently, I have hacked in a three-byte delimiter so that I can identify the end of each record, but I have concerns.
I see that there is a standard Ascii record separator character.
Dec 30, Oct 036, Hex 1E, Bin 00011110: RS (Record Separator)
Is it safe to use a byte[] derived from this character as a delimiter if the byte arrays (which were UTF-8 encoded) have been compressed and/or encrypted? My concern is that the encryption/compression output might produce the record separator for some other purpose. Please note the individual byte[] records are compressed/encrypted, rather than the entire stream.
I am working in Java 8 and using Snappy for compression. I haven't picked an encryption library yet, but it would certainly be one of the stronger, standard, private key approaches.
You can't simply declare a byte as a delimiter if you're working with random unstructured data (which compressed/encrypted data resembles quite closely), because the delimiter can always appear as a regular data byte in such data.
If the size of the data is already known when you start writing, just write the size first and then the data. When reading back, you then know you need to read the size first (e.g. 4 bytes for an int), and then as many bytes as the size indicates.
This will obviously not work if you can't tell the size while writing. In that case, you can use an escaping mechanism: select a rarely appearing byte as the escape character, escape all occurrences of that byte in the data, and use a different byte as the end indicator.
e.g.
final static byte ESCAPE = (byte) 0xBC;
final static byte EOF = (byte) 0x00;

OutputStream out = ...
for (byte b : source) {
    if (b == ESCAPE) {
        // escape data bytes that have the value of ESCAPE
        out.write(ESCAPE);
        out.write(ESCAPE);
    } else {
        out.write(b);
    }
}
// write EOF marker ESCAPE, EOF
out.write(ESCAPE);
out.write(EOF);
Now when reading, whenever you read the ESCAPE byte, you read the next byte and check for EOF. If it's not EOF, it's an escaped ESCAPE that represents a data byte.
InputStream in = ...
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
int b;
while ((b = in.read()) != -1) {
    // read() returns an int in 0..255, so cast before comparing
    // with the signed byte constants
    if ((byte) b == ESCAPE) {
        b = in.read();
        if ((byte) b == EOF)
            break;
        buffer.write(b);
    } else {
        buffer.write(b);
    }
}
If the bytes to be written are perfectly randomly distributed, this will increase the stream length by 1/256. For data domains that are not completely random, you can select the byte that appears least frequently (determined by statistical analysis of the data, or just an educated guess).
Edit: you can reduce the escaping overhead by using more elaborate logic, e.g. the example can only create ESCAPE + ESCAPE or ESCAPE + EOF. The other 254 bytes can never follow an ESCAPE in the example, so that could be exploited to store legal data combinations.
It is completely unsafe; you never know what might turn up in your data. Perhaps you should consider something like protobuf, or a scheme like 'first write the record length, then write the record, then rinse, lather, repeat'?
If you have a length, you don't need a delimiter. Your reading side reads the length, then knows how much to read for the first record, and then knows to read the next length -- all assuming that the lengths themselves are fixed-length.
See the developers' suggestions for streaming a sequence of protobufs.
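A minimal sketch of that length-prefix framing with plain JDK streams (the method names are illustrative):
// Writing: 4-byte big-endian length, then the record bytes.
static void writeRecord(DataOutputStream out, byte[] record) throws IOException {
    out.writeInt(record.length);
    out.write(record);
}

// Reading: the length says exactly how many bytes to consume; no delimiter needed.
static byte[] readRecord(DataInputStream in) throws IOException {
    int len = in.readInt();    // throws EOFException when the stream ends
    byte[] record = new byte[len];
    in.readFully(record);      // blocks until all len bytes have arrived
    return record;
}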

Read lines of characters and get file position

I'm reading sequential lines of characters from a text file. The encoding of the characters in the file might not be single-byte.
At certain points, I'd like to get the file position at which the next line starts, so that I can re-open the file later and return to that position quickly.
Questions
Is there an easy way to do both, preferably using standard Java libraries?
If not, what is a reasonable workaround?
Attributes of an ideal solution
An ideal solution would:
handle multiple character encodings, including UTF-8, in which different characters may be represented by different numbers of bytes;
rely mostly on a trusted, well-supported library (the standard Java library would be most ideal, with an Apache or Google library as second best);
be scalable: reading the entire file into memory is not a solution, and returning to a position should not require reading all prior characters in linear time.
Details
For the first requirement, BufferedReader.readLine() is attractive. But buffering clearly interferes with getting a meaningful file position.
Less obviously, InputStreamReader also can read ahead, interfering with getting the file position. From the InputStreamReader documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
The method RandomAccessFile.readLine() reads a single byte per character.
Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.
If you construct a BufferedReader from an InputStreamReader wrapped around a FileInputStream, and keep the FileInputStream accessible to your code, you should be able to get the position of the next line by calling:
fileInputStream.getChannel().position();
after a call to bufferedReader.readLine().
The BufferedReader could be constructed with an input buffer of size 1 if you're willing to trade performance gains for positional precision.
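A sketch of that suggestion (the file name is illustrative; with a larger buffer, or any read-ahead by the InputStreamReader, the position overshoots the start of the next line):
FileInputStream fis = new FileInputStream("data.txt");
BufferedReader bufferedReader = new BufferedReader(
        new InputStreamReader(fis, StandardCharsets.UTF_8), 1); // 1-char buffer
String line = bufferedReader.readLine();
long nextLineStart = fis.getChannel().position(); // how far the channel has read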
Alternate Solution
What would be wrong with keeping track of the bytes yourself:
long startingPoint = 0; // or starting position if this file has been previously processed
String line;
while ((line = bufferedReader.readLine()) != null) {
    startingPoint += line.getBytes().length;
    // plus the length of the stripped line terminator (see below)
}
this would give you the byte count accurate to what you've already processed, regardless of underlying marking or buffering. You'd have to account for line endings in your tally, since they are stripped.
This partial workaround addresses only files encoded with 7-bit ASCII or UTF-8. An answer with a general solution is still desirable (as is criticism of this workaround).
In UTF-8:
All single-byte characters can be distinguished from all bytes in multi-byte characters. All the bytes in a multi-byte character have a '1' in the high-order position. In particular, the bytes representing LF and CR cannot be part of a multi-byte character.
All single-byte characters are in 7-bit ASCII. So we can decode a file containing only 7-bit ASCII characters with a UTF-8 decoder.
Taken together, those two points mean we can read a line with something that reads bytes, rather than characters, then decode the line.
To avoid problems with buffering, we can use RandomAccessFile. That class provides methods to read a line, and get/set the file position.
Here's a sketch of code to read the next line as UTF-8 using RandomAccessFile.
protected static String readNextLineAsUTF8(RandomAccessFile in) throws IOException {
    String rv = null;
    String lineBytes = in.readLine(); // each byte becomes one char, high bits zero
    if (null != lineBytes) {
        // ISO-8859-1 maps each char straight back to its original byte,
        // recovering the raw bytes, which are then decoded as UTF-8
        rv = new String(lineBytes.getBytes(StandardCharsets.ISO_8859_1),
                        StandardCharsets.UTF_8);
    }
    return rv;
}
Then the file position can be obtained from the RandomAccessFile immediately before calling that method. Given a RandomAccessFile referenced by in:
long startPos = in.getFilePointer();
String line = readNextLineAsUTF8( in );
The case seems to be solved by VTD-XML, a library able to quickly parse big XML files:
The latest Java VTD-XML XimpleWare implementation, currently 2.13 (http://sourceforge.net/projects/vtd-xml/files/vtd-xml/), provides some code maintaining a byte offset after each call to the getChar() method of its IReader implementations.
IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java
IReader implementations are provided for the following encodings
ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258
I would suggest java.io.LineNumberReader. You can set and get the line number and therefore continue at a certain line index.
Since it is a BufferedReader it is also capable of handling UTF-8.
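As a sketch (data.txt and savedLineNumber are placeholders; note that setLineNumber(int) only changes the counter, so returning to a saved line still means re-reading from the start, in linear time):
try (LineNumberReader r = new LineNumberReader(new FileReader("data.txt"))) {
    while (r.getLineNumber() < savedLineNumber && r.readLine() != null) {
        // skip forward until the remembered line number is reached
    }
    String resumed = r.readLine(); // the line we previously stopped at
}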
Solution A
Use RandomAccessFile.readChar() or RandomAccessFile.readByte() in a loop.
Check for your EOL characters, then process that line.
The problem with anything else is that you would have to absolutely make sure you never read past the EOL character.
readChar() returns a char not a byte. So you do not have to worry about character width.
Reads a character from this file. This method reads two bytes from the file, starting at the current file pointer.
[...]
This method blocks until the two bytes are read, the end of the stream is detected, or an exception is thrown.
By using a RandomAccessFile and not a Reader you are giving up Java's ability to decode the charset in the file for you. A BufferedReader would do so automatically.
There are several ways of overcoming this. One is to detect the encoding yourself and then use the correct read*() method. The other way would be to use a BoundedInputStream.
There is one in this question Java: reading strings from a random access file with buffered input
E.g. https://stackoverflow.com/a/4305478/16549
RandomAccessFile has a function:
seek(long pos)
Sets the file-pointer offset, measured from the beginning of this file, at which the next read or write occurs.
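A small sketch of the remember-and-return pattern this enables (the file name is a placeholder):
try (RandomAccessFile raf = new RandomAccessFile("data.txt", "r")) {
    long mark = raf.getFilePointer(); // byte offset at which the next line starts
    raf.readLine();                   // consume one line
    raf.seek(mark);                   // jump straight back, no linear re-read
}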
Initially, I found the approach suggested by Andy Thomas (https://stackoverflow.com/a/30850145/556460) the most appropriate.
But unfortunately I couldn't succeed in converting the byte array (taken from RandomAccessFile.readLine) to a correct string in cases when the file line contains non-Latin characters.
So I reworked the approach by writing a function similar to RandomAccessFile.readLine itself that collects the line's data not into a string but into a byte array directly, and then constructs the desired String from that byte array.
So the following below code completely satisfied my needs (in Kotlin).
After calling the function, file.channel.position() will return the exact position of the next line (if any):
fun RandomAccessFile.readEncodedLine(charset: Charset = Charsets.UTF_8): String? {
    val lineBytes = ByteArrayOutputStream()
    var c = -1
    var eol = false
    while (!eol) {
        c = read()
        when (c) {
            -1, 10 -> eol = true // \n
            13 -> { // \r
                eol = true
                val cur = filePointer
                if (read() != '\n'.toInt()) {
                    seek(cur)
                }
            }
            else -> lineBytes.write(c)
        }
    }
    return if (c == -1 && lineBytes.size() == 0)
        null
    else
        java.lang.String(lineBytes.toByteArray(), charset) as String
}

Skip a line using BufferedReader (skip, but not read it)

Hi guys, I am currently using BufferedReader to read files. I have something like:
br.readLine() != null
as the condition of each loop iteration.
Now, what should I do if I want to skip a line? I've read several similar questions other people posted; most of them suggested using readLine().
I know calling readLine() once will advance the pointer to the next line. But this is not what I prefer, as I am considering reading performance here. Although you seem to skip a line, the system has actually already read it, so it is not time-efficient. What I want is to move the pointer to the next line, without reading it.
Any good idea?
It's not possible to skip the line without reading it.
In order to know where to skip to, you have to know where the next new line character is, so you have to read it.
P.S. Unless you have a good reason not to, BufferedReader should be fine for you - it's quite efficient
You'll sometimes have a problem skipping the first line because code-analysis tools (Sonar etc.) complain that you didn't use the value. In the case of CSV, you may want to skip (i.e. not use) the first row if it contains headers. In that case, you could stream the lines and skip, perhaps?
try (BufferedReader reader = new BufferedReader(new FileReader(filename))) {
    Stream<String> lines = reader.lines().skip(1);
    lines.forEachOrdered(line -> {
        ...
    });
}
If you care about the memory wasted in the intermediate StringBuffer, you can try the following implementation:
public static void skipLine(BufferedReader br) throws IOException {
    while (true) {
        int c = br.read();
        if (c == -1 || c == '\n')
            return;
        if (c == '\r') {
            br.mark(1);
            c = br.read();
            if (c != '\n')
                br.reset();
            return;
        }
    }
}
Seems that it works nicely for all the EOLs supported by readLine ('\n', '\r', '\r\n').
As an alternative you may extend the BufferedReader class adding this method as instance method inside it.
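A sketch of that subclass idea (the class name is made up):
class SkippingBufferedReader extends BufferedReader {
    public SkippingBufferedReader(Reader in) {
        super(in);
    }

    public void skipLine() throws IOException {
        int c;
        while ((c = read()) != -1 && c != '\n') {
            if (c == '\r') {
                mark(1);
                if (read() != '\n')
                    reset(); // lone \r: give the char back
                return;
            }
        }
    }
}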
In general, it's not possible in any language and with any API.
To skip a line you need to know the next line's offset. Either you have a file format that provides you that information, you are given line offsets as input, or you need to read every single byte just to know when a line ends and the next begins.
Have you read the code of readLine()? It reads chars one by one until it finds a \r or a \n and appends them to a StringBuffer. What you want is this behaviour without the burden of creating a StringBuffer. Just extend the BufferedReader class and provide your own implementation which just reads chars until \r or \n without building the useless String.
The only problem I see is that many fields are private...

Java File Replace Lines

I have a 250 GB big .txt file and I have just 50 GB of space left on my hard drive.
Every line in this .txt file has a long prefix, and I want to delete this prefix to make the file smaller.
First I wanted to read line by line, change each line and write it into another file:
// read line out of first file
line = line.replace(prefix, "");
// write line into second file
The problem is I don't have enough space for that.
So how can I delete all the prefixes out of my file?
Check RandomAccessFile: http://docs.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html
You have to keep track of the position you are reading from and the position you are writing to. Initially both are at the start. Then you read N bytes (one line), shorten it, seek back N bytes and write M bytes (the shortened line). Then you seek forward (N - M) bytes to get back to the position where next line starts. Then you do this over and over again. In the end truncate excess with setLength(long).
You can also do it in batches (like read 4kb, process, write, repeat) to make it more efficient.
The process is identical in all languages. Some make it easier by hiding the seeking back and forth behind an API.
Of course you have to be absolutely sure that your program works flawlessly, since there is no way to undo this process.
Also, the RandomAccessFile is a bit limited, since it cannot tell you at which character position the file is at a given moment. Therefore you have to do the conversion between "decoded strings" and "encoded bytes" as you go. If your file is in UTF-8, a given character in the string can take one or more bytes in the file. So you can't just do seek(string.length()). You have to use seek(string.getBytes(encoding).length) and factor in possible line break conversions (Windows uses two characters for a line break, Unix uses only one). But if you have ASCII, ISO-Latin-1 or a similar trivial character encoding and know what line break chars the file has, then the problem should be pretty simple.
And as I edit my answer to cover all possible corner cases, I think it would be better to read the file using a BufferedReader with the correct character encoding and also open a RandomAccessFile for doing the writing, if your OS supports having a file opened twice. This way you get complete Unicode support from the BufferedReader and you don't have to keep track of separate read and write positions. You have to do the writing with RandomAccessFile because using a Writer on the file may just truncate it (haven't tried it, though).
Something like this. It works on trivial examples but it has no error checking and I absolutely give no guarantees. Test it on a smaller file first.
public static void main(String[] args) throws IOException {
    File f = new File(args[0]);
    BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(f), "UTF-8")); // Use correct encoding here.
    RandomAccessFile writer = new RandomAccessFile(f, "rw");
    String line = null;
    long totalWritten = 0;
    while ((line = reader.readLine()) != null) {
        line = line.trim() + "\n"; // Remove your prefix here.
        byte[] b = line.getBytes("UTF-8");
        writer.write(b);
        totalWritten += b.length;
    }
    reader.close();
    writer.setLength(totalWritten);
    writer.close();
}
You can use RandomAccessFile. That allows you to overwrite parts of the file. And since there is no copy- or caching-mechanism mentioned in the javadoc this should work without additional disk-space.
So you could overwrite the unwanted parts with spaces.
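A sketch of that space-overwrite idea, assuming the prefix has a fixed length in a single-byte encoding (PREFIX_LEN and the file name are placeholders). Note this blanks the prefixes but does not shrink the file:
try (RandomAccessFile raf = new RandomAccessFile("big.txt", "rw")) {
    byte[] spaces = new byte[PREFIX_LEN];
    Arrays.fill(spaces, (byte) ' ');
    long lineStart = 0;
    while (raf.readLine() != null) {
        long nextLine = raf.getFilePointer();
        raf.seek(lineStart);
        raf.write(spaces);   // blank the prefix in place
        raf.seek(nextLine);
        lineStart = nextLine;
    }
}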
Split the 250 GB file into 5 files of 50 GB each. Then process each file and then delete it. This way you will always have 50 GB left on your machine and you will also be able to process 250 GB file.
Since it does not have to be done in Java, I would recommend Python for this:
Save the following in replace.py in the same folder with your textfile:
import fileinput

for line in fileinput.input("your-file.txt", inplace=True):
    # line already ends with its own newline, so suppress print's
    print(line.replace("oldstring", "newstring"), end="")
Replace the two strings with your strings and execute python replace.py
