I use BufferedReader's readLine() method to read lines of text from a socket.
There is no obvious way to limit the length of the line read.
I am worried that the source of the data can (maliciously or by mistake) write a lot of data without any line feed character, and this will cause BufferedReader to allocate an unbounded amount of memory.
Is there a way to avoid that? Or do I have to implement a bounded version of readLine() myself?
The simplest way to do this is to implement your own bounded line reader.
Or, even simpler, reuse the code from this BoundedBufferedReader class.
Actually, coding a readLine() that works the same as the standard method is not trivial. Dealing with the 3 kinds of line terminator CORRECTLY requires some pretty careful coding. It is interesting to compare the different approaches of the above link with the Sun version and Apache Harmony version of BufferedReader.
Note: I'm not entirely convinced that either the bounded version or the Apache version is 100% correct. The bounded version assumes that the underlying stream supports mark and reset, which is certainly not always true. The Apache version appears to read ahead one character if it sees a CR as the last character in the buffer. This would break on (classic) MacOS when reading input typed by the user. The Sun version handles this by setting a flag to cause the possible LF after the CR to be skipped on the next read operation; i.e. no spurious read-ahead.
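For what it's worth, here is a minimal sketch of the idea (the class name and the throw-on-long-line policy are my choices, not the linked code; truncate-and-skip would be equally valid). Like the Sun version it defers the LF check with a flag, so it needs no mark/reset support:

import java.io.IOException;
import java.io.Reader;

class LineLimitingReader {
    private final Reader in;
    private final int maxLineLength;
    private boolean skipNextLf = false; // set after a bare CR, like Sun's flag

    LineLimitingReader(Reader in, int maxLineLength) {
        this.in = in;
        this.maxLineLength = maxLineLength;
    }

    /** Returns the next line, or null at end of stream. */
    String readLine() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (skipNextLf) {
                skipNextLf = false;
                if (c == '\n') continue;  // the LF half of a CRLF pair
            }
            if (c == '\n') return sb.toString();
            if (c == '\r') {
                skipNextLf = true;        // defer the LF check to the next read
                return sb.toString();
            }
            if (sb.length() >= maxLineLength)
                throw new IOException("Line exceeds " + maxLineLength + " chars");
            sb.append((char) c);
        }
        return sb.length() == 0 ? null : sb.toString();
    }
}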
Another option is Apache Commons' BoundedInputStream:
InputStream bounded = new BoundedInputStream(is, MAX_BYTE_COUNT);
BufferedReader reader = new BufferedReader(new InputStreamReader(bounded));
String line = reader.readLine();
A String can hold at most Integer.MAX_VALUE (roughly 2 billion) chars. If you want the limit to be smaller, you need to read the data yourself: read one char at a time from the buffered stream until the limit or a newline char is reached.
Perhaps the easiest solution is to take a slightly different approach. Instead of attempting to prevent a DoS by limiting one particular read, limit the entire amount of raw data read. In this way you don't need to worry about using special code for every single read and loop, so long as the memory allocated is proportionate to incoming data.
You can either meter the Reader, or probably more appropriately, the undecoded Stream or equivalent.
There are a few ways around this:
if the amount of data overall is very small, load data in from the socket into a buffer (byte array, bytebuffer, depending on what you prefer), then wrap the BufferedReader around the data in memory (via a ByteArrayInputStream etc);
just catch the OutOfMemoryError, if it occurs; catching this error is generally not reliable, but in the specific case of catching array allocation failures, it is basically safe (but does not solve the issue of any knock-on effect that one thread allocating large amounts from the heap could have on other threads running in your application, for example);
implement a wrapper InputStream that will only read so many bytes, then insert this between the socket and BufferedReader (see the sketch after this list);
ditch BufferedReader and split your lines via the regular expressions framework (implement a CharSequence whose chars are pulled from the stream, and then define a regular expression that limits the length of lines); in principle, a CharSequence is supposed to be random access, but for a simple "line splitting" regex, in practice you will probably find that successive chars are always requested, so that you can "cheat" in your implementation.
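To illustrate the wrapper-stream option, here is a rough sketch (the class name is mine; Apache Commons' BoundedInputStream does essentially this, more carefully):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Reports end-of-stream once a fixed number of bytes has been consumed.
class LimitedInputStream extends FilterInputStream {
    private long remaining;

    LimitedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) return -1;
        int b = super.read();
        if (b != -1) remaining--;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) return -1;
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }
}

Used like this, the BufferedReader can never buffer more than the limit:

BufferedReader reader = new BufferedReader(
    new InputStreamReader(new LimitedInputStream(socket.getInputStream(), 65536)));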
In BufferedReader, instead of using String readLine(), use int read(char[] cbuf, int off, int len); you can then use boolean ready() to see if you got it all, and convert it into a String using the constructor String(char[] value, int offset, int count).
If you don't care about the whitespace and you just want a maximum number of characters per line, then the proposal Stephen suggested is really simple:
import java.io.BufferedReader;
import java.io.IOException;

public class BoundedReader extends BufferedReader {

    private final int bufferSize;
    private final char[] buffer;

    BoundedReader(final BufferedReader in, final int bufferSize) {
        super(in);
        this.bufferSize = bufferSize;
        this.buffer = new char[bufferSize];
    }

    @Override
    public String readLine() throws IOException {
        int no;

        /* read up to bufferSize chars */
        if ((no = this.read(buffer, 0, bufferSize)) == -1) return null;
        String input = new String(buffer, 0, no).trim();

        /* skip the rest of the line */
        while (no >= bufferSize && ready()) {
            if ((no = read(buffer, 0, bufferSize)) == -1) break;
        }
        return input;
    }
}
Edit: this is intended to read lines from a user terminal. It blocks until the next line, and returns a bufferSize-bounded String; any further input on the line is discarded.
I'm trying to read information sent by a client on Android using the TCP protocol. In my server I have this code:
InputStream input = clienteSocket.getInputStream();
int c = input.read();
c will contain the ASCII value that the client sends.
I can also get this by writing:
BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
I would like to know what is the difference between both methods.
You're comparing apples and oranges here.
Your first example reads one byte from the stream, unbuffered, and returns the value of that byte. (Adding 'ASCII number' to that adds no actual information.)
Your second example sets up a buffered reader, which can read chars from the stream, buffered, but it doesn't actually read anything.
You could set up two further examples:
InputStream is = new BufferedInputStream(socket.getInputStream());
int c = is.read();
This reads a byte, with buffering.
Reader reader = new InputStreamReader(socket.getInputStream());
int c = reader.read();
This reads a char, with a little buffering: not as much as BufferedReader provides.
The realistic choices are between the two buffered versions, for efficiency reasons as outlined by @StephenC, and the choice between them is dictated by whether you want bytes or chars.
The buffered approach is better because (in most cases) it reduces the number of syscalls that the JVM needs to make to the operating system. Since syscalls are relatively expensive, buffering generally gives you better performance.
In your specific example:
Each time you call read() on an unbuffered input stream, you make a syscall.
The first time you call read() (or another read operation) on a buffered input stream, it reads a number of bytes into an in-memory byte array. On the second, third, etc. calls to read(), the read will typically return a byte from the in-memory buffer without making a syscall.
In your example, the only case where using a buffered stream doesn't help would be if you are going to read only one byte from the socket, and then close it.
UPDATE
I didn't notice that you were comparing an unbuffered InputStream with a buffered Reader. As @EJP points out, this is "comparing apples and oranges". The functionality of the two versions is different: one reads bytes and the other reads characters.
(And if you don't understand that distinction ... and why it is an important distinction ... you would be advised to read the Java Tutorial lesson on Basic I/O. Particularly the sections on byte streams, character streams and buffered streams.)
I am looking for some util class/method to take a large String and return an InputStream.
If the String is small, I can just do:
InputStream is = new ByteArrayInputStream(str.getBytes(<charset>));
But when the String is large (1MB, 10MB or more), a byte array 1 to 2 times (or more?) as large as my String is allocated on the spot. (And since you won't know how many bytes to allocate exactly before all the encoding is done, I think there must be other arrays/buffers allocated before the final byte array is allocated.)
I have performance requirements, and want to optimize this operation.
Ideally I think, the class/method I am looking for would encode the characters on the fly one small block at a time as the InputStream is being consumed, thus no big surge of mem allocation.
Looking at the source code of apache commons IOUtils.toInputStream(..), I see that it also converts the String to a big byte array in one go.
And StringBufferInputStream is Deprecated, and does not do the job properly.
Is there such util class/method from anywhere? Or I can just write a couple of lines of code to do this?
The functional need for this is that, elsewhere, I am using a util method that takes an InputStream and streams out the bytes from this InputStream.
I haven't seen other people looking for something like this. Am I missing something somewhere?
I have started writing a custom class for this, but would stop if there is a better/proper/right solution/correction to my need.
The Java built-in libraries assume you'd only need to go from chars to bytes in output, not input. The Apache Commons IO libraries have ReaderInputStream, however, which can wrap a StringReader to get what you want.
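For example (a sketch assuming Commons IO is on the classpath and a hypothetical largeString variable):

import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.ReaderInputStream;

// Encodes chars to bytes lazily, chunk by chunk, as the stream is consumed,
// so no full-size byte array is ever allocated.
InputStream is = new ReaderInputStream(new StringReader(largeString),
    StandardCharsets.UTF_8);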
For me there is a fundamental problem. Why do you have such huge Strings in memory in the first place...
Anyway, you can try this:
public static InputStream largeStringToBytes(final String tooLarge,
    final Charset charset)
    throws CharacterCodingException
{
    final CharsetEncoder encoder = charset.newEncoder()
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    // use the configured encoder; charset.encode() would silently replace
    // unmappable characters instead of reporting them
    final ByteBuffer buf = encoder.encode(CharBuffer.wrap(tooLarge));
    // expose only the bytes actually encoded, not the buffer's spare capacity
    return new ByteArrayInputStream(buf.array(), buf.arrayOffset(), buf.remaining());
}
If you are passing the large string as a parameter then the memory is already allocated. A string that big cannot even be pushed onto the stack (most of the time the max stack size is 1MB), so it is allocated on the heap just to pass it as a parameter. The only way I can see to avoid this would be to create a tree on disk, where you streamed back a character at a time as you walked the tree. If you have multiple large strings, perhaps you can index them in a Trie or a DAWG and walk that structure. This will eliminate many of the duplicate characters between strings. But I would need to know more about what the strings represent to assist further.
Implement your own String-backed input stream:
import java.io.IOException;
import java.io.InputStream;

class StringifiedInputStream extends InputStream {

    private int idx = 0;
    private final String str;
    private final int len;

    StringifiedInputStream(String str) {
        this.str = str;
        this.len = str.length();
    }

    @Override
    public int read() throws IOException {
        if (idx >= len)
            return -1;
        // mask to the 0-255 range required by the InputStream contract;
        // note this only round-trips correctly for chars up to U+00FF
        return str.charAt(idx++) & 0xff;
    }
}
This is slow, but it streams the bytes without duplicating them in a byte array. If speed is an issue, add the 3-arg read method to this implementation, as sketched below.
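For example, the 3-arg override might look like this (a sketch to be added inside StringifiedInputStream, with the same chars-above-U+00FF caveat):

@Override
public int read(byte[] b, int off, int len) throws IOException {
    if (idx >= this.len)
        return -1;
    // copy as many chars as fit, truncating each to a single byte
    int n = Math.min(len, this.len - idx);
    for (int i = 0; i < n; i++)
        b[off + i] = (byte) str.charAt(idx++);
    return n;
}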
Sorry if this question is a duplicate, but I didn't get the answer I was looking for.
The Java docs say this:
In general, each read request made of a Reader causes a corresponding read request to be made of the underlying character or byte stream. It is therefore advisable to wrap a BufferedReader around any Reader whose read() operations may be costly, such as FileReaders and InputStreamReaders. For example,
BufferedReader in = new BufferedReader(new FileReader("foo.in"));
will buffer the input from the specified file. Without buffering, each invocation of read() or readLine() could cause bytes to be read from the file, converted into characters, and then returned, which can be very inefficient.
My first question is: if BufferedReader can read bytes, why can't we work with images (which are bytes) using BufferedReader?
My second question is: does BufferedReader store characters in a BUFFER, and what is the meaning of this line?
will buffer the input from the specified file.
My third question is: what is the meaning of this line?
In general, each read request made of a Reader causes a corresponding read request to be made of the underlying character or byte stream.
There are two questions here.
1. Buffering
Imagine you lived a mile from your nearest water source, and you drink a cup of water every hour. Well, you wouldn't walk all the way to the water for every cup. Go once a day, and come home with a bucket full of enough water to fill the cup 24 times.
The bucket is a buffer.
Imagine your village is supplied water by a river. But sometimes the river runs dry for a month; other times the river brings so much water that the village floods. So you build a dam, and behind the dam there is a reservoir. The reservoir fills up in the rainy season and gradually empties in the dry season. The village gets a steady flow of water all year round.
The reservoir is a buffer.
Data streams in computing are similar to both those scenarios. For example, you can get several kilobytes of data from a filesystem in a single OS system call, but if you want to process one character at a time, you need something similar to a reservoir.
A BufferedReader contains within it another Reader (for example a FileReader), which is the river -- and an array of characters, which is the reservoir. Every time you read from it, it does something like:
    if there are not enough chars in the "reservoir" to fulfil this request
        top up the "reservoir" by reading from the underlying Reader
    endif
    return some chars from the "reservoir"
However when you use a BufferedReader, you don't need to know how it works, only that it works.
2. Suitability for images
It's important to understand that BufferedReader and FileReader are examples of Readers. You might not have covered polymorphism in your programming education yet, so when you do, remember this. It means that if you have code which uses FileReader -- but only the aspects of it that conform to Reader -- then you can substitute a BufferedReader and it will work just the same.
It's a good habit to declare variables as the most general class that works:
Reader reader = new FileReader(file);
... because then this would be the only change you need to add buffering:
Reader reader = new BufferedReader(new FileReader(file));
I took that detour because the point applies to all Readers: it's Readers in general that are less suitable for images.
Reader has two read methods:
int read(); // returns one character, cast to an int
int read(char[] block); // reads into block, returns how many chars it read
The second form is unsuitable for images because it definitely reads chars, not ints.
The first form looks as if it might be OK -- after all, it reads ints. And indeed, if you just use a FileReader, it might well work.
However, think about how a BufferedReader wrapped around a FileReader will work. The first time you call BufferedReader.read(), it will call FileReader.read(buffer) to fill its buffer. Then it will cast the first char of the buffer back to an int, and return that.
Especially when you bring multi-byte charsets into the picture, that can cause problems.
So if you want to read raw bytes as ints, use InputStream, not Reader. InputStream has int read(byte[] buf, int offset, int length) -- bytes cast back and forth to int much more reliably than chars do.
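For example, a sketch of reading raw bytes from a file (the file name is just a placeholder):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Each read() returns an int in the range 0-255, or -1 at end of stream;
// no character decoding ever happens.
try (InputStream in = new BufferedInputStream(new FileInputStream("image.png"))) {
    int b;
    while ((b = in.read()) != -1) {
        // process the raw byte b
    }
} catch (IOException e) {
    e.printStackTrace();
}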
Readers (and Writers) in Java are specialized classes for dealing with text (character) streams -- the concept of a line is meaningless in any other type of stream.
For the general I/O equivalent, have a look at BufferedInputStream.
So, to answer your questions:
While the reader does eventually read bytes, it converts them to characters. It is not intended to read anything else (like images) -- use the InputStream family of classes for that.
A buffered reader will read large blocks of data from the underlying stream (which may be a file, socket, or anything else) into a buffer in memory, and will then serve read requests from this buffer until the buffer is emptied. This behaviour of reading large chunks instead of many small chunks improves performance.
It means that if you don't wrap a reader in a buffered reader, then every time you want to read a single character, it will access the disk/network to get just that single character. Doing I/O in such small chunks is usually terrible for performance.
1. The default behaviour is that it converts bytes to characters. An image is not character data -- you need the raw pixel bytes -- so you cannot use it for images.
2. It is buffering: it reads a chunk of data from the underlying stream into a char array. You can see this behaviour in the code:
public BufferedReader(Reader in) {
    this(in, defaultCharBufferSize);
}
and defaultCharBufferSize is defined as:
private static int defaultCharBufferSize = 8192;
3. Every time you do a read operation, it hands back only one character, served from the buffer.
So in a nutshell, buffered means it reads a chunk of character data into a char array, serves reads from that array, and reads the next chunk only when the first is used up, until it reaches the end of the stream.
You can refer to the following to learn more:
BufferedReader
I am working on a project and have a question about Java sockets. The source file can be found here.
After successfully transmitting the file size in plain text I need to transfer binary data. (DVD .Vob files)
I have a loop such as
// Read this file's size
long fileSize = Integer.parseInt(in.readLine());
// Read the block size they are going to use
int blockSize = Integer.parseInt(in.readLine());
byte[] buffer = new byte[blockSize];

// Bytes "red"
long bytesRead = 0;
int read = 0;

while (bytesRead < fileSize) {
    System.out.println("received " + bytesRead + " bytes" + " of " + fileSize + " bytes in file " + fileName);
    read = socket.getInputStream().read(buffer);
    if (read < 0) {
        // Should never get here since we know how many bytes there are
        System.out.println("DANGER WILL ROBINSON");
        break;
    }
    binWriter.write(buffer, 0, read);
    bytesRead += read;
}
I read a random number of bytes close to 99%. I am using Socket, which is TCP based,
so I shouldn't have to worry about lower layer transmission errors.
The received number changes but is always very near the end
received 7258144 bytes of 7266304 bytes in file GLADIATOR/VIDEO_TS/VTS_07_1.VOB
The app then hangs there in a blocking read. I am confounded. The server is sending the correct
file size and has a successful implementation in Ruby but I can't get the Java version to work.
Why would I read less bytes than are sent over a TCP socket?
The above is because of a bug many of you pointed out below:
BufferedReader ate 8KB of my socket's input. The correct implementation can be found
Here
If your in is a BufferedReader then you've run into the common problem with buffering more than needed. The default buffer size of BufferedReader is 8192 characters which is approximately the difference between what you expected and what you got. So the data you are missing is inside BufferedReader's internal buffer, converted to characters (I wonder why it didn't break with some kind of conversion error).
The only workaround is to read the first lines byte-by-byte without using any buffered reader classes. Java doesn't provide an unbuffered InputStreamReader with readLine() capability as far as I know (with the exception of the deprecated DataInputStream.readLine(), as indicated in the comments below), so you have to do it yourself. I would do it by reading single bytes, putting them into a ByteArrayOutputStream until I encounter an EOL, then converting the resulting byte array into a String using the String constructor with the appropriate encoding.
Note that while you can't use a BufferedReader, nothing stops you from using a BufferedInputStream from the very beginning, which will make byte-by-byte reads more efficient.
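A sketch of that byte-by-byte line reader (the method name is mine; it assumes the header lines are ASCII and LF- or CRLF-terminated):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Consumes bytes only up to and including the newline, so any binary data
// that follows is left untouched in the stream.
static String readLineOfBytes(InputStream in) throws IOException {
    ByteArrayOutputStream line = new ByteArrayOutputStream();
    int b;
    while ((b = in.read()) != -1) {
        if (b == '\n')
            break;          // end of line
        if (b != '\r')
            line.write(b);  // drop the CR of a CRLF pair
    }
    if (b == -1 && line.size() == 0)
        return null;        // end of stream, nothing read
    return new String(line.toByteArray(), StandardCharsets.US_ASCII);
}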
Update
In fact, I am doing something like this right now, only a bit more complicated. It is an application protocol that involves exchanging some data structures that are nicely represented in XML, but they sometimes have binary data attached to them. We implemented this by having two attributes in the root XML element: fragmentLength and isLastFragment. The first one indicates how many bytes of binary data follow the XML part, and isLastFragment is a boolean attribute indicating the last fragment, so the reading side knows that there will be no more binary data. The XML is null-terminated, so we don't have to deal with readLine(). The code for reading looks like this:
InputStream ins = new BufferedInputStream(socket.getInputStream());
while (!finished) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    int b;
    while ((b = ins.read()) > 0) {
        buf.write(b);
    }
    if (b == -1)
        throw new EOFException("EOF while reading from socket");
    // b == 0: the null terminator marking the end of the XML part
    Document xml = readXML(new ByteArrayInputStream(buf.toByteArray()));
    processAnswers(xml);
    Element root = xml.getDocumentElement();
    if (root.hasAttribute("fragmentLength")) {
        int length = DatatypeConverter.parseInt(
            root.getAttribute("fragmentLength"));
        boolean last = DatatypeConverter.parseBoolean(
            root.getAttribute("isLastFragment"));
        int read = 0;
        while (read < length) {
            // split incoming fragment into 4Kb blocks so we don't run
            // out of memory if the client sent a really large fragment
            int l = Math.min(length - read, 4096);
            byte[] fragment = new byte[l];
            int pos = 0;
            while (pos < l) {
                int c = ins.read(fragment, pos, l - pos);
                if (c == -1)
                    throw new EOFException(
                        "Premature EOF while reading fragment");
                pos += c;
                read += c;
            }
            // process fragment
        }
    }
}
Using null-terminated XML for this turned out to be a really great thing as we can add additional attributes and elements without changing the transport protocol. At the transport level we also don't have to worry about handling UTF-8 because XML parser will do it for us. In your case you're probably fine with those two lines, but if you need to add more metadata later you may wish to consider null-terminated XML too.
Here is your problem. In the first few lines of the program you're using in.readLine(), where in is probably some sort of BufferedReader. BufferedReaders read data off the socket in 8K chunks. So when you did the first readLine(), it read the first 8K into the buffer. The first 8K contains your two numbers followed by newlines, then some portion of the head of the VOB file (that's the missing chunk). Now when you switch to using getInputStream() off the socket, you are already 8K into the transmission, assuming you started at zero.
socket.getInputStream().read(buffer); // you can't do this without losing data.
While the BufferedReader is nice for reading character data, switching between binary and character data in a stream is not possible with it. You'll have to switch to using InputStream instead of Reader and convert the first few portions by hand to character data. If you read the file using a buffered byte array you can read the first chunk, look for your newlines and convert everything to the left of that to character data. Then write everything to the right to your file, then start reading the rest of the file.
This used to be easier with DataInputStream, but it doesn't do a good job handling character conversion for you (readLine is deprecated with BufferedReader being the only replacement - doh). Probably should write a DataInputStream replacement that under the covers uses Charset to properly handle string conversion. Then switching between characters and binary would be easier.
Your basic problem is that BufferedReader will read as much data as is available and place it in its buffer. It will give you the data as you ask for it. This is the whole point of buffering, i.e. to reduce the number of calls to the OS. The only safe way to use buffered input is to use the same buffer over the life of the connection.
In your case, you only use the buffer to read two lines, but it is highly likely that 8192 bytes have been read into the buffer (the default size of the buffer). Say the first two lines consist of 32 bytes; this leaves 8160 bytes waiting for you to read. However, you by-pass the buffer to perform the read() on the socket directly, so those 8160 bytes are left in the buffer and end up discarded (the amount you are missing).
BTW: You should be able to see this in a debugger if you inspect the contents of your buffered reader.
Sergei may have been right about data being lost inside the buffer, but I'm not sure about his explanation. (BufferedReaders don't usually hold onto data inside their buffers. He may be thinking of a problem with BufferedWriters, which can lose data if the underlying stream is shut down prematurely.) [Never mind; I had misread Sergei's answer. The rest of this is valid AFAIK.]
I think you have a problem that's specific to your application. In your client code, you start reading as follows:
public static void recv(Socket socket) {
    try {
        BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
        //...
        int numFiles = Integer.parseInt(in.readLine());
... and you proceed to use in for the start of the exchange. But then you switch to using the raw socket stream:
while (bytesRead < fileSize) {
    read = socket.getInputStream().read(buffer);
Because in is a BufferedReader, it's already going to have filled its buffer with up to 8192 bytes from the socket input stream. Any bytes that are in that buffer, and which you don't read from in, will be lost. Your app is hanging because it believes that the server is holding onto some bytes, but the server doesn't have them.
The solution is not to do byte-by-byte reads from the socket (ouch! your poor CPU!), but to use the BufferedReader consistently. Or, to use buffering with binary data, change the BufferedReader to a BufferedInputStream that wraps the socket's InputStream.
By the way, TCP is not as reliable as many people assume it to be. For example, when the server socket closes, it's possible for it to have written data into the socket which then gets lost as the socket connection is shut down. Calling Socket.setSoLinger can help to prevent this problem.
EDIT: Also BTW, you're playing with fire by treating byte and character data as if they're interchangeable, as you do below. If the data really is binary, then the conversion to String risks corrupting the data. Perhaps you want to be writing into a BufferedOutputStream?
// Java is retarded and reading and writing operate with
// fundamentally different types. So we write a String of
// binary data.
fileWriter.write(new String(buffer));
bytesRead += read;
EDIT 2: Clarified (or attempted to clarify :-}) the handling of binary vs. String data.
Right Now I have
;; buffer->string: BufferedReader -> String
(defn buffer->string [buffer]
  (loop [line (.readLine buffer) sb (StringBuilder.)]
    (if (nil? line)
      (.toString sb)
      (recur (.readLine buffer) (.append sb line)))))
This is too slow.
Edit:
I have a BufferedReader
when i try to do (str BufferedReader) it gives me "java.io.BufferedReader#1ce784b"
the above loop is too slow and I run out of memory space.
(clojure.contrib.duck-streams/slurp* your-buffer) ; is what you want
Your code is slow because buffer isn't type-hinted, so every .readLine interop call goes through reflection.
I don't know Clojure, so I can't tell if you have some detail wrong in your code, but using StringBuffer and appending the input line by line is the correct way to do it (well, using a StringBuilder initialized to the expected final size if known would bring significant but not dramatic improvements).
If you run out of memory, then maybe the content of your BufferedReader is simply too large to fit into your memory and there is no way to have it as a single string - in that case, you'll either have to increase your heap size or find a way to process the data one small chunk at a time.
BTW, if you know the size of your input, a more efficient method would be to use a CharBuffer and fill it by using Reader.read() (you'll have to pay attention to the return value and use it in a loop).
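Something like this, for example (a sketch assuming the size is known up front):

import java.io.IOException;
import java.io.Reader;
import java.nio.CharBuffer;

// Fills a pre-sized CharBuffer; read() may deliver the data in several
// passes, hence the loop over its return value.
static String readAll(Reader reader, int expectedSize) throws IOException {
    CharBuffer buf = CharBuffer.allocate(expectedSize);
    while (buf.hasRemaining() && reader.read(buf) != -1) {
        // keep reading until the buffer is full or the stream ends
    }
    buf.flip();
    return buf.toString();
}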
buffer.ToString()? Or in your case, maybe (.toString buffer)?
in java you would do something like;
// takes the reader as a parameter; BufferedReader has no no-arg constructor
public String getStringFromBuffer(BufferedReader bRead) throws IOException {
    String line = null;
    StringBuffer theText = new StringBuffer();
    while ((line = bRead.readLine()) != null) {
        theText.append(line + "\n");
    }
    return theText.toString();
}
I don't know Clojure, just Java. Let's work from there.
Some points to consider:
If your target JVM version is >= 1.5 you can use StringBuilder instead of StringBuffer for a small performance improvement (no synchronization and you don't need it). Read about it here
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/StringBuilder.html
But your big performance cost is probably in the buffer expansion. When you instantiate a StringBuffer/StringBuilder without using the constructor with the capacity argument, you get a small initial capacity.
When starting with a small capacity (the internal buffer size) you get many expansions -- every time you exceed that capacity, its internal buffer is reallocated to a larger one (roughly double the size), which means copying all previously held text to the new buffer.
This is very slow when you are appending more text to an already very large string.
If you have access to the size of the text you are reading (a file size would be an approximation) you can significantly reduce the amount of expansions.
I could also tell you to use the read() method of BufferedReader, the one with 3 arguments:
BufferedReader.read(char[], int, int)
You could then use one of the String class constructors that accept a char array to convert the char buffer into a String:
String.String(char[], int, int)
...however, I suspect that the performance improvement will not be that big, especially when compared with the one of reducing how many StringBuilder expansions you'll have.
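Putting the two ideas together, a sketch (reader and expectedSize are assumed to exist; expectedSize would be an estimate such as the file size):

// Pre-sized builder plus block reads: few expansions, few method calls.
char[] chunk = new char[8192];
StringBuilder sb = new StringBuilder(expectedSize);
int n;
while ((n = reader.read(chunk, 0, chunk.length)) != -1) {
    sb.append(chunk, 0, n);
}
String result = sb.toString();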
Whichever approach you take, you seem to have a memory capacity problem:
In the end you will need at least twice as much memory as the whole text occupies.
Whether you use the StringBuilder/StringBuffer approach or the other one, in the end you will have to copy the text contents to the new String holding the result.
In the end you will probably need to break out of this box:
Are you sure you only have a BufferedReader as a start and a String as an end? You should provide the broader picture!
If this is the broadest picture you have, you will need at least a JVM instance configured with more heap, since you will probably run out of memory with any of these solutions anyway.
use slurp to read (reasonably sized) files in
use spit to write them back out again.