I am currently sending large amounts of data over a Java socket, and I am using the Apache Commons IOUtils library's copyLarge method to send/receive the data. The problem is that copyLarge reads until the input stream returns -1. I have copied a snippet below:
while (-1 != (n = input.read(buffer))) {
    output.write(buffer, 0, n);
    count += n;
}
This method will block until the socket is closed, but the problem is that I want to reuse the socket to send additional large data. The alternatives I see are either to open a new socket for each piece of data being transferred, or to write my own read method that looks for an end-of-stream token (e.g. a newline character).
It has been a while since I've written low-level socket code like this, but am I missing something here? Or is there an easier way to do this?
Do you know how much data you have before sending it over? If so, I'd basically length-prefix the message.
That's much easier to handle cleanly than using an end-of-stream token and having to worry about escaping, over-reading etc.
But yes, you'll need to do something, because TCP/IP is a stream-based protocol. Unless you have some indicator for the end of data somehow, you never know whether there might be some more to come Real Soon Now.
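As an illustration only (the helper name and the use of DataOutputStream are my own, not something this answer prescribes), a length-prefixed send could look roughly like this:
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch: prefix every message with its length so the receiver
// knows exactly how many bytes belong to it and the socket can be reused.
static void sendMessage(DataOutputStream out, byte[] payload) throws IOException {
    out.writeInt(payload.length); // 4-byte length prefix
    out.write(payload);           // the message body
    out.flush();
}
On the receiving side you would read the 4-byte length first and then read exactly that many bytes (for example with DataInputStream.readFully) before treating the message as complete.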
Related
This is one for the tomcat / network experts. I would benchmark / wireshark it but this is pretty demanding and perhaps someone knows the answer offhand.
Comparing these two methods for generating servlet output, which one would be the fastest from a user's perspective:
Writing directly to the servlet output stream:
for (int i = 0; i < 10000; i++) {
    servletOutputStream.print("a"); // print(String); write() takes bytes, not a String
    /* a little bit of delay */
}
Creating a buffer and writing it in one go:
for (int i = 0; i < 10000; i++) {
    stringBuffer.append("a");
}
servletOutputStream.print(stringBuffer.toString());
I can imagine the PROs of method 1 would be that the response can start sending stuff quickly while in method 2 the sending starts later.
On the other hand method 1 could generate more / smaller TCP packets which in turn could take longer to transmit completely?
Regards
PS: Please, don't tell me this is premature optimization. In the case at hand I have an object which offers both toString and write(Appendable a) methods. I simply have to choose which one to use here. Additionally, I find this very interesting from a theoretical point of view and regarding the general design of servlets.
EDIT: Thanks all for the answers. But it seems I was too unclear in my question or oversimplified my example.
I'm not worried about not buffering at all. I know that there must be buffering in at least one place in the sending queue. Probably it is in multiple places (Java, OS, hardware). I think the real question I have is this: When are these buffers flushed?
So to make it clearer, let's assume we have an MTU of 1000 and sending of consecutive packets is triggered by a buffer-empty interrupt from the hardware. Then in the first case it could look like:
. packet( "a" ) //triggered by the first write( "a" ),
. packet( "aaaaaaa" ) // triggered by buffer-empty, sending the amount of "a"s which have been written in the meantime
. packet( "aaaa" ) // and so on
. packet( "aaaaaaaaaaa" )
...x1000 // or so in this example
While for the second case there are all 10000 bytes already available when sending starts and so the result would be:
. packet( "aaaa....a(x1000)" )
. packet( "aaaa....a(x1000)" )
...x10
Even for smaller data sizes (smaller than the MTU, let's say 100 "a"s), if the output is created faster than it can be sent, the result could look like:
. packet( "a" ) // first write
. packet( "aaaa...a(x99) ) // all remaining data available when buffer-empty interrupt.
Of course all this would be quite different if the buffer(s) were working differently, e.g. if they were waiting for more data before sending, or waiting for a flush before sending anything at all... (but this in turn would slow down sending in some respects, too)
So this is what I don't know: how exactly does this buffering within Tomcat work, and what would be the best strategy for using it?
(And I'm not worrying or expecting larger speed gains. I just like to know how things work.)
I expect that the ServletOutputStream is actually an instance of
org.apache.tomcat.core.BufferedServletOutputStream
which (as the name suggests) is a buffered stream. That means it is better to write characters directly to the stream rather than assembling them in a StringBuffer or StringBuilder and writing the result. Writing directly will avoid at least one copy of the characters.
If it turns out that your ServletOutputStream is not buffered already, then you can wrap it in a BufferedOutputStream, and you will get the same result.
Assuming now that you are talking about the streams. (Flushing a StringBuffer has no meaning.)
When are these buffers flushed?
When they are full, when you call flush on the stream, or when the stream is closed.
... and what would be the best strategy of using it?
In general, write the data and, when you are finished, close the stream. Don't flush explicitly unless there is a good reason to do so. There rarely is, if you are delivering ordinary HTTP responses. (A flush is liable to cause the network stack to transmit the same amount of information using more network packets. That could hurt overall network throughput.)
In the case of the servlet framework, I recall that the Servlet specification says that a ServletOutputStream will automatically be flushed and closed when the request/response processing is finished. Provided that you didn't wrap the ServletOutputStream, you don't even need to close the stream. (It does no harm though.)
There's no doubt that writing directly to the output stream will be faster for a number of reasons:
The output buffer is fixed
The output buffer will be flushed automatically when it's full (and I'd argue that it doesn't matter when this happens, so stop worrying about it)
The output buffer will be re-used
Your StringBuilder can grow very large, taking up lots of heap space
Your StringBuilder will re-allocate its space at intervals, causing new objects to be created, data copied all over the place, etc
All that memory activity will create "garbage" that the GC will have to deal with
However
I would argue that your analysis isn't taking into account a very important factor: detection of, and recovery from, errors.
If you have a semi-complex procedure that your servlet is performing, it could fail at any time. If it fails after rendering half of the output, you will be unable to do any of the following things:
Issue an "error" HTTP status code (e.g. 500 Server Error)
Redirect the user to another page (error page?)
Show a nice error message on the screen without ruining/interrupting the page
So, even though the manually-buffered approach (based upon the StringBuilder) is less efficient, I believe it gives you a great deal of flexibility for handling errors.
This is more of a religious argument than anything else, but you'll find many web application programmers who would say that your servlet should produce no output at all, and the task of generating responses should be delegated to another component more suited to the task (e.g. JSP, Velocity, FreeMarker, etc.).
If you are, however, writing a servlet with an eye towards raw speed, then by all means: write directly to the output stream. It will give you the best performance in both micro-benchmarks and overall speed under load.
EDIT 2016-01-26
When [are] these buffers flushed?
The servlet spec makes no guarantees about whether the ServletOutputStream is buffered, but not using a buffer would be a practical mistake: sending TCP packets one character at a time would certainly be awful for performance.
If you absolutely need to make sure that the response is buffered, you must use your own BufferedOutputStream, because the servlet container could change its implementation at any time and, as mentioned, is not guaranteed to buffer your response for you.
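For illustration, a sketch of doing that yourself (the method name and buffer size are my own choices, not anything the spec mandates):
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.servlet.http.HttpServletResponse;

// Sketch: guarantee buffering regardless of what the container does internally.
void writeBody(HttpServletResponse response, byte[] body) throws IOException {
    OutputStream out = new BufferedOutputStream(response.getOutputStream(), 8192);
    out.write(body);
    out.flush(); // push anything still sitting in our own buffer down to the container
}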
How exactly is this buffering within Tomcat working?
The buffering currently implemented in Tomcat works just like buffering in the standard JDK classes: when the buffer fills, it is flushed to the lower stream, and any remaining bytes from the write stay in the buffer after the call returns.
If you manually call flush on the stream, you'll force the use of Transfer-Encoding: chunked which means that additional data will need to be sent over the wire, because there is no Content-Length (unless you manually set one before you start filling the buffer). If you can avoid chunked-encoding, you can save yourself some network traffic. Also, if the client knows the Content-Length of the response, they can show an accurate progress bar when downloading the resource. With chunked encoding, the client never knows how much data is coming until it's all been downloaded.
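As a sketch of avoiding chunked encoding (the method name is mine, and it assumes the whole body can be rendered up front):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.servlet.http.HttpServletResponse;

// Sketch: render the body first, set Content-Length, then write it in one go,
// so the container never needs to fall back to Transfer-Encoding: chunked.
void sendWithLength(HttpServletResponse response, String body) throws IOException {
    byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
    response.setContentLength(bytes.length); // must be set before the response is committed
    response.getOutputStream().write(bytes);
}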
Wrap your servletOutputStream in a BufferedOutputStream (unless it already is one) and you don't need to worry about silly things like that.
I would definitely use the first one. The servlet output stream is buffered, so you don't have to worry about sending it too fast. Also, with the second one you allocate a new string every time, which might impose GC overhead over time. Use the first one and call flush after the loop.
It's already buffered, and in some cases it is written to a ByteArrayOutputStream so that Tomcat can prepend the Content-Length header. Don't worry about it.
In the following scenario
ObjectOutputStream output = new ObjectOutputStream(socket.getOutputStream());
output.flush();
// Do stuff with it
Why is it always necessary to flush the buffer after initial creation?
I see this all the time and I don't really understand what has to be flushed. I kind of expect newly created variables to be empty unless otherwise is specified.
Kind of like buying a trash-can and finding a tiny pile of trash inside that came with it.
In over 15 years of writing Java on a professional level I've never once encountered a need to flush a stream before writing to it.
The flush operation would do nothing at all, as there's nothing to flush.
You want to flush the stream before closing it. Though the close operation should do that for you, it is often considered best practice to do it explicitly (and I have encountered situations where that did make a difference, where apparently the close operation did not actually do a flush first).
Maybe you are confused with that?
When you write data out to a stream, some amount of buffering will occur, and you never know for sure exactly when the last of the data will actually be sent. You might perform many write operations on a stream before closing it, and invoking the flush() method guarantees that the last of the data you thought you had already written actually gets out to the file. Whenever you're done using a file, either reading from it or writing to it, you should invoke the close() method. When you are doing file I/O you're using expensive and limited operating system resources, so when you're done, invoking close() will free up those resources.
This is needed when using ObjectInputStream and ObjectOutputStream, because they send a header over the stream before the first write is called. The call to flush() will send that header to the remote side.
According to the spec, the header consists of the following contents:
magic version
If the header hasn't arrived by the time an ObjectInputStream is constructed, that constructor will hang until it receives the header bytes.
This means that if the protocol in question is written with object streams, it should flush after creating an ObjectOutputStream.
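A minimal sketch of the usual construction order, mirroring the snippet in the question (variable names are mine):
// Construct the ObjectOutputStream first and flush it so the serialization
// header goes out; otherwise the peer's "new ObjectInputStream(...)" call
// can block waiting for that header.
ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream());
out.flush(); // pushes the header to the other side
ObjectInputStream in = new ObjectInputStream(socket.getInputStream());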
I have several questions-
1. I have two computers connected by a socket connection. When the program executes
outputStream.writeInt(value);
outputStream.flush();
what actually happens? Does the program wait until the other computer reads the integer value?
2. How can I empty the outputStream or inputStream? Meaning, when emptying the outputStream or inputStream, whatever is written to that stream gets removed. (Please don't suggest doing it by closing the connection!)
I tried to empty the inputStream this way-
byte[] eatup = new byte[20 * 1024];
int available = 0;
while (true) {
    available = serverInputStream.available();
    if (available == 0)
        break;
    serverInputStream.read(eatup, 0, available);
}
eatup = null;
String fileName = (String) serverInputStream.readObject();
The program should not process this last line, as nothing else is being written to the outputStream.
But my program executes it anyway and throws a java.io.OptionalDataException.
Note: I am working on a client-server file transfer project. The client sends files to the server. The second code snippet is for the server terminal. If the 'cancel button' is pressed on the server end, the server stops reading bytes from the serverInputStream and sends a signal (I used int -1) to the client. When the client receives this signal it stops sending data to the server, but I've noticed that the serverInputStream is not empty. So I need to empty this serverInputStream so that the client computer is able to send files to the server computer again (that's why I can't manage a lock from the read method).
1 - No. On flush() the data will be written to the OS kernel, which will likely immediately hand it to the network card driver, which in turn will send it to the receiving end. In a nutshell, the send is fire-and-forget.
2 - As Jeffrey commented, available() is not reliable for this sort of operation. If you're doing blocking IO then, as he suggests, you should just use read() speculatively. However, it should be said that you really need to define a protocol on top of the raw streams, even if it's just using DataInputStream/DataOutputStream. When using raw write/read, the golden rule is: one write != one read. For example, if you were to write 10 bytes on one side and had a reading loop on the other, there is no guarantee that one read will read all 10 bytes. It may be "read" as any combination of chunks. Similarly, two writes of 10 bytes might appear as one read of 20 bytes on the receiving side. Put another way, there is no concept of a "packet" unless you create a higher-level protocol on top of the raw bytes to do packets. An example would be each send being prefixed by a byte length, so the receiving side knows how much data to expect in the current packet.
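As a sketch of that golden rule (the method name is mine), reading a fixed number of bytes has to loop, because a single read() may return only part of what was written:
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Sketch: keep reading until exactly 'length' bytes have arrived.
static byte[] readExactly(InputStream in, int length) throws IOException {
    byte[] buffer = new byte[length];
    int offset = 0;
    while (offset < length) {
        int n = in.read(buffer, offset, length - offset);
        if (n == -1) {
            throw new EOFException("stream ended after " + offset + " of " + length + " bytes");
        }
        offset += n;
    }
    return buffer;
}
DataInputStream.readFully does essentially the same thing for you.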
If you do need to do anything more complicated than a basic app, I strongly encourage you to investigate some higher-level libraries that have solved many of the gnarly issues of network IO. I would recommend Netty, which I use for production apps. However, it is quite a big leap in understanding from simple IO streams to Netty's more event-based system. There may be other libraries somewhere in the middle.
It is possible to skip data from an InputStream
in.skip(in.available());
but if you want to do something similar with OutputStream I've found
socket.getOutputStream().flush();
But that's not the same, flush will transmit the buffered data instead of ignoring it.
Is there any possibility of deleting buffered data?
Thanks
EDIT
The situation is a client-server application: when a new command is sent (from the client), it tries to be sure that the answer read will correspond to the last command sent.
Some commands are sent by (human-fired) events, and others are sent by automatic threads.
If a command is in the buffer and a new one is sent, then the answer will be for the first one, causing desynchronization.
Of course a synchronized method plus a flag called "waitingCommand" could be the safer approach, but as the communication is not reliable, this approach is slow (it depends on timeouts). That's why I've asked about the skip method.
You can't remove data you have already written. You can write the data into an in-memory OutputStream like ByteArrayOutputStream and copy only the portions you want.
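A sketch of that idea (the method name and the stillWanted flag are placeholders of mine):
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch: stage the bytes in memory first; only copy them to the real socket
// stream if they are still wanted when it is time to send.
static void sendIfStillWanted(OutputStream socketOut, byte[] data, boolean stillWanted) throws IOException {
    ByteArrayOutputStream staging = new ByteArrayOutputStream();
    staging.write(data);
    if (stillWanted) {
        staging.writeTo(socketOut);
        socketOut.flush();
    }
}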
I'm not sure if it makes sense, but you can try:
import java.io.BufferedOutputStream;
import java.io.OutputStream;

class MyBufferedOutputStream extends BufferedOutputStream {

    public MyBufferedOutputStream(OutputStream out) {
        super(out);
    }

    /** Throw away everything in the buffer without writing it. */
    public synchronized void skip() {
        count = 0; // 'count' is BufferedOutputStream's protected count of buffered bytes
    }
}
What does it mean to "skip" outputting data?
Once the data is in the buffer, there's no way to get it back or remove it. I suggest checking if you want to skip the data before you write it to the OutputStream. Either that, or have your own secondary buffer that you can modify at will.
This question doesn't make any sense. Throwing away pending requests will just make your application protocol problem worse. What happens to the guy that is waiting for the response to the request that got deleted? What happened to the functionality that that request was supposed to implement? You need to rethink all this from another point of view. If you have a single connection to a server that is executing request/response transactions for this client, the protocol is already sequential. You will have to synchronize on e.g. the socket at the point of writing & flushing the request and reading the response, but you're not losing any performance by this as the processing at the other end is sequentialized anyway. You don't need a 'waitingCommand' flag as well, just synchronization.
Since you are controlling the data written to the OutputStream, just don't write the pieces that you don't need. OutputStream, by contract, does not specify when data is actually written, so it doesn't make much sense to have a skip method.
The best you can do to "ignore" output data, is not to write it at first.
I'm coding a tool that, given any URL, will periodically fetch its output. The problem is that the output might not be a simple and lightweight HTML page (the expected case in most instances), but some heavy data stream (e.g. straight from /dev/urandom; a possible DoS attack).
I'm using java.net.URL + java.net.URLConnection, setting connection and read timeouts to 30 sec. Currently the input is being read by a java.io.BufferedReader, using readLine().
Possible solutions:
Use java.io.BufferedReader.read() character by character, counting them and closing the connection after the limit has been reached. The problem is that an attacker may transmit one byte every 29 sec, so that the read/connection timeout would almost never occur (204800 B * 29 sec = 68 days).
Limit the thread's execution to 1-5 min and use java.io.BufferedReader.readLine(). Any problems here?
I feel like I'm trying to reinvent the wheel and that the solution is very straightforward; it just doesn't come to my mind.
Thanks in advance.
You could encapsulate this by writing yourself a FilterInputStream that enforces whatever you want to enforce and placing it at the bottom of the stack, around the connection's input stream.
However this and the remedies you suggest only work if the output is arriving in chunked transfer mode. Otherwise HttpURLConnection can buffer the entire response before you read any of it. The usual solution to this is a filter in the firewall.
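A minimal sketch of such a filter (the class name, limit field, and exception choice are mine):
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: fail once more than 'limit' bytes have been read from the wrapped stream.
class ByteLimitInputStream extends FilterInputStream {
    private final long limit;
    private long count;

    ByteLimitInputStream(InputStream in, long limit) {
        super(in);
        this.limit = limit;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1 && ++count > limit) {
            throw new IOException("response larger than " + limit + " bytes");
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0 && (count += n) > limit) {
            throw new IOException("response larger than " + limit + " bytes");
        }
        return n;
    }
}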
There seem to be a number of avenues for denial of service here.
A huge line that gobbles memory. Probably the easiest defence is to use a MeteredInputStream before even hitting the character decoding. Reading char by char will be extremely slow in any circumstance. You could read a long char[] at a time, but that will likely overcomplicate the code.
Dealing with an adversary (or bug) keeping many connections alive at once. You probably want non-blocking I/O to read the whole message, and then proceed normally.