How to investigate a Java socket program performance issue

I have two variations of the same pair of Java programs: [Server.java and Client.java] and [ServerTest.java and ClientTest.java]. They both do the same thing: the client connects to the server and sends pairs of integers to be multiplied, and the result is returned to the client, where it is printed. This is performed 100 times in each version.
However, in the Test version, I create and close a new socket for each integer pair and its multiplication (100 multiplications are performed). In the normal version, I open a single persistent socket, perform all interaction with the client over it, and close it afterward.
Intuitively, I thought the approach where I create one persistent socket would be a little faster than creating, accepting and closing a socket each time - in reality, the approach where a new socket is created, accepted and closed is noticeably faster. On average, the persistent socket approach takes around 8 seconds, whereas the approach that creates a new socket every time takes around 0.4 seconds.
I checked the system call activity of both and noticed nothing different between the two. I then tested the same programs on another computer (macOS Sierra) and there was a negligible difference between the two. So it seems the problem doesn't lie with the application code itself but with how it interacts with the OS (I'm running Ubuntu 16.04 LTS).
Does anyone know why there is such a difference in performance here, or how the issue could be investigated further? I've also checked system-wide metrics (memory usage and CPU usage) while executing the programs, and there seems to be plenty of free memory and the CPUs have plenty of idle time.
See the code snippet of how both approaches differ below:
Creating new socket every time approach:
// this is called one hundred times
public void listen() {
    try {
        while (true) {
            // Listens for a connection to be made to this socket.
            Socket socket = my_serverSocket.accept();
            DataInputStream in = new DataInputStream(socket.getInputStream());
            // Read in the numbers
            int numberOne = in.readInt();
            int numberTwo = in.readInt();
            int result = numberOne * numberTwo;
            DataOutputStream out = new DataOutputStream(socket.getOutputStream());
            out.writeInt(result);
            // tidy up
            socket.close();
        }
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } catch (SecurityException se) {
        se.printStackTrace();
    }
}
Persistent socket approach:
public void listen() {
    try {
        while (true) {
            // Listens for a connection to be made to this socket.
            Socket socket = my_serverSocket.accept();
            for (int i = 0; i < 100; i++) {
                DataInputStream in = new DataInputStream(socket.getInputStream());
                // Read in the numbers
                int numberOne = in.readInt();
                int numberTwo = in.readInt();
                int result = numberOne * numberTwo;
                DataOutputStream out = new DataOutputStream(socket.getOutputStream());
                out.writeInt(result);
            }
            // tidy up
            socket.close();
        }
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } catch (SecurityException se) {
        se.printStackTrace();
    }
}

You didn't show us the code that sends the integers to be multiplied. Do you have a loop in which each iteration sends a pair and receives the result? If so, make sure to turn off Nagle's algorithm.
Nagle's algorithm tries to overcome the "small-packet problem", i.e. an application repeatedly emitting data in small chunks. This leads to huge overhead, since the packet header is often much larger than the data itself. The algorithm essentially combines a number of small outgoing messages and sends them all at once. If not enough data has been gathered, the algorithm still sends the message, but only after a timeout has elapsed.
In your case, you were writing small chunks of data into the socket on both the client and the server side. The data weren't transmitted immediately; instead the socket waited for more data to arrive (which never did), so each time a timeout had to elapse.
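For illustration, here is a rough client-side sketch of turning Nagle's algorithm off with Socket.setTcpNoDelay(true). The class name, host, and port are placeholders, not taken from the question:
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;

public class ClientSketch {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket("localhost", 4444)) {
            // Disable Nagle's algorithm: each small write is sent immediately
            // instead of being held back while the stack waits for more data.
            socket.setTcpNoDelay(true);

            DataOutputStream out = new DataOutputStream(socket.getOutputStream());
            DataInputStream in = new DataInputStream(socket.getInputStream());

            for (int i = 0; i < 100; i++) {
                out.writeInt(i);     // first factor
                out.writeInt(i + 1); // second factor
                out.flush();
                System.out.println(in.readInt()); // result from the server
            }
        }
    }
}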

Actually, the only difference between these two pieces of code is NOT how they handle incoming connections (one persistent socket or not); the difference is that in the one you call "persistent", 100 pairs of numbers are multiplied per connection, whereas in the other one, only one pair of numbers is multiplied and returned. This could explain the difference in time.

Related

Nearly no performance gain between single and multiple consumers using LMAX Disruptor / how to decode many UDP packets properly

I have to transfer large files (up to 10 GB) using UDP. Unfortunately, TCP cannot be used in this use case because bidirectional communication between sender and receiver is not possible.
Sending a file is not the problem. I have written the client using netty. It reads the file, encodes it (unique ID, position in the stream and so on) and sends it to the destination at a configurable rate (packets per second). All the packets are received at the destination. I have used iptables and Wireshark to verify that.
The problem occurs on the recipient side. Receiving up to 90K packets a second works fine. But receiving and decoding them at that rate is not possible using a single thread.
My first approach was to use thread-safe queues (one producer and multiple consumers). But using multiple consumers did not lead to better results. Some packets were still lost. It seems that the overhead (locking/unlocking the queue) slows down the process. So I decided to use the LMAX Disruptor with a single producer (receiving the UDP datagrams) and multiple consumers (decoding the packets). But surprisingly, this does not lead to success either. There is hardly any speed advantage in using two LMAX consumers, and I wonder why.
This is the main part that receives the UDP packets and calls the disruptor:
public void receiveUdpStream(DatagramChannel channel) {
    boolean exit = false;
    // the size of the UDP datagram
    int size = shareddata.cr.getDatagramsize();
    // the number of decoders (configurable)
    int nn_decoders = shareddata.cr.getDecoders();
    Udp2flowEventFactory factory = new Udp2flowEventFactory(size);
    // the size of the ring buffer
    int bufferSize = 1 << 10;
    Disruptor<Udp2flowEvent> disruptor = new Disruptor<>(
            factory,
            bufferSize,
            DaemonThreadFactory.INSTANCE,
            ProducerType.SINGLE,
            new YieldingWaitStrategy());
    // my consumers
    Udp2flowDecoder[] decoder = new Udp2flowDecoder[nn_decoders];
    for (int i = 0; i < nn_decoders; i++) {
        decoder[i] = new Udp2flowDecoder(i, shareddata);
    }
    disruptor.handleEventsWith(decoder);
    RingBuffer<Udp2flowEvent> ringBuffer = disruptor.getRingBuffer();
    Udp2flowProducer producer = new Udp2flowProducer(ringBuffer);
    disruptor.start();
    while (!exit) {
        try {
            ByteBuffer buf = ByteBuffer.allocate(size);
            channel.receive(buf);
            receivedDatagrams++; // counting the received packets
            buf.flip();
            producer.onData(buf);
        } catch (Exception e) {
            logger.debug("got exception " + e);
            exit = true;
        }
    }
}
My LMAX event is simple:
public class Udp2flowEvent {
    ByteBuffer buf;

    Udp2flowEvent(int size) {
        this.buf = ByteBuffer.allocateDirect(size);
    }

    public void set(ByteBuffer buf) {
        this.buf = buf;
    }

    public ByteBuffer getEvent() {
        return this.buf;
    }
}
And this is my factory:
public class Udp2flowEventFactory implements EventFactory<Udp2flowEvent> {
    private int size;

    Udp2flowEventFactory(int size) {
        super();
        this.size = size;
    }

    public Udp2flowEvent newInstance() {
        return new Udp2flowEvent(size);
    }
}
The producer:
public class Udp2flowProducer {
    private final RingBuffer<Udp2flowEvent> ringBuffer;

    public Udp2flowProducer(RingBuffer<Udp2flowEvent> ringBuffer) {
        this.ringBuffer = ringBuffer;
    }

    public void onData(ByteBuffer buf) {
        long sequence = ringBuffer.next(); // Grab the next sequence
        try {
            Udp2flowEvent event = ringBuffer.get(sequence);
            event.set(buf);
        } finally {
            ringBuffer.publish(sequence);
        }
    }
}
The interesting but very simple part is the decoder. It looks like this.
public void onEvent(Udp2flowEvent event, long sequence, boolean endOfBatch) {
    // each consumer decodes its own packets
    if (sequence % nn_decoders != decoderid) {
        return;
    }
    ByteBuffer buf = event.getEvent();
    event = null; // is it faster to null the event?
    shareddata.increaseReceiveddatagrams();
    // header fields
    // some code omitted, but it looks something like this
    final int headertype = buf.getInt();
    final int headerlength = buf.getInt();
    final long payloadlength = buf.getLong();
    // decoding the ints and the long works fine,
    // but decoding the remaining part does not!
    byte[] payload = new byte[buf.remaining()];
    buf.get(payload);
    // some code omitted. The payload is used later on...
}
And here are some interesting facts:
All decoders work; I can see the configured number of decoder threads running.
All packets are received, but the decoding takes too long. More precisely: decoding the first two ints and the long value works fine, but decoding the payload takes too long. This leads to back-pressure and some packets are lost.
Fun fact: the code works fine on my MacBook Air but does not work on my server. (MacBook: Core i7; server: ESXi with 8 virtual cores on a Xeon @ 2.6 GHz and no load at all.)
Now my questions, and I hope that somebody has an idea:
Why does using several consumers hardly make a difference? The difference is only about 5%.
In general: what is the best way to receive 60K (or more) UDP packets per second and decode them? I tried netty as the receiver, but UDP did not scale very well.
Why is decoding the payload so slow?
Are there any errors that I have overlooked?
Should I use another producer/consumer library? LMAX has very low latency, but what about throughput?
Ring buffers don't seem like the right technology for this problem: when a ring buffer has filled all its capacity it will block, and it is also an inherently sequential architecture. You need to know in advance the highest number of packets to expect and size the buffer for that. Also, UDP is lossy unless you implement a message-assurance protocol.
Not sure why you say TCP is not bidirectional; it is, and it takes care of lost packets.
To cope with the data flood, you may need to distribute the incoming packets to separate servers if a single one is insufficient. A bounded queue should work to absorb a flood of data (see the sketch below). You may need a large number of decoders waiting if you want to process this volume of data in near real time.
Suggest you use TCP.
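For what it's worth, here is a rough sketch of the bounded-queue idea from this answer. The queue capacity, thread counts, and the decode step are placeholders, not the original poster's code:
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class QueuedReceiverSketch {
    // Bounded queue absorbs bursts; the capacity is a guess and needs tuning.
    private final BlockingQueue<ByteBuffer> queue = new ArrayBlockingQueue<>(64 * 1024);

    public void receive(DatagramChannel channel, int datagramSize, int decoders) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(decoders);
        for (int i = 0; i < decoders; i++) {
            pool.execute(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        ByteBuffer buf = queue.take(); // blocks until a packet is available
                        decode(buf);                   // hypothetical decode step
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        while (true) {
            ByteBuffer buf = ByteBuffer.allocate(datagramSize);
            channel.receive(buf);
            buf.flip();
            queue.put(buf); // blocks (back-pressure) if the decoders fall behind
        }
    }

    private void decode(ByteBuffer buf) {
        // placeholder for the real packet decoding
    }
}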

How can I reduce the run time of this code?

InetAddress localhost = null;
try {
    localhost = InetAddress.getLocalHost();
} catch (UnknownHostException ex) {
    /* Purposely empty */
}
byte[] ip = localhost.getAddress();
int i = 1;
while (i <= 254) {
    ip[3] = (byte) i;
    InetAddress address = null;
    try {
        address = InetAddress.getByAddress(ip);
    } catch (UnknownHostException ex) {
        /* Purposely empty */
    }
    String hostName = address.getHostName();
    if (!address.getHostAddress().equals(address.getHostName())) {
        list.addElement(hostName);
    }
    i++;
}
(My problem is that the run time is long. How can I reduce the run time of this code?)
I had a similar issue involving network lookups for IP addresses. The issue of network latency is just as "that other guy" said: it's driven by the network, i.e. how long and how many hops it takes to reach the destination.
The only solution I found was threading out the lookups, InetAddress.getByAddress(ip) in your case. My solution was to set up an ExecutorService with 10 threads, package each InetAddress.getByAddress(ip) into a Callable, monitor the Callable for completion, then package another one and start it. Have a look at one of my questions on this forum related to this very issue:
ExecutorService - How to set values in a Runnable/Callable and reuse
Be cautious with the ExecutorService (as I found out). The number of threads really depends on the number of CPUs (power) of your runtime hardware. Too many threads and it'll grind to a halt (trust me on this). Too few threads and your time reduction may not be reached.
I've left the department in my Company where I implemented the final solution, so I don't have the code readily available. But the above link provides some basic code on the ExecutorService and using Callable objects.
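As a rough sketch of that approach (not my original code): it uses a fixed pool of 10 threads, wraps each lookup in a Callable, and just prints the resolved names instead of adding them to the original list. The class name and thread count are assumptions:
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelLookupSketch {
    public static void main(String[] args) throws Exception {
        byte[] ip = InetAddress.getLocalHost().getAddress();
        ExecutorService pool = Executors.newFixedThreadPool(10); // thread count needs tuning
        List<Future<String>> futures = new ArrayList<>();

        for (int i = 1; i <= 254; i++) {
            byte[] candidate = ip.clone();
            candidate[3] = (byte) i;
            futures.add(pool.submit(() -> {
                // Reverse lookup; may block for seconds on hosts that do not resolve.
                InetAddress address = InetAddress.getByAddress(candidate);
                String name = address.getHostName();
                return name.equals(address.getHostAddress()) ? null : name;
            }));
        }

        for (Future<String> f : futures) {
            String name = f.get(); // waits for that lookup to finish
            if (name != null) {
                System.out.println(name);
            }
        }
        pool.shutdown();
    }
}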

Thread.Sleep crashes my app

This app talks to a serial device over a USB-to-serial dongle. I have been able to get it to process my single queries with no problem, but I have a command that sends multiple queries to the serial device, and it seems to me the buffer is getting overrun. Here is part of my code:
This is my array with 20 query commands:
String[] stringOneArray = {
    ":000101017d", ":0001060178", ":00010B016C", ":000110017D",
    ":0001150178", ":00011A016C", ":00011F0167", ":0001240178", ":0001290173",
    ":00012E0167", ":0001330178", ":0001380173", ":00013D0167", ":0001420178",
    ":0001470173", ":00014C0167", ":0001510178", ":0001560173", ":00015B0167", ":0001600178"
};
This is how I use the array:
getVelocitiesButton.setOnClickListener(new View.OnClickListener() {
    public void onClick(View v) {
        ftDev.setLatencyTimer((byte) 16);
        int z;
        for (z = 0; z < 19; z++) {
            String writeData = stringOneArray[z];
            byte[] OutData = writeData.getBytes();
            ftDev.write(OutData, writeData.length());
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) { }
        }
    }
});
Not sure the rest of the code is necessary but will add it if needed.
So ftDev is my serial device. It sends the query command to the serial device, which responds in bytes; I use a for loop to build the response until all bytes arrive (31 bytes per response), then I process that response, and at that point the next query command from the array should be sent, and so on until the last command. It is all fine and dandy if I allow the for loop to send only one or two queries, but with a larger number of array indices it crashes. I figured I'd just slow down the for loop by adding the Thread.sleep, but that freezes the app and it crashes... What gives? Is there any other way to control the speed at which the commands are sent? I'd rather send them as soon as possible, but I'm afraid I don't know Java well enough. This has been my major stumbling block in finishing this personal project; I've been stuck for two days researching and trying solutions.
Looks like you're sleeping for roughly a second in total (950 ms to be exact, since the loop only runs 19 times and your last command is never sent to the serial device), plus the time needed to perform the writes over your serial connection. That's a pretty long time to do nothing, and you're doing it on the UI thread. Remove the Thread.sleep(50) call and put the entire contents of onClick into the run method of the following code:
AsyncTask.execute(new Runnable() {
    @Override
    public void run() {
        // talk to the device here
    }
});
Then, ask a different question about the quick writes crashing your connection.

What could cause a java process to get gradually decreasing share of CPU?

I have a very simple Java program that prints out 1 million random numbers. On Linux, I observed the %CPU that this program takes during its lifespan: it starts off at 98%, then gradually decreases to 2%, causing the program to become very slow. What are some of the factors that might cause the program to gradually get less CPU time?
I've tried running it with nice -20 but I still see the same results.
EDIT: running the program with /usr/bin/time -v I'm seeing an unusual amount of involuntary context switches (588 voluntary vs 16478 involuntary), which suggests that the OS is letting some other higher priority process run.
It boils down to two things:
I/O is expensive, and
Depending on how you're storing the numbers as you go along, that can have an adverse effect on performance as well.
If you're mainly doing System.out.println(randInt) in a loop a million times, then that can get expensive. I/O isn't one of those things that comes for free, and writing to any output stream costs resources.
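To illustrate the point, here is a rough sketch of buffering console output so a million println calls don't each hit the underlying stream. The buffer size and the use of Random are assumptions, not taken from the question:
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.Random;

public class BufferedPrintSketch {
    public static void main(String[] args) {
        Random random = new Random();
        // Buffer console output so each println doesn't reach the OS on its own.
        PrintWriter out = new PrintWriter(
                new BufferedWriter(new OutputStreamWriter(System.out), 1 << 16));
        for (int i = 0; i < 1_000_000; i++) {
            out.println(random.nextInt());
        }
        out.flush(); // push whatever is still sitting in the buffer
    }
}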
I would start by profiling via JConsole or VisualVM to see what it's actually doing when the CPU % is low. As mentioned in the comments, there's a high chance it's blocking, e.g. waiting for I/O (user input, an SQL query taking a long time, etc.).
If your application is I/O bound (for example, waiting for responses from network calls or for disk reads/writes), it will spend most of its time blocked rather than using the CPU, so a low CPU share is expected.
If you want to try and balance everything, you should create a queue to hold numbers to print, then have one thread generate them (the producer) and the other read and print them (the consumer). This can easily be done with a LinkedBlockingQueue.
public class PrintQueueExample {
    // static so that both main and PrinterThread can reach it
    private static final BlockingQueue<Integer> printQueue = new LinkedBlockingQueue<Integer>();

    public static void main(String[] args) throws InterruptedException {
        PrinterThread thread = new PrinterThread();
        thread.start();
        for (int i = 0; i < 1000000; i++) {
            int toPrint = ...(i); // elided in the original: produce the i-th number here
            printQueue.put(Integer.valueOf(toPrint));
        }
        thread.interrupt();
        thread.join();
        System.out.println("Complete");
    }

    private static class PrinterThread extends Thread {
        @Override
        public void run() {
            try {
                while (true) {
                    Integer toPrint = printQueue.take();
                    System.out.println(toPrint);
                }
            } catch (InterruptedException e) {
                // Interruption comes from main and means number generation has stopped.
                // Print the remaining numbers and let the thread finish.
                List<Integer> remainingNumbers = new ArrayList<Integer>();
                printQueue.drainTo(remainingNumbers);
                for (Integer toPrint : remainingNumbers)
                    System.out.println(toPrint);
            }
        }
    }
}
There may be a few problems with this code, but this is the gist of it.

Read file at a certain rate in Java

Is there an article/algorithm on how I can read a long file at a certain rate?
Say I do not want to exceed 10 KB/s while issuing reads.
A simple solution is to create a ThrottledInputStream.
This should be used like this:
final InputStream slowIS = new ThrottledInputStream(new BufferedInputStream(new FileInputStream("c:\\file.txt"),8000),300);
300 is the number of kilobytes per second. 8000 is the block size for BufferedInputStream.
This should of course be generalized by implementing read(byte b[], int off, int len), which will spare you a ton of System.currentTimeMillis() calls. As written, System.currentTimeMillis() is called once for each byte read, which causes a bit of overhead. It should also be possible to store the number of bytes that can safely be read without calling System.currentTimeMillis(). (A sketch of such a bulk-read override follows the class below.)
Be sure to put a BufferedInputStream in between, otherwise the FileInputStream will be polled in single bytes rather than blocks. This reduces the CPU load from 10% to almost 0. You will risk exceeding the data rate by up to the number of bytes in the block size.
import java.io.IOException;
import java.io.InputStream;

public class ThrottledInputStream extends InputStream {
    private final InputStream rawStream;
    private long totalBytesRead;
    private long startTimeMillis;

    private static final int BYTES_PER_KILOBYTE = 1024;
    private static final int MILLIS_PER_SECOND = 1000;
    private final int ratePerMillis;

    public ThrottledInputStream(InputStream rawStream, int kBytesPersecond) {
        this.rawStream = rawStream;
        ratePerMillis = kBytesPersecond * BYTES_PER_KILOBYTE / MILLIS_PER_SECOND;
    }

    @Override
    public int read() throws IOException {
        if (startTimeMillis == 0) {
            startTimeMillis = System.currentTimeMillis();
        }
        long now = System.currentTimeMillis();
        long interval = now - startTimeMillis;
        // see if we are too fast..
        if (interval * ratePerMillis < totalBytesRead + 1) { // +1 because we are reading 1 byte
            try {
                // how long it should have taken to read this many bytes, minus the time already elapsed
                final long sleepTime = (totalBytesRead + 1) / ratePerMillis - interval;
                Thread.sleep(Math.max(1, sleepTime));
            } catch (InterruptedException e) {
                // never realized what that is good for :)
            }
        }
        totalBytesRead += 1;
        return rawStream.read();
    }
}
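As a sketch of the generalization mentioned above, here is one possible read(byte[], int, int) override that could be added inside the ThrottledInputStream class. It reuses the class's fields and checks the clock once per block instead of once per byte; it is an illustration, not part of the original answer:
@Override
public int read(byte[] b, int off, int len) throws IOException {
    if (startTimeMillis == 0) {
        startTimeMillis = System.currentTimeMillis();
    }
    long interval = System.currentTimeMillis() - startTimeMillis;
    // Wait until the whole block fits under the allowed rate.
    long sleepTime = (totalBytesRead + len) / ratePerMillis - interval;
    if (sleepTime > 0) {
        try {
            Thread.sleep(sleepTime);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
    int read = rawStream.read(b, off, len);
    if (read > 0) {
        totalBytesRead += read;
    }
    return read;
}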
The crude solution is just to read a chunk at a time and then sleep, e.g. read 10 KB and then sleep for a second. But the first question I have to ask is: why? There are a couple of likely answers:
You don't want to create work faster than it can be done; or
You don't want to create too great a load on the system.
My suggestion is not to control it at the read level. That's kind of messy and inaccurate. Instead control it at the work end. Java has lots of great concurrency tools to deal with this. There are a few alternative ways of doing this.
I tend to like using a producer/consumer pattern for solving this kind of problem. It gives you great options, such as being able to monitor progress with a reporting thread, and it can be a really clean solution.
Something like an ArrayBlockingQueue can be used for the kind of throttling needed for both (1) and (2). With a limited capacity, the reader will block when the queue is full, so it won't read too far ahead. The workers (consumers) can be paced to only work so fast, which also throttles the rate, covering (2).
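A rough sketch of that bounded-queue idea follows. The chunk size, queue capacity, end-of-file marker, and the process step are all assumptions for illustration:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedQueueReaderSketch {
    public static void main(String[] args) throws Exception {
        // Small capacity: the reader blocks on put() once the worker falls behind.
        BlockingQueue<byte[]> chunks = new ArrayBlockingQueue<>(8);

        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    byte[] chunk = chunks.take();
                    if (chunk.length == 0) break; // empty chunk used as end-of-file marker
                    process(chunk);               // hypothetical slow work
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            byte[] buf = new byte[10 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                byte[] chunk = new byte[n];
                System.arraycopy(buf, 0, chunk, 0, n);
                chunks.put(chunk); // blocks when the queue is full
            }
        }
        chunks.put(new byte[0]); // signal end of file
        worker.join();
    }

    private static void process(byte[] chunk) {
        // placeholder for the real work on each chunk
    }
}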
while !EOF
store System.currentTimeMillis() + 1000 (1 sec) in a long variable
read a 10K buffer
check if the stored time has passed
if it hasn't, Thread.sleep() for (stored time - current time)
Creating ThrottledInputStream that takes another InputStream as suggested would be a nice solution.
If you have used Java I/O then you should be familiar with decorating streams. I suggest an InputStream subclass that takes another InputStream and throttles the flow rate. (You could subclass FileInputStream but that approach is highly error-prone and inflexible.)
Your exact implementation will depend upon your exact requirements. Generally you will want to note the time your last read returned (System.nanoTime). On the current read, after the underlying read, wait until sufficient time has passed for the amount of data transferred. A more sophisticated implementation may buffer and return (almost) immediately with only as much data as the rate dictates (be careful: you should only return a read length of 0 if the requested length was zero).
You can use a RateLimiter (Guava's, for example) and make your own implementation of read in InputStream. An example of this can be seen below:
import java.io.IOException;
import java.io.InputStream;

import com.google.common.util.concurrent.RateLimiter; // Guava's RateLimiter

public class InputStreamFlow extends InputStream {
    private final InputStream inputStream;
    private final RateLimiter maxBytesPerSecond;

    public InputStreamFlow(InputStream inputStream, RateLimiter limiter) {
        this.inputStream = inputStream;
        this.maxBytesPerSecond = limiter;
    }

    @Override
    public int read() throws IOException {
        maxBytesPerSecond.acquire(1);
        return inputStream.read();
    }

    @Override
    public int read(byte[] b) throws IOException {
        maxBytesPerSecond.acquire(b.length);
        return inputStream.read(b);
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        maxBytesPerSecond.acquire(len);
        return inputStream.read(b, off, len);
    }
}
If you want to limit the flow to 1 MB/s, you can get the input stream like this:
final RateLimiter limiter = RateLimiter.create(1024 * 1024); // permits are bytes here, so 1 MB/s
final InputStreamFlow inputStreamFlow = new InputStreamFlow(originalInputStream, limiter);
It depends a little on whether you mean "don't exceed a certain rate" or "stay close to a certain rate."
If you mean "don't exceed", you can guarantee that with a simple loop:
while not EOF do
read a buffer
Thread.sleep(time)
write the buffer
od
The amount of time to wait is a simple function of the size of the buffer; if the buffer size is 10K bytes, you want to wait a second between reads.
If you want to get closer than that, you probably need to use a timer.
create a Runnable to do the reading
create a Timer with a TimerTask to do the reading
schedule the TimerTask n times a second.
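A rough sketch of that timer approach, assuming a 1 KB read every 100 ms to land near the 10 KB/s target; the class name and the handle step are placeholders:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Timer;
import java.util.TimerTask;

public class TimedReaderSketch {
    public static void main(String[] args) throws IOException {
        InputStream in = new FileInputStream(args[0]);
        byte[] buf = new byte[1024]; // 1 KB per tick
        Timer timer = new Timer();

        // Ten reads per second * 1 KB per read is roughly 10 KB/s.
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                try {
                    int n = in.read(buf);
                    if (n == -1) { // end of file: stop the timer
                        timer.cancel();
                        in.close();
                        return;
                    }
                    handle(buf, n); // hypothetical downstream processing
                } catch (IOException e) {
                    timer.cancel();
                }
            }
        }, 0, 100); // every 100 ms
    }

    private static void handle(byte[] buf, int n) {
        // placeholder for whatever consumes the data
    }
}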
If you're concerned about the speed at which you're passing the data on to something else, instead of controlling the read, put the data into a data structure like a queue or circular buffer, and control the other end; send data periodically. You need to be careful with that, though, depending on the data set size and such, because you can run into memory limitations if the reader is very much faster than the writer.
