How can I reduce the run time of this code? - java

InetAddress localhost;
try {
localhost = InetAddress.getLocalHost();
} catch (UnknownHostException ex) {
// without a local address there is no subnet to scan
throw new IllegalStateException("Cannot determine the local address", ex);
}
byte[] ip = localhost.getAddress();
for (int i = 1; i <= 254; i++) {
ip[3] = (byte) i;
InetAddress address;
try {
address = InetAddress.getByAddress(ip);
} catch (UnknownHostException ex) {
continue; // only thrown for an address of illegal length
}
// getHostName() may block on a reverse-DNS lookup
String hostName = address.getHostName();
if (!address.getHostAddress().equals(hostName)) {
list.addElement(hostName);
}
}
(My problem is that the run time is long. How can I reduce the run time of this code?)

I had a similar issue involving network lookups for IP addresses. The network latency is just as "that other guy" said: it's driven by the network, i.e. how long it takes and how many hops are needed to reach the destination.
The only solution I found was threading out the lookups. In your case that is the reverse lookup performed by address.getHostName(); InetAddress.getByAddress(ip) itself does not query any name service. My solution was to set up an ExecutorService with 10 threads, package each lookup into a Callable, monitor the Callables for completion, and package and start another one as each finished. Have a look at one of my questions on this forum related to this very issue:
ExecutorService - How to set values in a Runnable/Callable and reuse
Be cautious with the ExecutorService (as I found out). The number of threads really depends on the number of CPUs (processing power) of your runtime hardware. Too many threads and it'll grind to a halt (trust me on this); too few and you may not get the time reduction you're after.
I've left the department in my company where I implemented the final solution, so I don't have the code readily available, but the link above provides some basic code for the ExecutorService and for using Callable objects.
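A minimal sketch of the thread-pool approach described above, assuming an IPv4 /24 subnet and the 10-thread pool size the answer mentions. Each Callable performs one reverse lookup, and the futures are read back in submission order. The class name SubnetScan and the pluggable lookup parameter are illustrative additions (the parameter makes the code testable without a network); in real use you would pass InetAddress::getHostName, as main does.

```java
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class SubnetScan {
    // Resolves hosts .1 to .254 in the /24 of `base` (IPv4 assumed) on a fixed
    // pool of `threads` workers and returns the names that differ from the bare IP.
    static List<String> scan(byte[] base, int threads, Function<InetAddress, String> lookup)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 1; i <= 254; i++) {
                byte[] ip = base.clone(); // each task gets its own copy of the address
                ip[3] = (byte) i;
                futures.add(pool.submit(() -> {
                    InetAddress address = InetAddress.getByAddress(ip); // no network access
                    String name = lookup.apply(address);                  // the potentially slow lookup
                    return name.equals(address.getHostAddress()) ? null : name;
                }));
            }
            List<String> names = new ArrayList<>();
            for (Future<String> f : futures) {
                String name = f.get(); // waits for that particular lookup
                if (name != null) {
                    names.add(name);
                }
            }
            return names;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] base = InetAddress.getLocalHost().getAddress();
        for (String name : scan(base, 10, InetAddress::getHostName)) {
            System.out.println(name);
        }
    }
}
```

Because each task clones the base address, no two tasks share the same byte[]; the original single-threaded loop could reuse one array only because it handled one address at a time.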

Related

Nearly no performance gain between single and multiple consumers using LMAX Disruptor / how to decode many UDP packets properly

I have to transfer large files (up to 10 GB) using UDP. Unfortunately TCP cannot be used in this use case because bidirectional communication between sender and receiver is not possible.
Sending a file is not the problem. I have written the client using netty. It reads the file, encodes it (unique ID, position in stream and so on) and sends it to the destination at a configurable rate (packets per second). All packets are received at the destination; I have used iptables and Wireshark to verify that.
The problem occurs on the recipient's side. Receiving up to 90K packets a second works fine, but receiving and decoding them at this rate is not possible with a single thread.
My first approach was to use thread-safe queues (one producer and multiple consumers), but using multiple consumers did not lead to better results; some packets were still lost. It seems that the overhead (locking/unlocking the queue) slows down the process. So I decided to use the LMAX Disruptor with a single producer (receiving the UDP datagrams) and multiple consumers (decoding the packets). Surprisingly, this does not succeed either: there is hardly any speed advantage in using two LMAX consumers, and I wonder why.
This is the main part, which receives the UDP packets and calls the Disruptor:
public void receiveUdpStream(DatagramChannel channel) {
boolean exit = false;
// the size of the UDP datagram
int size = shareddata.cr.getDatagramsize();
// the number of decoders (configurable)
int nn_decoders = shareddata.cr.getDecoders();
Udp2flowEventFactory factory = new Udp2flowEventFactory(size);
// the size of the ringbuffer
int bufferSize = 1 << 10;
Disruptor<Udp2flowEvent> disruptor = new Disruptor<>(
factory,
bufferSize,
DaemonThreadFactory.INSTANCE,
ProducerType.SINGLE,
new YieldingWaitStrategy());
// my consumers
Udp2flowDecoder decoder[] = new Udp2flowDecoder[nn_decoders];
for (int i = 0; i < nn_decoders; i++) {
decoder[i] = new Udp2flowDecoder(i, shareddata);
}
disruptor.handleEventsWith(decoder);
RingBuffer<Udp2flowEvent> ringBuffer = disruptor.getRingBuffer();
Udp2flowProducer producer = new Udp2flowProducer(ringBuffer);
disruptor.start();
while (!exit) {
try {
ByteBuffer buf = ByteBuffer.allocate(size);
channel.receive(buf);
receivedDatagrams++; // counting the received packets
buf.flip();
producer.onData(buf);
} catch (Exception e) {
logger.debug("got exception " + e);
exit = true;
}
}
}
My LMAX event is simple:
public class Udp2flowEvent {
ByteBuffer buf;
Udp2flowEvent(int size) {
this.buf = ByteBuffer.allocateDirect(size);
}
public void set(ByteBuffer buf) {
this.buf = buf;
}
public ByteBuffer getEvent() {
return this.buf;
}
}
And this is my factory
public class Udp2flowEventFactory implements EventFactory<Udp2flowEvent> {
private int size;
Udp2flowEventFactory(int size) {
super();
this.size = size;
}
public Udp2flowEvent newInstance() {
return new Udp2flowEvent(size);
}
}
The producer ...
public class Udp2flowProducer {
private final RingBuffer<Udp2flowEvent> ringBuffer;
public Udp2flowProducer(RingBuffer<Udp2flowEvent> ringBuffer)
{
this.ringBuffer = ringBuffer;
}
public void onData(ByteBuffer buf)
{
long sequence = ringBuffer.next(); // Grab the next sequence
try
{
Udp2flowEvent event = ringBuffer.get(sequence);
event.set(buf);
}
finally
{
ringBuffer.publish(sequence);
}
}
}
The interesting but very simple part is the decoder. It looks like this.
public void onEvent(Udp2flowEvent event, long sequence, boolean endOfBatch) {
// each consumer decodes its packets
if (sequence % nn_decoders != decoderid) {
return;
}
ByteBuffer buf = event.getEvent();
event = null; // is it faster to null the event?
shareddata.increaseReceiveddatagrams();
// headertype
// some code omitted. But the code looks something like this
final int headertype = buf.getInt();
final int headerlength = buf.getInt();
final long payloadlength = buf.getLong();
// decoding int and longs works fine.
// but decoding the remaining part not!
byte[] payload = new byte[buf.remaining()];
buf.get(payload);
// some code omitted. The payload is used later on...
}
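To make the striping in onEvent concrete, here is a plain-Java model with no Disruptor dependency. It assumes, as with handlers passed together to handleEventsWith, that every handler is offered every sequence number and keeps only those where sequence % n equals its id. StripingModel and simulate are hypothetical names. The model shows that the decode work is divided while every handler still walks the full sequence; the receive loop in receiveUdpStream (one thread calling channel.receive and ByteBuffer.allocate per datagram) is not divided at all.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

public class StripingModel {
    // Returns, per handler, how many events it decoded; throws if any
    // sequence was decoded by zero or by more than one handler.
    static int[] simulate(int handlers, int events) throws InterruptedException {
        AtomicIntegerArray timesDecoded = new AtomicIntegerArray(events);
        int[] decoded = new int[handlers];
        Thread[] threads = new Thread[handlers];
        for (int id = 0; id < handlers; id++) {
            final int myId = id;
            threads[id] = new Thread(() -> {
                // every handler walks the whole sequence range...
                for (long seq = 0; seq < events; seq++) {
                    // ...but skips events owned by other handlers (same test as onEvent)
                    if (seq % handlers != myId) {
                        continue;
                    }
                    decoded[myId]++;
                    timesDecoded.incrementAndGet((int) seq);
                }
            });
            threads[id].start();
        }
        for (Thread t : threads) {
            t.join(); // join also makes decoded[] visible to this thread
        }
        for (int seq = 0; seq < events; seq++) {
            if (timesDecoded.get(seq) != 1) {
                throw new IllegalStateException("sequence " + seq + " decoded " + timesDecoded.get(seq) + " times");
            }
        }
        return decoded;
    }

    public static void main(String[] args) throws InterruptedException {
        int events = 1_000_000;
        int[] perHandler = simulate(4, events);
        for (int id = 0; id < perHandler.length; id++) {
            System.out.println("handler " + id + " saw " + events + " events, decoded " + perHandler[id]);
        }
    }
}
```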
And here are some interesting facts:
All decoders work well; I can see the configured number of decoders running.
All packets are received, but decoding takes too long. More precisely, decoding the first two ints and the long value works fine, but decoding the payload takes too long. This leads to backpressure, and some packets are lost.
Fun fact: the code works fine on my MacBook Air but not on my server (MacBook: Core i7; server: ESXi with 8 virtual cores on a Xeon @ 2.6 GHz and no load at all).
Now my questions, and I hope that somebody has an idea:
Why does it hardly make a difference to use several consumers? The difference is only 5%.
In general: what is the best way to receive 60K (or more) UDP packets a second and decode them? I tried netty as the receiver, but UDP does not scale very well.
Why is decoding the payload so slow?
Are there any errors that I have overlooked?
Should I use another producer/consumer library? LMAX has very low latency, but what about throughput?
Ring buffers don't seem like the right technology for this problem: when a ring buffer has filled its capacity it blocks, and it is an inherently sequential architecture. You need to know in advance the highest number of packets to expect and size the buffer for that. Also, UDP is lossy unless you implement a message-assurance protocol.
Not sure why you say TCP is not bidirectional; it is, and it takes care of lost packets.
To cope with data flooding, you may need to distribute the incoming packets to separate servers if a single one is insufficient. A queue should work to absorb a flood of data. You may need a massive number of waiting decoders if you want to process this volume of data in near real time.
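A minimal sketch of the queue-plus-decoders idea from this answer, using a bounded java.util.concurrent.ArrayBlockingQueue, a configurable number of decoder threads, and one poison-pill buffer per decoder to shut them down. The datagram layout (int headertype, int headerlength, long payloadlength, payload) is taken from the decoder snippet in the question; the omitted decoding steps are not reproduced. Fake datagrams stand in for channel.receive, and the queue capacity of 8192 is an arbitrary assumption.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class QueueDecoders {
    // End-of-stream marker, compared by identity; it is never decoded.
    static final ByteBuffer POISON = ByteBuffer.allocate(0);

    // Decodes one datagram in the question's layout and returns the payload size.
    static int decode(ByteBuffer buf) {
        int headertype = buf.getInt();      // unused here; drives the omitted decoding
        int headerlength = buf.getInt();    // unused here
        long payloadlength = buf.getLong(); // unused here
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        return payload.length;
    }

    // Feeds `packets` fake datagrams through the queue; returns total payload bytes decoded.
    static long run(int decoders, int packets, int payloadSize) throws InterruptedException {
        BlockingQueue<ByteBuffer> queue = new ArrayBlockingQueue<>(8192); // bounded: absorbs bursts
        AtomicLong payloadBytes = new AtomicLong();
        Thread[] workers = new Thread[decoders];
        for (int i = 0; i < decoders; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        ByteBuffer buf = queue.take();
                        if (buf == POISON) {
                            return; // one pill per decoder
                        }
                        payloadBytes.addAndGet(decode(buf));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        // Stand-in for the channel.receive(...) loop.
        for (int p = 0; p < packets; p++) {
            ByteBuffer buf = ByteBuffer.allocate(16 + payloadSize);
            buf.putInt(1).putInt(16).putLong(payloadSize).put(new byte[payloadSize]);
            buf.flip();
            queue.put(buf); // blocks while the queue is full
        }
        for (int i = 0; i < decoders; i++) {
            queue.put(POISON);
        }
        for (Thread w : workers) {
            w.join();
        }
        return payloadBytes.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, 100_000, 1000) + " payload bytes decoded");
    }
}
```

Note that a bounded queue only moves the pressure: when it is full, the receiving thread blocks in put, and the kernel's UDP receive buffer has to absorb the burst instead.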
Suggest you use TCP.

How to investigate Java socket program performance issue

I have two variations of the same set of Java programs, [Server.java and Client.java] and [ServerTest.java and ClientTest.java]. They both do the same thing: the client connects to the server and sends pairs of integers to be multiplied, and the result is returned to the client, where it is printed. This is performed 100 times in each version.
However, in the Test version, I create and close a new socket for every integer pair that is sent and multiplied (100 multiplications are performed). In the normal version, I open a single persistent socket, perform all interaction with the client, and close it afterwards.
Intuitively, I thought the approach with one persistent socket would be a little faster than creating, accepting and closing a socket each time. In reality, the approach where a new socket is created, accepted and closed each time is noticeably faster: on average, the persistent socket approach takes around 8 seconds, whereas the approach that creates a new socket every time takes around 0.4 seconds.
I checked the system call activity of both and noticed nothing different between the two. I then tested the same programs on another computer (macOS Sierra) and there was a negligible difference between the two. So it seems the problem doesn't even lie in the application code but in how it interacts with the OS (I'm running Ubuntu 16.04 LTS).
Does anyone know why there is such a difference in performance here, or how the issue could be investigated further? I've also checked system-wide metrics (memory usage and CPU usage) while executing the programs, and there seems to be plenty of memory and the CPUs have plenty of idle time.
See the code snippet of how both approaches differ below:
Creating new socket every time approach:
// this is called one hundred times
public void listen() {
try {
while (true) {
// Listens for a connection to be made to this socket.
Socket socket = my_serverSocket.accept();
DataInputStream in = new DataInputStream(socket
.getInputStream());
// Read in the numbers
int numberOne = in.readInt();
int numberTwo = in.readInt();
int result = numberOne * numberTwo;
DataOutputStream out = new DataOutputStream(socket.getOutputStream());
out.writeInt(result);
// tidy up
socket.close();
}
} catch (IOException ioe) {
ioe.printStackTrace();
} catch (SecurityException se) {
se.printStackTrace();
}
}
Persistent socket approach:
public void listen() {
try {
while (true) {
// Listens for a connection to be made to this socket.
Socket socket = my_serverSocket.accept();
for (int i = 0; i < 100; i++) {
DataInputStream in = new DataInputStream(socket
.getInputStream());
// Read in the numbers
int numberOne = in.readInt();
int numberTwo = in.readInt();
int result = numberOne * numberTwo;
DataOutputStream out = new DataOutputStream(socket.getOutputStream());
out.writeInt(result);
}
// tidy up
socket.close();
}
} catch (IOException ioe) {
ioe.printStackTrace();
} catch (SecurityException se) {
se.printStackTrace();
}
}
You didn't show us the code that sends the integers for multiplication. Do you happen to have a loop in it where, in each iteration, you send a pair and receive the result? If so, make sure to turn off Nagle's algorithm.
Nagle's algorithm tries to overcome the "small-packet problem", i.e. an application repeatedly emitting data in small chunks. This leads to huge overhead, since the packet header is often much larger than the data itself. The algorithm essentially combines a number of small outgoing messages and sends them all at once. While earlier data is still unacknowledged, a new small message is held back until the acknowledgement arrives, and the peer may deliberately delay that acknowledgement (delayed ACK), so in practice a timeout has to elapse.
In your case, you were writing small chunks of data to the socket on both the client and the server side. The data wasn't transmitted immediately; rather, the socket waited for more data to come (which didn't), so each time a timeout had to elapse.
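A minimal sketch of the fix, assuming both ends are Java and talk over loopback: Socket.setTcpNoDelay(true) disables Nagle's algorithm on each end. The sketch also wraps the streams in buffered streams and calls flush() once per message so each request and reply leaves as a single write; that buffering is an extra choice, not something the answer prescribes. It keeps one persistent connection for all 100 multiplications; NoDelayMultiply is a hypothetical name.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class NoDelayMultiply {
    // Performs `pairs` multiplications over ONE persistent loopback connection,
    // with Nagle's algorithm disabled on both ends.
    static int[] run(int pairs) throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) { // port 0 = any free port
            Thread serverThread = new Thread(() -> {
                try (Socket socket = server.accept()) {
                    socket.setTcpNoDelay(true); // server side: send replies immediately
                    DataInputStream in = new DataInputStream(new BufferedInputStream(socket.getInputStream()));
                    DataOutputStream out = new DataOutputStream(new BufferedOutputStream(socket.getOutputStream()));
                    for (int i = 0; i < pairs; i++) {
                        int numberOne = in.readInt();
                        int numberTwo = in.readInt();
                        out.writeInt(numberOne * numberTwo);
                        out.flush(); // one write per reply
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            serverThread.start();

            int[] results = new int[pairs];
            try (Socket socket = new Socket(loopback, server.getLocalPort())) {
                socket.setTcpNoDelay(true); // client side: send requests immediately
                DataInputStream in = new DataInputStream(new BufferedInputStream(socket.getInputStream()));
                DataOutputStream out = new DataOutputStream(new BufferedOutputStream(socket.getOutputStream()));
                for (int i = 0; i < pairs; i++) {
                    out.writeInt(i);
                    out.writeInt(i + 1);
                    out.flush(); // both ints leave in one write
                    results[i] = in.readInt();
                }
            }
            serverThread.join();
            return results;
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        int[] results = run(100);
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("last result " + results[99] + ", took " + ms + " ms");
    }
}
```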
Actually, the only difference between these two pieces of code is NOT how they handle incoming connections (with one persistent socket or not); the difference is that in the one you call "persistent", 100 pairs of numbers are multiplied, whereas in the other one only one pair of numbers is multiplied and then returned. This could explain the difference in time.

Java: Sending messages to a JMS queue with multiple threads

I am trying to write a Java class to both send and read messages from a JMS queue using multiple threads to speed things up. I have the below code.
System.out.println("Sending messages");
long startTime = System.nanoTime();
Thread threads[] = new Thread[NumberOfThreads];
for (int i = 0; i < threads.length; i ++) {
threads[i] = new Thread() {
public void run() {
try {
for (int i = 0; i < NumberOfMessagesPerThread; i ++) {
sendMessage("Hello");
}
} catch (Exception e) {
e.printStackTrace();
}
}
};
threads[i].start();
}
//Block until all threads are done so we can get total time
for (Thread thread : threads) {
thread.join();
}
long endTime = System.nanoTime();
long duration = (endTime - startTime) / 1000000;
System.out.println("Done in " + duration + " ms");
This code works and sends however many messages to my JMS queue that I say (via NumberOfThreads and NumberOfMessagesPerThread). However, I am not convinced it is truly working multithreaded. For example, if I set threads to 10 and messages to 100 (so 1000 total messages), it takes the same time as 100 threads and 10 messages each. Even this code below takes the same time.
for (int i = 0; i < 1000; i ++) {
sendMessage("Hello");
}
Am I doing the threading right? I would expect the multithreaded code to be much faster than just a plain for loop.
Are you sharing a single connection (a single producer) across all threads? If so, you are probably hitting some thread contention there, and you are limited by the speed of the socket connection between your producer and your broker. Of course, it depends a lot on the JMS implementation you are using (and on whether you are using async sends).
I would recommend repeating your tests using completely separate producers (although you will lose the "queue" semantics in terms of message ordering, but I guess that is expected).
Also, I do not recommend running performance tests with numbers as high as 100 threads. Remember that your multithreading capability is at some point limited by the number of cores your machine has (more or less; you also have a lot of I/O here, so it might help to have a few more threads than cores, but 100 is not really a good number in my opinion).
I would also review some of the comments in this post: Single vs Multi-threaded JMS Producer
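To illustrate the contention point, here is a self-contained simulation rather than real JMS code: the hypothetical Sender class stands in for a producer whose send() is serialized and costs about 1 ms. With one shared Sender the 10 threads effectively take turns; with one Sender per thread they can overlap. In actual JMS code the per-thread variant corresponds to giving each thread its own Session and MessageProducer, since the JMS specification treats a Session as a single-threaded context while a Connection may be shared.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SharedVsPerThread {
    // Hypothetical stand-in for a producer whose send() is serialized
    // (e.g. by a lock around one connection) and takes about 1 ms.
    static class Sender {
        private final AtomicInteger sent;

        Sender(AtomicInteger sent) {
            this.sent = sent;
        }

        synchronized void send(String text) throws InterruptedException {
            Thread.sleep(1); // pretend network round trip
            sent.incrementAndGet();
        }
    }

    // Runs `threads` x `perThread` sends and returns the elapsed milliseconds.
    static long time(int threads, int perThread, boolean shared, AtomicInteger sent) throws InterruptedException {
        Sender common = new Sender(sent);
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            Sender mine = shared ? common : new Sender(sent); // one sender per thread when !shared
            workers[t] = new Thread(() -> {
                try {
                    for (int i = 0; i < perThread; i++) {
                        mine.send("Hello");
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger sent = new AtomicInteger();
        System.out.println("shared sender:      " + time(10, 100, true, sent) + " ms");
        System.out.println("sender per thread:  " + time(10, 100, false, sent) + " ms");
        System.out.println("messages sent:      " + sent.get());
    }
}
```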
What is the implementation of sendMessage? How are the connections, sessions, and producers being reused?

What could cause a java process to get gradually decreasing share of CPU?

I have a very simple Java program that prints out 1 million random numbers. In Linux, I observed the %CPU this program takes during its lifespan: it starts off at 98% and then gradually decreases to 2%, making the program very slow. What factors might cause the program to gradually get less CPU time?
I've tried running it with nice -20 but I still see the same results.
EDIT: running the program with /usr/bin/time -v I'm seeing an unusual amount of involuntary context switches (588 voluntary vs 16478 involuntary), which suggests that the OS is letting some other higher priority process run.
It boils down to two things:
I/O is expensive, and
Depending on how you're storing the numbers as you go along, that can have an adverse effect on performance as well.
If you're mainly doing System.out.println(randInt) in a loop a million times, then that can get expensive. I/O isn't one of those things that comes for free, and writing to any output stream costs resources.
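If the program really is one println per number, one way to reduce the I/O cost is to buffer the output and flush once at the end. A minimal sketch, assuming plain random ints are being printed; the seed and the 64 KiB buffer size are arbitrary choices, and BufferedNumbers is a hypothetical name.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.Random;

public class BufferedNumbers {
    // Writes `count` random ints, one per line, through a buffer and flushes once.
    static void writeNumbers(Writer target, int count, long seed) throws IOException {
        Random random = new Random(seed);
        BufferedWriter out = new BufferedWriter(target, 1 << 16); // 64 KiB buffer
        for (int i = 0; i < count; i++) {
            out.write(Integer.toString(random.nextInt()));
            out.newLine();
        }
        out.flush(); // a single flush at the end instead of per line
    }

    public static void main(String[] args) throws IOException {
        writeNumbers(new OutputStreamWriter(System.out), 1_000_000, 42L);
    }
}
```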
I would start by profiling via JConsole or VisualVM to see what it's actually doing when it has low CPU %. As mentioned in comments there's a high chance it's blocking, e.g. waiting for IO (user input, SQL query taking a long time, etc.)
If your application is I/O bound, for example waiting for responses from network calls or for disk reads/writes, its CPU usage will be low.
If you want to try and balance everything, you should create a queue to hold numbers to print, then have one thread generate them (the producer) and the other read and print them (the consumer). This can easily be done with a LinkedBlockingQueue.
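import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
public class PrintQueueExample {
// static so that both main and the static PrinterThread class can reach it
private static final BlockingQueue<Integer> printQueue = new LinkedBlockingQueue<Integer>();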
public static void main(String[] args) throws InterruptedException {
PrinterThread thread = new PrinterThread();
thread.start();
for (int i = 0; i < 1000000; i++) {
int toPrint = ...(i) ;
printQueue.put(Integer.valueOf(toPrint));
}
thread.interrupt();
thread.join();
System.out.println("Complete");
}
private static class PrinterThread extends Thread {
@Override
public void run() {
try {
while (true) {
Integer toPrint = printQueue.take();
System.out.println(toPrint);
}
} catch (InterruptedException e) {
// Interruption comes from main, means processing numbers has stopped
// Finish remaining numbers and stop thread
List<Integer> remainingNumbers = new ArrayList<Integer>();
printQueue.drainTo(remainingNumbers);
for (Integer toPrint : remainingNumbers)
System.out.println(toPrint);
}
}
}
}
There may be a few problems with this code, but this is the gist of it.

Cyclic barrier Java, How to verify?

I am preparing for interviews and just want to prepare some basic threading examples and structures so that I can use them during my white board coding if I have to.
I was reading about CyclicBarrier and was just trying my hands at it, so I wrote a very simple code:
import java.util.concurrent.CyclicBarrier;
public class Threads
{
/**
* @param args
*/
public static void main(String[] args)
{
// ******************************************************************
// Using CyclicBarrier to make all threads wait at a point until all
// threads reach there
// ******************************************************************
barrier = new CyclicBarrier(N);
for (int i = 0; i < N; ++i)
{
new Thread(new CyclicBarrierWorker()).start();
}
// ******************************************************************
}
static class CyclicBarrierWorker implements Runnable
{
public void run()
{
try
{
long id = Thread.currentThread().getId();
System.out.println("I am thread " + id + " and I am waiting for my friends to arrive");
// Do Something in the Thread
Thread.sleep(1000*(int)(4*Math.random()*10));
// Now Wait till all the thread reaches this point
barrier.await();
}
catch (Exception e)
{
e.printStackTrace();
}
//Now do whatever else after all threads are released
long id1 = Thread.currentThread().getId();
System.out.println("Thread:"+id1+" We all got released ..hurray!!");
System.out.println("We all got released ..hurray!!");
}
}
final static int N = 4;
static CyclicBarrier barrier = null;
}
You can copy and paste it as is and run it.
What I want to verify is that indeed all threads wait at this point in code:
barrier.await();
I put some wait and was hoping that I would see 4 statements appear one after other in a sequential fashion on the console, followed by 'outburst' of "released..hurray" statement. But I am seeing outburst of all the statements together no matter what I select as the sleep.
Am I missing something here ?
Thanks
P.S: Is there an online editor like http://codepad.org/F01xIhLl where I can just put Java code and hit a button to run a throw away code ? . I found some which require some configuration before I can run any code.
The code looks fine, but it might be more enlightening to write to System.out before the sleep. Consider this in run():
long id = Thread.currentThread().getId();
System.out.println("I am thread " + id + " and I am waiting for my friends to arrive");
// Do Something in the Thread
Thread.sleep(1000*8);
On my machine, I still see a burst, but it is clear that the threads are blocked on the barrier.
If you want to avoid the first burst, use a random value in the sleep:
Thread.sleep(1000*(int)(8*Math.random()));
I put some wait and was hoping that I would see 4 statements appear one after other in a sequential fashion on the console, followed by 'outburst' of "released..hurray" statement. But I am seeing outburst of all the statements together no matter what I select as the sleep.
The behavior I'm observing is that all the created threads sleep for approximately the same amount of time. Remember that other threads can perform their work in the interim and will therefore get scheduled; since all the created threads sleep for the same amount of time, there is very little difference between the instants at which the System.out.println calls are invoked.
Edit: The other answer's suggestion of sleeping for a random amount of time will aid in understanding the barrier concept, because it makes it likely (to some extent) that multiple threads arrive at the barrier at different instants.
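To verify the barrier directly rather than eyeballing console bursts, one option is to record events in a shared, synchronized log and check the order afterwards. This sketch uses the CyclicBarrier(int, Runnable) constructor: the barrier action runs once, in the last thread to arrive, before any waiting thread is released. If the barrier works, every "arrived" entry precedes "tripped" and every "released" entry follows it. BarrierCheck and the 100 to 1000 ms sleep range are illustrative choices.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ThreadLocalRandom;

public class BarrierCheck {
    // Runs n workers that sleep a random time and then await one barrier;
    // returns the event log in the order the events happened.
    static List<String> run(int n) throws InterruptedException {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        // The action runs once, in the last arriving thread, before anyone is released.
        CyclicBarrier barrier = new CyclicBarrier(n, () -> log.add("tripped"));
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                try {
                    Thread.sleep(ThreadLocalRandom.current().nextInt(100, 1000)); // staggered arrival
                    log.add("arrived " + id);
                    barrier.await();
                    log.add("released " + id);
                } catch (Exception e) { // InterruptedException or BrokenBarrierException
                    throw new RuntimeException(e);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return log;
    }

    public static void main(String[] args) throws InterruptedException {
        run(4).forEach(System.out::println);
    }
}
```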
