I have a Java program which throws a 'Too many open files' error after running for about three minutes. Increasing the open-file limit doesn't help, because it still uses up the entire limit, just more slowly. So there is something wrong with my program and I need to find out what.
Here is what I did (10970 is the PID):
Checked the open files of the Java process with ls /proc/10970/fd and found that most of them are pipes.
Used lsof -p 10970 | grep FIFO to list all the pipes and found about 450 of them.
The pipe entries look like this:
java 10970 service 1w FIFO 0,8 0t0 5890 pipe
java 10970 service 2w FIFO 0,8 0t0 5890 pipe
java 10970 service 169r FIFO 0,8 0t0 2450696 pipe
java 10970 service 201r FIFO 0,8 0t0 2450708 pipe
But I don't know how to continue. The 0,8 in the output above is a device number. How can I find the device with that number?
Update
The program is a TCP server that receives socket connections from clients and processes messages. I have two environments. In the Production environment it works fine, but in the Test environment this issue has appeared recently. In the Production environment I don't see this many pipes. The code and infrastructure of the two environments are the same; both are managed by Chef.
But I don't know how to continue.
What you need to do is to identify the place or places in your Java code where you are opening these pipes ... and make sure that they are always closed when you are done with them.
The best way to ensure that the pipes are closed is to explicitly close them when you are done with them. For example (using input streams instead of sockets ...):
InputStream is = new FileInputStream("somefile.txt");
try {
    // Use file
} finally {
    is.close();
}
In Java 7 or later, you can write that more succinctly as:
try (InputStream is = new FileInputStream("somefile.txt")) {
    // Use file
}
In the latter, the InputStream object is automatically closed when the try completes ... in an implicit finally block.
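Applied to your TCP server, the same pattern keeps every accepted socket (and its streams) from leaking. This is only a minimal sketch; the class name, port and line-oriented echo logic are made up for illustration, so adapt it to your own accept loop and protocol:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class EchoServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(9000)) {   // 9000 is an arbitrary example port
            while (true) {
                // The accepted socket and its streams are resources of the inner try,
                // so their file descriptors are released even if processing throws.
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()));
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    String line = in.readLine();
                    out.println("echo: " + line);
                } catch (Exception e) {
                    // log the failure and keep accepting
                }
            }
        }
    }
}

The key point is that the accepted Socket is itself a try-with-resources resource, so its descriptor is released no matter how processing ends.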
0,8 in the output above means device numbers. How can I find devices with these numbers?
That is probably irrelevant to solving the problem. Focus on why the file descriptors are not being closed. Knowing what the device numbers mean doesn't help.
In Production environment I don't see so many pipes.
That's probably a red herring too. It could be caused by the GC running more frequently, and closing the orphaned file descriptors before they become a problem.
(But forcing the GC to run is not a solution. You should not rely on the GC to close file descriptors. It is inefficient and unreliable.)
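For what it's worth, pipes in a Java process's fd table commonly come from launching child processes (Runtime.exec / ProcessBuilder), since the child's stdin, stdout and stderr are pipes; un-closed Selectors can also hold pipe descriptors on some JDKs. If your server launches external commands, a sketch of the non-leaking pattern looks like this ("ls -l" is only a placeholder command):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class RunCommand {
    public static void main(String[] args) throws Exception {
        Process p = new ProcessBuilder("ls", "-l").redirectErrorStream(true).start();
        // Reading and closing the child's output releases the pipe fds;
        // leaving the stream unread and unclosed keeps them open until the
        // Process object is eventually garbage collected.
        try (BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line);
            }
        }
        p.waitFor();
    }
}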
Related
I have a Java program that runs (on Linux) for a while and then causes the server to lock up with "Too many files open".
After restarting the machine, I run the Java program again and then execute the lsof command against its PID. A large number of lines like the following are produced:
java 971 uknown 980u sock 0,9 0t0 20461 protocol: TCPv6
Does this mean the program is opening multiple TCP connections and not closing them?
What further steps can I take to troubleshoot this?
It means your program is opening file descriptors but not closing them. They may be sockets or file handles, so it is causing a resource leak. Make sure all file handles are closed so the objects are ready to be garbage collected.
As dan1st pointed out in a comment, one preventive measure to avoid this scenario is using try-with-resources. It works with any class that implements the AutoCloseable interface and makes sure the resource is closed automatically.
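For example, if the leaked descriptors are client sockets your program opens, a minimal sketch of that pattern looks like this (host, port and payload are placeholders, not anything from your program):

import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

public class SendOnce {
    // host, port and payload are placeholders for whatever your program sends.
    static void send(String host, int port, byte[] payload) throws IOException {
        try (Socket socket = new Socket(host, port);
             OutputStream out = socket.getOutputStream()) {
            out.write(payload);
            out.flush();
        } // the socket's fd is released here, whether or not write() threw
    }
}

The same applies to ServerSocket, accepted sockets, channels and selectors: anything that implements AutoCloseable can go in the try header.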
I need to receive and parse some SNMP traps (messages) and I would appreciate any advice on getting the code I have working on my OS X machine. I have been given some Java code that runs on Windows with net-snmp. I'd like to either get the Java code running on my development machine or whip up some Python code to do the same.
I was able to get the Java code to compile on my OS X machine and it runs without any complaints, including none of the exceptions I would expect if it were unable to bind to port 8255. However, it never reports receiving any SNMP traps, which makes me wonder whether it's really able to read from the socket. Here's what I gather to be the code from the Java program that binds to the socket:
DatagramChannel dgChannel1 = DatagramChannel.open();
Selector mux = Selector.open();
dgChannel1.socket().bind(new InetSocketAddress(8255));
dgChannel1.configureBlocking(false);
dgChannel1.register(mux, SelectionKey.OP_READ);
while (mux.select() > 0) {
    Iterator keyIt = mux.selectedKeys().iterator();
    while (keyIt.hasNext()) {
        SelectionKey key = (SelectionKey) keyIt.next();
        if (key.isReadable()) {
            /* processing */
        }
    }
}
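(For reference, a complete receive loop normally also removes each key from the selected set and calls receive() on the channel; here is a minimal self-contained sketch of that shape, not the original program:)

import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;

public class TrapListener {
    public static void main(String[] args) throws Exception {
        DatagramChannel channel = DatagramChannel.open();
        channel.socket().bind(new InetSocketAddress(8255));
        channel.configureBlocking(false);
        Selector mux = Selector.open();
        channel.register(mux, SelectionKey.OP_READ);

        ByteBuffer buf = ByteBuffer.allocate(65535);
        while (mux.select() > 0) {
            Iterator<SelectionKey> keyIt = mux.selectedKeys().iterator();
            while (keyIt.hasNext()) {
                SelectionKey key = keyIt.next();
                keyIt.remove();                     // clear the key so it is not reprocessed
                if (key.isReadable()) {
                    buf.clear();
                    SocketAddress sender = channel.receive(buf);
                    buf.flip();
                    System.out.println("got " + buf.remaining() + " bytes from " + sender);
                }
            }
        }
    }
}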
Since I don't know Java and like to mess around with Python, I installed libsnmp via easy_install and tried to get that working. The sample programs traplistener.py and trapsender.py have no problem talking to each other, but if I run traplistener.py waiting for my own SNMP traps I again fail to receive anything. I should note that I had to run the Python programs via sudo in order to have permission to access the sockets. Running the Java program via sudo had no effect.
All this makes me suspect that both programs are having problems with OS X and its sockets, perhaps their permissions. For instance, I had to change the permissions on the /dev/bpf devices for Wireshark to work. Another thought is that it has something to do with my machine having multiple network adapters enabled, including eth0 (Ethernet, where I see the trap messages thanks to Wireshark) and eth1 (wifi). Could this be the problem?
As you can see, I know very little about sockets or SNMP, so any help is much appreciated!
Update: Using lsof (sudo lsof -i -n -P, to be exact) it appears that my problem is that the Java program is only listening on IPv6, while the trap sender is using IPv4. I've tried disabling IPv6 (sudo ip6 -x) and telling Java to use IPv4 (java -jar bridge.jar -Djava.net.preferIPv4Stack=true), but I keep finding my program using IPv6. Any thoughts?
java 16444 peter 34u IPv6 0x12f3ad98 0t0 UDP *:8255
Update 2: Ok, I guess I had the java argument order wrong (JVM options have to come before -jar): java -Djava.net.preferIPv4Stack=true -jar bridge.jar puts the program on IPv4. However, my program still shows no signs of receiving the packets that I know are there.
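A possible alternative to the system property, on Java 7 and later, is to open the channel with an explicit protocol family so the socket is IPv4 regardless of JVM flags. This is only a sketch, not the bridge.jar code:

import java.net.InetSocketAddress;
import java.net.StandardProtocolFamily;
import java.nio.channels.DatagramChannel;

public class Ipv4Bind {
    public static void main(String[] args) throws Exception {
        // Open an IPv4-only datagram channel, then bind it to the trap port.
        DatagramChannel channel = DatagramChannel.open(StandardProtocolFamily.INET);
        channel.bind(new InetSocketAddress(8255));
        System.out.println("bound to " + channel.getLocalAddress());
    }
}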
The standard port number for SNMP traps is 162.
Is there a reason you're specifying a different port number? You can normally change the port number that traps are sent on/received on, but obviously both ends have to agree. So I'm wondering if this is your problem.
Ok, the solution to get my code working was to run the program as java -Djava.net.preferIPv4Stack=true -jar bridge.jar and to power cycle the SNMP trap sender. Thanks for your help, Brian.
Whenever I open a socket channel and the connection is accepted, one file descriptor is created internally, so I can create a maximum of 1024 client connections on Linux.
But I want to create more client connections without increasing the file descriptor limit in Linux (ulimit -n 20000).
So how can I create more sockets in Java?
If your session is limited to 1024 file descriptors you can't use more than that from a single JVM.
But since the ulimit is a per-process limitation, you could probably get around it by starting more JVMs (e.g. to get 2048 connections, start two JVMs each using 1024).
If you are using UDP, can you multiplex on a single local socket yourself? You'll be able to separate incoming packets by their source address and port (see the sketch below).
If it's TCP you're out of luck, and the TIME_WAIT period after closing each socket will make things worse.
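A minimal sketch of that UDP multiplexing idea (port 9000 and the per-client counter are just placeholders; real code would dispatch to per-client state):

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.SocketAddress;
import java.util.HashMap;
import java.util.Map;

public class UdpDemux {
    public static void main(String[] args) throws Exception {
        // One local socket (one fd) serves every client; clients are keyed
        // by their source address and port.
        try (DatagramSocket socket = new DatagramSocket(9000)) {   // example port
            Map<SocketAddress, Integer> packetsPerClient = new HashMap<>();
            byte[] buf = new byte[65535];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);
                SocketAddress client = packet.getSocketAddress();
                packetsPerClient.merge(client, 1, Integer::sum);
                // ... dispatch the first packet.getLength() bytes of buf to the
                //     per-client state looked up via `client` ...
            }
        }
    }
}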
Why can't you increase the ulimit? It seems like an artificial limitation. There is no way from Java code (AFAIK) to reset the ulimit; it needs to be set before the process starts, in a startup script or something similar.
The JBoss startup scripts perform a 'ulimit -n $MAX_FD' before they start JBoss ...
Len
The limit RLIMIT_NOFILE is enforced by the operating system and limits the highest fd a process can create. One fd is used for every file, pipe and socket that is opened.
There are hard and soft limits. Any process (like your shell or the JVM) is permitted to change the soft value, but only a privileged process (like a shell run by the root user) can change the hard value.
a) If you are not permitted to change the limit on the machine, find someone who is.
b) If you for some reason can't be bothered to type ulimit, I guess you can call the underlying system call using JNA: man setrlimit(2) — a rough sketch follows below. (Runtime.exec() won't do, as ulimit is a shell built-in command.)
See also Working With Ulimit
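Following up on (b): a rough sketch of calling setrlimit through JNA, assuming JNA 5.x and a 64-bit Linux where RLIMIT_NOFILE is 7 and rlim_t is a 64-bit value (verify both for your platform):

import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Structure;

public class RaiseNofile {
    private static final int RLIMIT_NOFILE = 7;   // Linux-specific constant

    @Structure.FieldOrder({"rlim_cur", "rlim_max"})
    public static class Rlimit extends Structure {
        public long rlim_cur;   // soft limit
        public long rlim_max;   // hard limit
    }

    public interface CLib extends Library {
        CLib INSTANCE = Native.load("c", CLib.class);
        int getrlimit(int resource, Rlimit rlim);
        int setrlimit(int resource, Rlimit rlim);
    }

    public static void main(String[] args) {
        Rlimit limit = new Rlimit();
        CLib.INSTANCE.getrlimit(RLIMIT_NOFILE, limit);
        System.out.println("soft=" + limit.rlim_cur + " hard=" + limit.rlim_max);

        // Raise the soft limit up to the hard limit; only root can raise the hard limit.
        limit.rlim_cur = limit.rlim_max;
        int rc = CLib.INSTANCE.setrlimit(RLIMIT_NOFILE, limit);
        System.out.println(rc == 0 ? "soft limit raised" : "setrlimit failed");
    }
}

Note that this can only raise the soft limit up to the current hard limit; raising the hard limit still requires privilege, so the startup-script approach is usually simpler.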
We recently upped our ulimit because our Java process was throwing lots of "Too many files open" exceptions.
It is now 65536 and we have not had any issues.
If you really are looking at coping with a huge number of connections, then the best way to do it scalably would be to implement a lightweight dataserver process that has no responsibility other than accepting and forwarding data to a parent process.
That way, as each dataserver gets saturated, you simply spawn a new instance to give yourself another 1024 connections. You could even run them on separate machines if needed.
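A rough sketch of that forwarder idea, assuming a hypothetical parent process listening on PARENT_HOST:PARENT_PORT and a simple client-to-parent byte relay:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class ForwardingDataServer {
    // PARENT_HOST and PARENT_PORT are hypothetical; point them at the parent process.
    static final String PARENT_HOST = "parent.example";
    static final int PARENT_PORT = 9000;

    public static void main(String[] args) throws Exception {
        int listenPort = Integer.parseInt(args[0]);
        try (ServerSocket server = new ServerSocket(listenPort)) {
            while (true) {
                Socket client = server.accept();
                new Thread(() -> relay(client)).start();
            }
        }
    }

    static void relay(Socket client) {
        // Both sockets are closed when the relay ends, so a saturated or crashed
        // client never leaves a descriptor behind in this process.
        try (Socket upstream = new Socket(PARENT_HOST, PARENT_PORT);
             InputStream in = client.getInputStream();
             OutputStream out = upstream.getOutputStream()) {
            in.transferTo(out);   // Java 9+; copies until the client disconnects
        } catch (Exception e) {
            // log and drop the connection
        } finally {
            try { client.close(); } catch (Exception ignored) {}
        }
    }
}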
Hi, I have created a socket server and client program using Java NIO.
My server and client are on different computers; the server runs Linux and the client runs Windows. When I create 1024 sockets on the client, the client machine supports them, but on the server I get a 'too many open files' error.
So how can I open 15000 sockets on the server without any error?
Or is there any other way to connect with 15000 clients at the same time?
Thanks
Bapi
Ok, questioning why he needs 15K sockets is a separate discussion.
The answer is that you are hitting the user's file descriptor limit.
Log in as the user that will run the listener and do $ ulimit -n to see the current limit.
Most likely 1024.
Using root, edit the file /etc/security/limits.conf and set:
{username} soft nofile 65536
{username} hard nofile 65536
65536 is just a suggestion; you will need to figure out the right value for your app.
Log out, log in again and recheck with ulimit -n to see that it worked.
You are probably going to need more than 15K fds for all that. Monitor your app with lsof.
Like this:
$ lsof -p {pid} <- lists all file descriptors
$ lsof -p {pid} | wc -l <- counts them
By the way, you might also hit the system-wide fd limit, so you need to check it:
$ cat /proc/sys/fs/file-max
To increase that one, add this to /etc/sysctl.conf:
#Maximum number of open FDs
fs.file-max = 65535
Why do you need to have 15000 sockets on one machine? Anyway, look at ulimit -n
If you're going to have 15,000 clients talking to your server (and possibly 200,000 in the future according to your comments) then I suspect you're going to have scalability problems servicing those clients once they're connected (if they connect).
I think you may need to step back and look at how you can architect your application and/or deployment to successfully achieve these sorts of numbers.
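For example, one common way to service thousands of connections without a thread (and stack) per client is a selector-based server. This is only an illustrative sketch under assumed names, with an arbitrary port and no real protocol handling, not your code:

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SelectorServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));   // arbitrary example port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        while (selector.select() > 0) {
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    if (client.read(buffer) == -1) {   // peer closed: release the fd
                        key.cancel();
                        client.close();
                    } else {
                        buffer.flip();
                        // ... handle the message in `buffer` ...
                    }
                }
            }
        }
    }
}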
If I open and close a socket by calling, for instance:
Socket s = new Socket( ... );
s.setReuseAddress(true);
in = s.getInputStream();
...
in.close();
s.close();
Linux states that this socket is still open, or at least that the file descriptor for the connection is present. When querying the open files for this process with lsof, there is still an entry for the closed connection:
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
java 9268 user 5u sock 0,4 93417 can't identify protocol
This entry remains until the program exits. Is there any other way to finally close the socket?
I'm a little worried that my Java application may tie up too many file descriptors. Is this possible? Or does Java keep these sockets around to re-use them even if setReuseAddress is set?
If those sockets are all in the TIME_WAIT state, this is normal, at least for a little while. Check that with netstat; it is common for sockets to hang around for a few minutes to ensure that straggling data from the socket is successfully thrown away before reusing the port for a new socket.
You may also want to check /proc/<pid>/fd; the directory contains all of your currently open file descriptors. If the entry disappears after you close the socket, you will not run into any problems (at least not with your file descriptors :).
I think it's not your program's problem.
In Sun's Java, when the socket-related native library is loaded, a MAGIC_SOCK fd is created.
A write on the MAGIC_SOCK results in a connection-reset exception, and a read on it results in EOF.
The MAGIC_SOCK's peer has been fully closed, and the MAGIC_SOCK itself is half-closed, so its state remains "can't identify protocol".
Maybe it's a socket of some other protocol ("Can't identify protocol" eh?) used internally in the implementation to do something, which gets created on the first socket.
Have you tried repeatedly creating sockets and closing them, to see if these sockets really persist? It seems likely that this is a one-off.
Java probably uses sockets internally for lots of things - they might be Unix, Netlink (under Linux) or some other type of socket.
Create a small bash script to monitor open sockets for a certain app or PID, and let it run while testing your Java app.
Anyway, I doubt that there is any kind of leak here, as sockets are heavily used in the Linux/Unix world and this kind of problem would bubble up very quickly.