Open Search Server (OSS) is crashing while crawling files. OSS runs as a daemon on an Ubuntu box. This is a production server with 64 GB of RAM and 12 cores, crawling about 20 GB of files on an extremely fast NAS that it mounts. OSS is allotted 2 GB of memory. The largest file that should get crawled is about 1.3 GB, and there are 5 MP4 files that are all over 1 GB.
Usually at some point during the crawl, OSS becomes completely unresponsive, and restarting OSS fixes the problem. Today I monitored a crawl, which usually uses one or two cores at a time; when it crashed it was maxing out all 12 cores. Total memory usage on the server was fine, but I'm not sure how much OSS itself was using.
We've looked at the OSS log files and there isn't a single error that consistently precedes each crash, but there are two warnings that are pretty common in the logs:
WARN: org.apache.cxf.jaxrs.utils.JAXRSUtils - Both com.jaeksoft.searchlib.webservice.crawler.database.DatabaseImpl#run and com.jaeksoft.searchlib.webservice.crawler.database.DatabaseImpl#run are equal candidates for handling the current request which can lead to unpredictable results
WARN: root - Low memory free conditions: flushing crawl buffer
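Next time it hangs I plan to capture a thread dump and a quick look at the hot threads, along these lines (a rough sketch; the pgrep pattern is a guess at how our OSS daemon shows up in the process list):

    # Find the OSS JVM (the pgrep pattern is an assumption about how the daemon is launched)
    PID=$(pgrep -f opensearchserver | head -n1)

    # Which Java threads are burning CPU when all 12 cores max out
    top -H -b -n 1 -p "$PID" | head -n 30

    # Thread dump, to map the hot thread IDs back to Java stack traces
    jstack -l "$PID" > /tmp/oss-threads-$(date +%s).txt

    # Rough heap picture, to see whether the 2 GB allotment is the bottleneck
    jmap -heap "$PID"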
We have one index that handles all files. It is based on the file crawler template—the only changes are:
An extra analyzer that uses 4 regex replaces.
An extra field that copies the url field and uses the analyzer above.
We added one disk location, which has all the files.
We join another index in our query.
When we are able to crawl, querying the index works fine afterwards. I think maybe the crashes only happen if there's a search query on the index during the crawl, but I haven't been able to confirm that yet.
We've been using Hazelcast for a number of years but I'm new to the group.
We have a cluster formed by a dedicated Java application (its sole purpose is to provide the cluster). It's using the 3.8.2 jars and running on JDK 1.8.0_192 on Linux (CentOS 7).
The cluster manages relatively static data (i.e. a few updates a day or week), although an update may involve changing a 2 MB chunk of data. We're using the default partitioning config with 271 partitions across 6 cluster members. There are between 40 and 80 clients, and each client connection should be long-lived and stable.
"Occasionally" we get into a situation where the Java app that provides the cluster repeatedly restarts and any client that attempts to write to the cluster is unable to do so. We've had issues in the past where the cluster app ran out of memory due to the heap limits set on the JVM command line. We've since increased those limits and (to the best of my knowledge) the process restarts are no longer caused by OutOfMemoryErrors.
I'm aware we're running a very old version and many people will suggest simply updating. This is work we will carry out but we're attempting to diagnose the existing issue with the system we have in front of us.
What I'm looking for here are suggestions on the types of investigation to carry out and queries to run (either periodically while the system is healthy, or while it is in this failed state).
We regularly use tools such as netstat, tcpdump, Wireshark and top (I'm sure there are more) when diagnosing issues like this, but we have been unable to establish a convincing root cause.
Any help greatly appreciated.
Thanks,
Dave
As per the problem description.
Our only way to resolve the issue is to bounce the cluster completely, i.e. stop all the members and then restart the cluster.
Ideally we'd have a system that remained stable and could recover from whatever "event" causes the issue we're seeing.
This may involve config or code changes.
Updating entries of around 2 MB has many consequences: large serialization/deserialization costs, fat packets on the network, the cost of accommodating those chunks in the JVM heap, etc. An ideal entry size is under 30-40 KB.
For your immediate problem, start with GC diagnosis. You can use jstat to investigate memory usage patterns; if you are running into a lot of full GCs and/or back-to-back full GCs, then you will need to adjust the heap settings. Also check network bandwidth, which is usually the prime suspect when fat packets are traveling through the network.
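For example, against one of the member JVMs (the pgrep pattern below is a placeholder for however you launch the member; adjust it, or just use the known PID):

    # GC / heap occupancy sampled every 5 seconds
    PID=$(pgrep -f your-cluster-app.jar | head -n1)   # placeholder process name
    jstat -gcutil "$PID" 5000
    # O = old gen occupancy %, FGC/FGCT = full GC count and time.
    # A climbing FGC with the old gen pinned near 100% points at heap pressure.

    # For after-the-fact analysis, start the member with JDK 8 GC logging enabled:
    #   -Xloggc:/var/log/cluster-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    #   -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M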
All of the above are just band-aid solutions; you should really look to break your entries down into smaller ones.
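While you work toward smaller entries, Hazelcast itself also ships a diagnostics facility and a health monitor that can be switched on with system properties. A rough sketch (the property names are from the 3.x documentation, so double-check them against 3.8.2; the jar name is a placeholder):

    # Add to the member's existing JVM arguments (keep your current heap settings)
    java \
      -Dhazelcast.diagnostics.enabled=true \
      -Dhazelcast.diagnostics.directory=/var/log/hazelcast-diagnostics \
      -Dhazelcast.health.monitoring.level=NOISY \
      -jar your-cluster-app.jar   # placeholder jar name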
All other files process fine, but this file seems to be special.
Our workaround is to restart both the Cassandra database and the Java application, then re-upload the file into the S3 bucket for processing. After that, the same file is processed correctly.
Right now, we're restarting the Java application and the Cassandra database every Friday morning. We suspect an accumulation of something is the root cause of the problem, as the file is processed perfectly fine after a complete restart.
This is a screenshot of the error in Cassandra:
We're using Cassandra as a backend for Akka Persistence.
So a failure to ingest the file only happens when the cluster has been up for some time; I don't see the failure if the ingest is done soon after a cluster start.
First, that's not an ERROR, it's an INFO. Secondly, it's telling you that you're writing into cache faster than the cache can be recycled. If you're not seeing any negative effects (data loss, stale replicas, etc), I wouldn't sweat this. Hence, the INFO and not ERROR.
If you are seeing effects, and you have some spare non-heap RAM on the nodes, you could try increasing file_cache_size_in_mb in cassandra.yaml. It defaults to 512 MB, so you could try doubling that and see if it helps.
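A rough sketch of that change, assuming a package install with the config at /etc/cassandra/cassandra.yaml (adjust paths to your layout):

    # See what the node currently has configured
    grep -n "file_cache_size_in_mb" /etc/cassandra/cassandra.yaml
    # Uncomment / change the line, for example:
    #   file_cache_size_in_mb: 1024
    sudo systemctl restart cassandra

    # Afterwards, watch whether the INFO lines become less frequent and check cache stats
    nodetool info | grep -i cache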
"we're restarting the Java application and Cassandra database every Friday morning"
Also, there's nothing to really gain by restarting Cassandra on a regular basis. Unless you're running it on a Windows machine (seriously hope you are not), you're really not helping anything by doing this. My team supports high write throughput nodes that run for months, and are only restarted for security patching.
Recently I came across an interesting scenario with Cloudera Hadoop and HDFS where we were unable to start our NameNode service.
When attempting a restart of HDFS services we were unable to successfully restart the NameNode service in our cluster. Upon review of the logs, we did not observe any ERRORs, but did see a few entries related to JvmPauseMonitor...
org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 5015ms
We were observing these entries in /var/log/hadoop-hdfs/NAMENODE.log.out and were not seeing any other errors, including in /var/log/messages.
CHECK YOUR JAVA HEAP SIZES
Ultimately, we were able to determine that we were running into a Java OOM Exception that wasn't being logged.
From a performance perspective, as a general rule you should configure at least 1 GB of Java heap for every 1 million blocks in HDFS.
In our case, the resolution was as simple as increasing the Java heap size for the NameNode and Secondary NameNode services and restarting, as we had grown to 1.5 million blocks but were still using the default 1 GB heap.
After increasing the Java heap size to at least 2 GB and restarting the HDFS services, we were green across the board.
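For reference, on a plain Apache Hadoop install the change would look roughly like the lines below in hadoop-env.sh; on a Cloudera Manager managed cluster you would instead raise the NameNode and Secondary NameNode Java heap settings in the HDFS service configuration (the 2 GB value is simply what fit our block count):

    # $HADOOP_CONF_DIR/hadoop-env.sh (sketch; CM-managed clusters set this through the UI instead)
    export HADOOP_NAMENODE_OPTS="-Xms2g -Xmx2g ${HADOOP_NAMENODE_OPTS}"
    export HADOOP_SECONDARYNAMENODE_OPTS="-Xms2g -Xmx2g ${HADOOP_SECONDARYNAMENODE_OPTS}"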
Cheers!
Too much logging activity over the weekend led to the following error being thrown by ColdFusion:
Message: No space left on device
StackTrace: java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:269)
    at coldfusion.compiler.NeoTranslator. ......
By this morning, pages on the ColdFusion web site were not loading at all. The disk (12 GB) was over 99% full. We moved several files to the second disk and now it's at about 80%, much lower than it had been. We're going to direct logging activity to the second disk (100 GB) to prevent a repeat. Having created space on the disk, we restarted Apache and ColdFusion, but pages are still not loading.
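For the logging move, the rough plan is something like this (the ColdFusion root and the mount point are example paths; ours may differ):

    sudo /opt/coldfusion10/cfusion/bin/coldfusion stop    # control script path may differ
    sudo mv /opt/coldfusion10/cfusion/logs /mnt/bigdisk/coldfusion-logs
    sudo ln -s /mnt/bigdisk/coldfusion-logs /opt/coldfusion10/cfusion/logs
    sudo /opt/coldfusion10/cfusion/bin/coldfusion start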
When we run top -H it appears that java is running at close to 100% CPU. Does anyone have a clue what's going on or what info I need to provide so someone can figure it out?
The setup is AWS, Ubuntu 13.04, ColdFusion 10, and MySQL (RDS).
UPDATE
I have made some really strange but hopefully helpful observations. I am still trying to locate a tool to help with getting thread dumps. Whenever I restart ColdFusion, most pages load fine and CPU usage seems normal, mostly 0.7-1.5%, though once in a while there are spikes of up to 10%. But there is one particular page which, when I try to load it, always causes CPU usage to rise to 97%. The pages that load fine have a simple query reading data from one table; the problematic page has an inner join and reads data from two tables. I don't know how helpful this is, but I think it is too consistent to be insignificant.
UPDATE 2
And since this issue started, the following errors have appeared for the first time, and there are hundreds of lines of them:
[2243:140630871263104] [error] ajp_send_request::jk_ajp_common.c (1649):
(cfusion) connecting to backend failed. Tomcat is probably not started
or is listening on the wrong port (errno=111)
// and
[2234:140630871263104] [info] ajp_connect_to_endpoint::jk_ajp_common.c (1027):
Failed opening socket to (127.0.0.1:8012) (errno=111)
// and
[2756:139822377088896] [info] jk_handler::mod_jk.c (2702): No body with status=500
for worker=cfusion
UPDATE 3 - RESOLVED
After stripping the "offending" page of all the code in it, replacing it with some plain text, and trying to load the page several times, we realized that ColdFusion was not loading the live pages. It was loading cached compiled versions of the pages, normally found in a subfolder named <cf-root>/cfusion/wwwroot/WEB-INF/cfclasses. Removing (or renaming) that subfolder resolved the issue.
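Concretely, what fixed it was something along these lines (the cf-root shown is an example path from a default Linux install; adjust to yours):

    sudo /opt/coldfusion10/cfusion/bin/coldfusion stop     # stop CF so the class files are released
    sudo mv /opt/coldfusion10/cfusion/wwwroot/WEB-INF/cfclasses \
            /opt/coldfusion10/cfusion/wwwroot/WEB-INF/cfclasses.bak
    sudo /opt/coldfusion10/cfusion/bin/coldfusion start    # CF recreates cfclasses and recompiles the pages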
You would probably want to provide thread dumps, a few of them taken a few seconds apart, and GC logs.
Thread dumps can be produced with jstack (a tool provided in your JDK's bin directory), and garbage collector logging must be activated beforehand.
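For example, something along these lines (the pgrep pattern is a placeholder for however the ColdFusion/Tomcat process shows up on your box; confirm the PID with ps first):

    PID=$(pgrep -f coldfusion | head -n1)        # placeholder pattern
    for i in 1 2 3 4 5; do                       # five thread dumps, five seconds apart
      jstack -l "$PID" > /tmp/cf-threads-$i.txt
      sleep 5
    done
    # GC logging has to be switched on up front, e.g. by adding -Xloggc and -XX:+PrintGCDetails
    # to the JVM arguments (on ColdFusion 10 these live in cfusion/bin/jvm.config)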
We are currently using a modified version of Atlassian Confluence v3.5 and have created a space containing a large number of pages (about 5,000) and a large number of attachments (about 10,000).
When navigating to the home page of this big space, it takes about 3 minutes to load completely (the Safari web browser shows a spinning wheel indicating page resources are still being loaded).
In these 3 minutes, we are unable to determine where the processing time is being spent.
We turned on Confluence's profiling feature, but it did not help because there was not much output in the log file.
The Confluence process (which is a Java process) uses about 8.2% CPU during the 3 minutes. How can I figure out what the process is doing?
You have all these options:
The -XX:+HeapDumpOnCtrlBreak JVM option
The -XX:+HeapDumpOnOutOfMemoryError JVM option
jmap
HotSpotDiagnosticMXBean
A Thread Dump may also be useful. You can use it to figure out what the threads are waiting on.
You can also use a profiler. The best one I've used is JProfiler, but there are others available that are free and open source; I think NetBeans comes with one, and Sun makes one called VisualVM.
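For instance, to grab a heap dump and a thread dump from the running Confluence JVM (the pgrep pattern is a placeholder; confirm the real PID with ps first):

    PID=$(pgrep -f confluence | head -n1)   # placeholder pattern
    # Heap dump for offline analysis in a tool such as VisualVM or Eclipse MAT
    jmap -dump:live,format=b,file=/tmp/confluence-heap.hprof "$PID"
    # Thread dump taken while the slow page is loading, to see what the worker threads are doing
    jstack -l "$PID" > /tmp/confluence-threads.txt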