Nutch Hadoop MapReduce Java heap space OutOfMemory - java

I am running a Nutch 1.16, Hadoop 2.8.3, Solr 8.5.1 crawler setup that runs fine up to a few million indexed pages. Then I run into Java heap space issues during the MapReduce job, and I cannot seem to find the correct way to increase that heap space. I have tried:
1. Passing -D mapreduce.map.memory.mb=24608 -D mapreduce.map.java.opts=-Xmx24096m when starting the nutch crawl.
2. Editing NUTCH_HOME/bin/crawl commonOptions mapred.child.java.opts to -Xmx16000m
3. Setting HADOOP_HOME/etc/hadoop/mapred-site.xml mapred.child.java.opts to -Xmx160000m -XX:+UseConcMarkSweepGC
4. Copying said mapred-site.xml into my nutch/conf folder
None of that seems to change anything. I run into the same heap space error at the same point in the crawling process. I have tried reducing the fetcher threads from 25 back to 12 and switching off parsing while fetching. Nothing changed and I am out of ideas. I have 64 GB of RAM, so that's really not an issue. Please help ;)
EDIT: fixed filename to mapred-site.xml

1./2. Passing -D ...
The heap space also needs to be set for the reduce task, using "mapreduce.reduce.memory.mb" and "mapreduce.reduce.java.opts". Note that the bin/crawl script was recently improved in this regard; see NUTCH-2501 and the current bin/crawl script.
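As a rough sketch of what that might look like on the command line (the directory names and memory values are placeholders rather than recommendations, and whether bin/crawl accepts -D options this way depends on the script version in use):
# example values only -- adjust to your cluster and bin/crawl version
bin/crawl -i -s urls/ \
  -D mapreduce.map.memory.mb=4096 -D mapreduce.map.java.opts=-Xmx3g \
  -D mapreduce.reduce.memory.mb=4096 -D mapreduce.reduce.java.opts=-Xmx3g \
  crawl/ 5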
3./4. Setting/copying hadoop-site.xml
Shouldn't this be set in "mapred-site.xml"?

Related

Set heap memory for neo4j-admin import

I'm trying to load a graph of several hundred million nodes with the neo4j-admin import tool, reading the data from CSV. The import runs for about two hours but then crashes with the following error:
Exception in thread "Thread-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.substring(String.java:1969)
at java.util.Formatter.parse(Formatter.java:2557)
at java.util.Formatter.format(Formatter.java:2501)
at java.util.Formatter.format(Formatter.java:2455)
at java.lang.String.format(String.java:2940)
at org.neo4j.unsafe.impl.batchimport.input.BadCollector$RelationshipsProblemReporter.getReportMessage(BadCollector.java:209)
at org.neo4j.unsafe.impl.batchimport.input.BadCollector$RelationshipsProblemReporter.message(BadCollector.java:195)
at org.neo4j.unsafe.impl.batchimport.input.BadCollector.processEvent(BadCollector.java:93)
at org.neo4j.unsafe.impl.batchimport.input.BadCollector$$Lambda$110/603650290.accept(Unknown Source)
at org.neo4j.concurrent.AsyncEvents.process(AsyncEvents.java:137)
at org.neo4j.concurrent.AsyncEvents.run(AsyncEvents.java:111)
at java.lang.Thread.run(Thread.java:748)
I've been trying to adjust my maximum and initial heap size settings in a few different ways. First I tried simply setting a HEAP_SIZE= variable before running the command to load the data, as described here, and I also tried setting the heap size on the JVM like this:
export JAVA_OPTS=%JAVA_OPTS% -Xms100g -Xmx100g
but whatever setting I use, when the import starts I get the same report:
Available resources:
Total machine memory: 1.48 TB
Free machine memory: 95.00 GB
Max heap memory : 26.67 GB
Processors: 48
Configured max memory: 1.30 TB
High-IO: true
As you can see, I'm building this on a large server that should have plenty of resources available. I'm assuming I'm not setting the JVM parameters correctly for Neo4j but I can't find anything online showing me the correct way to do this.
What might be causing my GC memory error and how can I resolve it? Is this something I can resolve by throwing more resources at the JVM and if so, how do I do that so the neo4j-admin import tool can use it?
RHEL 7, Neo4j CE 3.4.11, Java 1.8.0_131
The issue was resolved by increasing the maximum heap memory. The problem was I wasn't setting the heap memory allocation correctly.
It turns out there was a simple solution; it was just a matter of when I tried to set the heap memory. Initially, I had run the command export JAVA_OPTS='-server -Xms300g -Xmx300g' at the command line and then run my bash script to call neo4j-admin import. This was not working; neo4j-admin import continued to use the same heap space configuration regardless.
The solution was simply to include the command to set the heap memory in the shell script that calls neo4j-admin import. My shell script ended up looking like this:
#!/bin/bash
export JAVA_OPTS='-server -Xms300g -Xmx300g'
/usr/local/neo4j-community-3.4.11/bin/neo4j-admin import \
--ignore-missing-nodes=true \
--database=mag_cs2.graphdb \
--multiline-fields=true \
--high-io=true \
This seems super obvious but it took me almost a week to realize what I needed to change. Hopefully, this saves someone else the same headache.

Reindexing Solr: java.lang.OutOfMemoryError: Java heap space solr

I indexed a directory containing 16k files of PDFs/docs, etc., and everything worked great. However, when I tried to reindex my collection, I got a "java.lang.OutOfMemoryError: Java heap space" error for every line that Solr tried to index. I looked into the issue online already, and I tried to change my indexing command from java -Dc=collection -Drecursive -Dauto -jar example/exampledocs/post.jar c:/folder to java -Dc=collection -Xms1024m -Xmx1024m -Drecursive -Dauto -jar example/exampledocs/post.jar c:/folder, but I got the same errors (I don't know if it was right of me to add those options though). I've attached an image of my collection storage information. How can I fix this error?
According to this:
https://sitecore.stackexchange.com/questions/8849/java-lang-outofmemoryerror-during-solr-index-rebuilding
Start Solr with the following arguments to increase its memory:
solr start -m 4096m
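If the larger heap should survive restarts, an alternative sketch (assuming a default Solr install; the variable names below are the ones used in bin/solr.in.sh, and the value is an example only) is to set it in the include file instead of on the command line:
# bin/solr.in.sh -- example value only
SOLR_HEAP="4096m"
# or, equivalently:
# SOLR_JAVA_MEM="-Xms4096m -Xmx4096m"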

Increase memory for jMeter on command line

I am running JMeter from the command line on a Mac. Today it threw an out of memory, heap space error....
newbie$ sh jmeter.sh
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:41)
at java.awt.image.Raster.createPackedRaster(Raster.java:455)
I know I need to increase the memory allocated to it, but I am not sure how. I looked at this post, Unable to increase heap size for JMeter on Mac OSX, found the JMeter script file in the bin folder it mentions, and made the updates below:
HEAP="-Xms1024m -Xmx2048m"
NEW="-XX:NewSize=512m -XX:MaxNewSize=1024m"
But I am still getting the out of memory error. Do I just need to give it more, or am I changing it in the wrong place? Could it be that I need to restart my whole machine?
As far as I understand:
You made changes to the jmeter script
You're launching the jmeter.sh script
You want to know why the changes are not applied?
If you changed the jmeter script, why don't you just launch it as ./jmeter ?
If you need to start JMeter via jmeter.sh for any reason, run it as follows:
JVM_ARGS="-Xms1024m -Xmx2048m -XX:NewSize=512m -XX:MaxNewSize=1024m" && export JVM_ARGS && ./jmeter.sh
See the Running JMeter User Manual chapter in particular, and The Ultimate JMeter Resource List in general, for the relevant documentation.
If you have trouble finding it in the logs, you can use
ps -ef | grep jmeter
This may give you the details (not a Mac user, but I think ps -ef would work).
The other option is to use jvisualvm; it already ships with the JDK, so no extra tool is required. Run VisualVM and JMeter, find the JMeter entry on the left pane of VisualVM, click on it, and all the JVM details will be available.
After this you can confirm whether JMeter is actually getting the 2 GB maximum heap, and increase it if needed.
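As a rough sketch of that check (assuming a Unix-like ps; the exact output format varies), the heap flags can be filtered out of the running process like this:
# list the JMeter process arguments and keep only the -Xms/-Xmx flags
ps -ef | grep -i '[j]meter' | tr ' ' '\n' | grep '^-Xm'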
There could be different possible reasons for an OutOfMemoryError. If you have successfully changed the allocated memory/heap size and are still getting the issue, then you can look into the following factors:
Listeners: Do not use the 'TreeView' and 'TableView' listeners in an actual load test, as they consume a lot of memory. Best practice is to save results to a .JTL file, which can later be used for generating different reports.
Non-GUI Mode: Do not use GUI mode while performing an actual load test; run the test from the command line (see the sketch after this list).
For more, visit the following blog, as it has some really nice tips for solving OutOfMemory issues in JMeter:
http://www.testingdiaries.com/jmeter-out-of-memory-error/
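As a minimal sketch of the non-GUI mode mentioned above (the test plan and result file names are placeholders):
# run the test plan headless and write results to a .jtl file for later reporting
JVM_ARGS="-Xms1024m -Xmx2048m" ./jmeter.sh -n -t test_plan.jmx -l results.jtl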

Increase Heap Space Available for JVM: OutOfMemoryError: Requested array size exceeds VM limit Ubuntu 64Bit Neo4j 2.0

My specs:
- Ubuntu 64-bit
- Neo4j 2.0
- 32 GB of RAM
- AMD FX-8350 eight-core processor
The problem:
I'm making a request to my Neo4j server with the following query:
MATCH (being:my_label_2) RETURN being
And it gives me this error:
OutOfMemoryError
Requested array size exceeds VM limit
StackTrace:
java.lang.StringCoding$StringEncoder.encode(StringCoding.java:300)
java.lang.StringCoding.encode(StringCoding.java:344)
java.lang.String.getBytes(String.java:916)
org.neo4j.server.rest.repr.OutputFormat.toBytes(OutputFormat.java:194)
org.neo4j.server.rest.repr.OutputFormat.formatRepresentation(OutputFormat.java:147)
org.neo4j.server.rest.repr.OutputFormat.response(OutputFormat.java:130)
org.neo4j.server.rest.repr.OutputFormat.ok(OutputFormat.java:67)
org.neo4j.server.rest.web.CypherService.cypher(CypherService.java:101)
java.lang.reflect.Method.invoke(Method.java:606)
org.neo4j.server.rest.transactional.TransactionalRequestDispatcher.dispatch(TransactionalRequestDispatcher.java:139)
org.neo4j.server.rest.security.SecurityFilter.doFilter(SecurityFilter.java:112)
This works fine with "my_label_1", which returns around 30k results.
What I believe is the problem:
I don't have enough memory allocated to my JVM
Attempts made to fix/things I've found online:
I read what the manual says to do
And what the Ubuntu Forums say to do
So I've tried going to my neo4j folder (with cd as usual) and running it with the arguments this way:
sudo bin/neo4j start -Xmx4096M
However that didn't work. When Neo4j starts it does warn me that I might not have enough space with:
WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Using additional JVM arguments: -server -XX:+DisableExplicitGC -Dorg.neo4j.server.properties=conf/neo4j-server.properties -Djava.util.logging.config.file=conf/logging.properties -Dlog4j.configuration=file:conf/log4j.properties -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
Question
I know I'm definitely using the arguments wrong; I honestly don't have much experience with JVM configuration. How should I approach this? Am I missing something?
You should put JVM settings into the conf/neo4j-wrapper.conf file. It should look like this:
user@pc:> head -n 7 neo4j-enterprise-2.0.0/conf/neo4j-wrapper.conf
wrapper.java.additional=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
wrapper.java.additional=-Djava.util.logging.config.file=conf/logging.properties
wrapper.java.additional=-Dlog4j.configuration=file:conf/log4j.properties
# Java Additional Parameters
wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
Note that you can configure different aspects of Neo4j via different files, so it's better to read the description of every file in that conf/ directory in order to get familiar with what can be done and how exactly.
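For the heap specifically, a sketch of what could be added to conf/neo4j-wrapper.conf in the 2.x series (values are in MB and are examples only, not tuned recommendations):
# initial and maximum JVM heap size, in MB -- example values only
wrapper.java.initmemory=2048
wrapper.java.maxmemory=8192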

Where to find the heap dump after an "Out Of Memory"

I'm using ASANT to run an xml file which points to a NARS.jar file.
I'm getting "java.lang.OutOfMemoryError: Java heap space" and I'm researching this.
So I have found that I need to set "-XX:+HeapDumpOnOutOfMemoryError" to create a dump file to analyze.
I edited ASANT.bat and added "-XX:+HeapDumpOnOutOfMemoryError" to ANT_OPTS:
set ANT_OPTS= "-XX:+HeapDumpOnOutOfMemoryError" "-Dos.name=Windows_NT" "-Djava.library.path=%AS_INSTALL%\lib;%AS_ICU_LIB%;%AS_NSS%" "-Dcom.sun.aas.installRoot=%AS_INSTALL%" "-Dcom.sun.aas.instanceRoot=%AS_INSTALL%" "-Dcom.sun.aas.instanceName=server" "-Dcom.sun.aas.configRoot=%AS_CONFIG%" "-Dcom.sun.aas.processLauncher=SE" "-Dderby.root=%AS_DERBY_INSTALL%"
But I can't seem to find any dump file.
I will use the Eclipse Memory Analyzer to analyze the dump once I find it.
I also tried to set the option "-XX:HeapDumpPath=c:\memdump\bds.hprof", but no dump was created there.
Anyone got an idea of what I'm doing wrong?
Thanks in advance
It looks like your application is running on Windows. A Windows file path needs to have its backslashes escaped (\\). As per your example, -XX:HeapDumpPath should look like:
-XX:HeapDumpPath=c:\\memdump\\bds.hprof
Besides "-XX:+HeapDumpOnOutOfMemoryError" there are several other ways to capture a heap dump as well.
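For example, a sketch using jmap from the JDK (assuming you know the process id; the output file name is a placeholder):
# dump the live objects of a running JVM to a binary .hprof file
jmap -dump:live,format=b,file=heap.hprof <pid>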
I found that I could use VisualVM from Sun to get a heap dump, and see it live.
Easy solution
It's in the working directory of the application (i.e. where you started it). I'm not sure what happens if the process does not have the necessary privileges to write it there; probably, writing the dump would fail silently.
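If you're not sure which directory that was, a quick sketch for locating dump files on a Unix-like system (on Windows, dir /s *.hprof from a likely parent directory is the rough equivalent):
# search for heap dump files below the current directory
find . -name "*.hprof"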
Are you sure that Ant is the process with the OOME? It may be a process started by Ant.
Add "-debug" to the ANT_OPTS for debugging information.
Are you seeing the targets being printed out during the execution?
You can also fork the various processes started by Ant (this will slow things down but may help isolate the culprit).
Lastly, maybe you just need more memory than the default. Add:
-Xms256m -Xmx512m -XX:PermSize=64m -XX:MaxPermSize=256m
to ANT_OPTS.
Umm... how about wherever java.io.tmpdir is pointing?
