Why does my Apache Nutch warc and commoncrawldump fail after crawl?

Why does my Apache Nutch warc and commoncrawldump fail after crawl? - java

I have successfully crawled a website using Nutch and now I want to create a warc from the results. However, running both the warc and commoncrawldump commands fail. Also, running bin/nutch dump -segement .... works successfully on the same segment folder.
I am using nutch v-1.17 and running:
bin/nutch commoncrawldump -outputDir output/ -segment crawl/segments
The error from hadoop.log is ERROR tools.CommonCrawlDataDumper - No segment directories found in my/path/
despite having just ran a crawl there.

Inside the segments folder were segments from a previous crawl that were throwing up the error. They did not contain all the segment data as I believe the crawl was cancelled/finished early. This caused the entire process to fail. Deleting all those files and starting anew fixed the issue.

Related

Error while trying to write on parquet file in datastage 11.7 (File_Connector_20,0: java.lang.NoClassDefFoundError: org.apache.hadoop.fs.FileSystem)

we have recently upgraded the DataStage from 9.1 to 11.7 on Server AIX 7.1 .
and i'm trying to use the new connector "File Connector" to write on parquet file. i created simple job takes from teradata as a source and write on the parquet file as a target.
Image of the job
but facing below error :
> File_Connector_20,0: java.lang.NoClassDefFoundError: org.apache.hadoop.fs.FileSystem
at java.lang.J9VMInternals.prepareClassImpl (J9VMInternals.java)
at java.lang.J9VMInternals.prepare (J9VMInternals.java: 304)
at java.lang.Class.getConstructor (Class.java: 594)
at com.ibm.iis.jis.utilities.dochandler.impl.OutputBuilder.<init> (OutputBuilder.java: 80)
at com.ibm.iis.jis.utilities.dochandler.impl.Registrar.getBuilder (Registrar.java: 340)
at com.ibm.iis.jis.utilities.dochandler.impl.Registrar.getBuilder (Registrar.java: 302)
at com.ibm.iis.cc.filesystem.FileSystem.getBuilder (FileSystem.java: 2586)
at com.ibm.iis.cc.filesystem.FileSystem.writeFile (FileSystem.java: 1063)
at com.ibm.iis.cc.filesystem.FileSystem.process (FileSystem.java: 935)
at com.ibm.is.cc.javastage.connector.CC_JavaAdapter.run (CC_JavaAdapter.java: 444)
i followed the steps in below link :
https://www.ibm.com/support/knowledgecenter/SSZJPZ_11.7.0/com.ibm.swg.im.iis.conn.s3.usage.doc/topics/amaze_file_formats.html
1- i uploaded the jar files into "/ds9/IBM/InformationServer/Server/DSComponents/jars"
2- added them to CLASSPATH in agent.sh then restarted the datastage.
3- i have set The environment variable CC_USE_LATEST_FILECC_JARS to the value parquet-1.9.0.jar:orc-2.1.jar.
i tried also to add the CLASSPATH as an environment variable in the job but not worked.
noting that i'm using Local in File System.
so any hint is appreciated as i'm searching a lot time ago.
Thanks in advance,

Which File System mode you are using ? If you are using Native HDFS as File System mode, then you would need to configure CLASSPATH to include some third party jars.
Perhaps these links should provide you with some guidance.
https://www.ibm.com/support/pages/node/301847
https://www.ibm.com/support/pages/steps-required-configure-file-connector-use-parquet-or-orc-file-format
Note : Based on the hadoop distribution and version you are using, the version of the jars could be different.
If the above information does not help in resolving the issue, then you may have to reach out to IBM Support to get this addressed.

TO use File Connector, there is no need to add CLASSPATH in agent.sh unless you want to import HDFS files from IMAM.
If your requirement is reading Parquet files, then set
$CC_USE_LATEST_FILECC_JARS=parquet-1.9.0.jar
$FILECC_PARQUET_AVRO_COMPAT_MODE=TRUE
If you are still seeing issue, then run job with $CC_MSG_LEVEL=2 and open IBM support case along with job design, FULL job log and Version.xml file from Engine tier.

HIPI on hadoop1.2.1 exceptions while running hibInfo.sh

Friends,
I started using hipi on my hadoop1.2.1 multinode cluster(1master, 2 slaves). I have installed gradle, build everything as mentioned in the instructions of virginia.edu getting started.
When I run hibImport.sh with set of images from local disk, they are exported successfully to Hadoop as .hib and .hib.dat files. I can browse them in HDFS
But when I run hibInfo.sh , It is throwing exceptions as below. I tried everything. But nothing worked. what can be the possible problem? Can someone please help me.
Input HIB: /inputimages.hib
Display meta data: true
Display EXIF data: false
'Exception in thread "main" java.lang.NoSuchMethodError:
com.drew.metadata.Metadata.getDirectories()Ljava/lang/Iterable;' 'at
org.hipi.image.io.ExifDataReader.extractAndFlatten(ExifDataReader.java:48)'
'at org.hipi.image.io.ImageCodec.decodeImage(ImageCodec.java:118)'
Thanks

Apache storm can not find main class of storm starter

I am setting up an Apache Storm system but am having problems getting the program to run consistently. I have set up storm on three servers but it only works consistently on one. I think the issue lies somewhere in the path of the command.
I have been using storm-starter to set up the program and have tested it locally with RollingTopWords. When I run the following command $ storm jar storm-starter-*.jar storm.starter.RollingTopWords the computer stalls a second then i get the following error:
Could not find or load main class storm.starter.RollingTopWords
The jar is stored in the directory /apache/storm/examples/storm-starter/target . Let me know if there is any other information I can provide that would be of help because I'm feeling a little desperate at this point.
The following is the entire output for the program that doesn't work.
Running: /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -client -Dstorm.options= -Dstorm.home=/home/scix3/apache/storm -Dstorm.log.dir=/home/scix3/apache/storm/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /home/scix3/apache/storm/lib/kryo-2.21.jar:/home/scix3/apache/storm/lib/core.incubator-0.1.0.jar:/home/scix3/apache/storm/lib/commons-fileupload-1.2.1.jar:/home/scix3/apache/storm/lib/ring-servlet-0.3.11.jar:/home/scix3/apache/storm/lib/clj-stacktrace-0.2.2.jar:/home/scix3/apache/storm/lib/jline-2.11.jar:/home/scix3/apache/storm/lib/servlet-api-2.5.jar:/home/scix3/apache/storm/lib/disruptor-2.10.1.jar:/home/scix3/apache/storm/lib/log4j-over-slf4j-1.6.6.jar:/home/scix3/apache/storm/lib/clojure-1.5.1.jar:/home/scix3/apache/storm/lib/commons-exec-1.1.jar:/home/scix3/apache/storm/lib/logback-core-1.0.13.jar:/home/scix3/apache/storm/lib/jetty-util-6.1.26.jar:/home/scix3/apache/storm/lib/slf4j-api-1.7.5.jar:/home/scix3/apache/storm/lib/carbonite-1.4.0.jar:/home/scix3/apache/storm/lib/compojure-1.1.3.jar:/home/scix3/apache/storm/lib/minlog-1.2.jar:/home/scix3/apache/storm/lib/commons-lang-2.5.jar:/home/scix3/apache/storm/lib/tools.macro-0.1.0.jar:/home/scix3/apache/storm/lib/reflectasm-1.07-shaded.jar:/home/scix3/apache/storm/lib/tools.cli-0.2.4.jar:/home/scix3/apache/storm/lib/math.numeric-tower-0.0.1.jar:/home/scix3/apache/storm/lib/logback-classic-1.0.13.jar:/home/scix3/apache/storm/lib/tools.logging-0.2.3.jar:/home/scix3/apache/storm/lib/asm-4.0.jar:/home/scix3/apache/storm/lib/jetty-6.1.26.jar:/home/scix3/apache/storm/lib/snakeyaml-1.11.jar:/home/scix3/apache/storm/lib/hiccup-0.3.6.jar:/home/scix3/apache/storm/lib/clj-time-0.4.1.jar:/home/scix3/apache/storm/lib/jgrapht-core-0.9.0.jar:/home/scix3/apache/storm/lib/clout-1.0.1.jar:/home/scix3/apache/storm/lib/chill-java-0.3.5.jar:/home/scix3/apache/storm/lib/commons-io-2.4.jar:/home/scix3/apache/storm/lib/joda-time-2.0.jar:/home/scix3/apache/storm/lib/storm-core-0.9.4.jar:/home/scix3/apache/storm/lib/objenesis-1.2.jar:/home/scix3/apache/storm/lib/commons-logging-1.1.3.jar:/home/scix3/apache/storm/lib/ring-core-1.1.5.jar:/home/scix3/apache/storm/lib/ring-jetty-adapter-0.3.11.jar:/home/scix3/apache/storm/lib/commons-codec-1.6.jar:/home/scix3/apache/storm/lib/json-simple-1.1.jar:/home/scix3/apache/storm/lib/ring-devel-0.3.11.jar:storm-starter-.jar:/home/scix3/apache/storm/conf:/home/scix3/apache/storm/bin -Dstorm.jar=storm-starter-.jar storm.starter.RollingTopWords
Error: Could not find or load main class storm.starter.RollingTopWords

The main issue for the error
Could not find or load main class storm.starter.RollingTopWords cloud be.
Check the launch configuration while building the jar.
you must be very careful while building the jar ,it asks you to choose destination folder and launch configuration(launch configuration should be of same project)
You might have missed the main class in your project.
Before using Stormsubmitter in Remote cluster, check once weather it works properly localcluster

To check if the problem is with storm unable to find the jar, you can try issuing
storm jar /fullpath/my-storm-jar.jar Classname
Few other things you can make sure
The jar is compiled properly/jar contains the RollingTopWords class
storm.yaml points to the correct nimubs (This seems less probable, as the the connection is being made and there is an attempt to load the topology)

javahelp indexer for full text search

I have a problem with the full text search database generation in javaHelp. In order to generate the db I have to execute a command from a batch file:
java -cp jhall.jar com.sun.java.help.search.Indexer -db .\JavaHelpSearch .\html
This only works if the batch file is in the same directory of the html directory containing the files to index. I have tried to execute the batch file from an external location using absolute paths:
java -cp jhall.jar com.sun.java.help.search.Indexer -db C:\help\JavaHelpSearch C:\help\html
The db generation give no erros, but if I search a word in the javaHelp window I get the following error:
Failed to create URL from file:/C:/help/myHelp.hs|C:/help/html/myhtml.html
I have spent a lot of time facing this problem, without success. What I really need is to call the Indexer class directly from my Java application, but the same problem happens there.

InvalidInputException When loading file into Hbase MapReduce

I am very new for Hadoop and Map Reduce. For starting bases i executed Word Count Program. It executed well but when i try running csv file into Htable which i followed [Csv File][1]
It throwing me in to following error which i am not aware of it, please can any one help me in knowing the above error
12/09/07 05:47:31 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path [1]: http://salsahpc.indiana.edu/ScienceCloud/hbase_hands_on_1.htm#shell_exercises
This error is really kiiling my time, please can any one help me with this exception

The problem why you are directing to path hdfs://HadoopMaster:54310/user/hduser/csvtable instead of csvtable is.
1) Add your Hbase jars into Hadoop class path because your Map reduce doesn't by default configure to hbase jars.
2) GO to hadoop-ev.sh and edit Hadoop_classpath and add all your hbase jars in it. hope it might work now

your job is attempting to read an input file from:
hdfs://HadoopMaster:54310/user/hduser/csvtable
you should verify that this file exists on HDFS using the hadoop shell tools:
hadoop fs -ls /user/hduser/csvtable
my guess is that your file hasn't been loaded onto HDFS.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Why does my Apache Nutch warc and commoncrawldump fail after crawl? - java

Inside the segments folder were segments from a previous crawl that were throwing up the error. They did not contain all the segment data as I believe the crawl was cancelled/finished early. This caused the entire process to fail. Deleting all those files and starting anew fixed the issue.

Related

Error while trying to write on parquet file in datastage 11.7 (File_Connector_20,0: java.lang.NoClassDefFoundError: org.apache.hadoop.fs.FileSystem)

HIPI on hadoop1.2.1 exceptions while running hibInfo.sh

Apache storm can not find main class of storm starter

javahelp indexer for full text search

InvalidInputException When loading file into Hbase MapReduce

Categories

Resources