I found the following project on GitHub, https://github.com/fbukevin/hadoop-cooccurrence, which uses a co-occurrence algorithm in Hadoop.
I'm using a virtualized Ubuntu 14.04 and managed to install Hadoop as a single-node cluster following these instructions: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php. I'm new to Hadoop and these are my first attempts to run a program with YARN.
I can execute the yarn command on the command line, but I don't know how to run the co-occurrence algorithm with YARN. The project description says the program can be used with the following command:
$ yarn jar <hadoop>.jar [pairs | stripes] <input_file>
So I tried this:
$ yarn jar /home/vmiller/Downloads/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar pairs pg100.txt
Exception in thread "main" java.lang.ClassNotFoundException: pairs
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at org.apache.hadoop.util.RunJar.run(RunJar.java:214)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
This is definitely not correct, but I don't know how to run the command properly. Somehow I have to tell yarn to use Cooccurrence.java, located at hadoop-cooccurrence/src/main/java/cooc/Cooccurrence.java, because this file seems to be the one that executes the co-occurrence algorithm. But how do I tell yarn to use this file with the pairs and stripes arguments on the input file?
You should give yarn jar the path to the jar that contains the Cooccurrence class.
The jar is in the target folder (cooc-1.0-SNAPSHOT.jar).
You don't need to specify the class name, as it is set in the manifest file.
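For example, assuming the project is built with Maven from the cloned repository (the mvn step and the workspace path below are assumptions taken from the question, not part of this answer), the invocation looks roughly like this:

# build the jar (assumed Maven build; produces target/cooc-1.0-SNAPSHOT.jar)
cd /home/vmiller/workspace/hadoop-cooccurrence
mvn package
# the main class comes from the jar's manifest, so only the mode and input are needed
yarn jar target/cooc-1.0-SNAPSHOT.jar pairs pg100.txt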
I actually managed to run the program. My approach wasn't that wrong; as tokiloutok mentioned, I had to include the right jar file.
Before I could execute the command, I had to import pg100.txt into HDFS.
First I had to deactivate safe mode on the NameNode with
hdfs dfsadmin -safemode leave
and import the file with
hdfs dfs -put /home/vmiller/workspace/hadoop-cooccurrence/pg100.txt /user/hadoop/
so that I could finally run
yarn jar target/cooc-1.0-SNAPSHOT.jar pairs pg100.txt
without getting any errors.
Related
I successfully compiled my HBase class using
javac -cp "/hbase/lib/*" CreateTable.java
but running it with
java CreateTable
throws the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
at CreateTable.main(CreateTable.java:16)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 1 more
javac -cp `hbase classpath` CreateTable.java
java -cp `hbase classpath` CreateTable
where hbase classpath expands to the cluster classpath, i.e. the paths of the HBase jar files installed on the cluster.
If you want to see the folder location of your hbase/lib, go to a shell and try hbase classpath; your HBase lib jars will be displayed there.
Note: if you are using Maven for your build, then you have to set 'provided' as the scope for the HBase dependency where you specify the groupId, artifactId, etc.
In addition to specifying the classpath of libraries you're depending on in order to compile your program, you need to specify them when executing your program. The dependencies aren't "compiled-in", they're just referenced during compilation to ensure that they're linked correctly, but they need to be there at runtime as well.
So, you probably want to run something like java -cp ".:/hbase/lib/*" CreateTable (use ";" instead of ":" as the separator on Windows) in order to have the same libraries at runtime as you used at compile time, as well as the current directory where your compiled .class file is.
In enterprise programs, usually a dependency management system like Maven, or at least what's built into most IDEs, is used to help keep track of the dependencies and call Java and related tools with the right paths.
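A minimal sketch of a consistent compile-and-run cycle, assuming the HBase jars live under /hbase/lib as in the question:

# compile against the HBase libraries
javac -cp "/hbase/lib/*" CreateTable.java
# run with the same libraries plus the current directory (where CreateTable.class lives)
java -cp ".:/hbase/lib/*" CreateTable
# alternatively, let HBase compute the full classpath for you
java -cp ".:$(hbase classpath)" CreateTable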
I'm attempting to run a MapReduce job from a jar file and keep getting a ClassNotFoundException error. I'm running Hadoop 1.2.1 on a CentOS 6 virtual machine.
First I compiled the file exercise.java and packaged the class into a jar file, exercise.jar, using the following shell script, compile.sh:
#!/bin/bash
javac -classpath /pathto/hadoop-common-1.2.1.jar:\
/pathto/hadoop-core-1.2.1.jar /pathto/exercise.java
jar cvf exercise.jar /pathto/*.class
This runs fine and the jar is created successfully. I then attempt to run the actual MapReduce job using the shell script exec.sh:
#!/bin/bash
export CLASSPATH=$CLASSPATH:/pathto/hadoop-common-1.2.1.jar:\
/pathto/hadoop-core-1.2.1.jar:/pathto/exercise.class
hadoop jar exercise.jar exercise /data/input/inputfile.txt /data/output
This throws the ClassNotFoundException error:
Exception in thread "main" java.lang.ClassNotFoundException: exercise
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
I realize the explicit path names might not be necessary, but I've been a little desperate to double-check everything. I've confirmed that in my exercise.java file, the class is set in the job configuration via job.setJarByClass(exercise.class), and I've confirmed that exercise.class is contained in exercise.jar. I can't seem to figure it out.
UPDATE
Here is the exec.sh script with the full path to exercise.class, which is stored in my Eclipse project directory:
#!/bin/bash
export CLASSPATH=$CLASSPATH:/pathto/hadoop-common-1.2.1.jar:\
/pathto/hadoop-core-1.2.1.jar:/home/username/workspace/MVN_Hadoop/src/main/java.com.amend.hadoop.MapReduce/*
hadoop jar \
exercise.jar \
/home/username/workspace/MVN_Hadoop/src/main/java.com.amend.hadoop.MapReduce/exercise \
/data/input/inputfile.txt \
/data/output
When I actually try to run the exec.sh script using the explicitly written-out path names, I also get a completely different set of errors:
Exception in thread "main" java.lang.ClassNotFoundException: /home/hdadmin/workspace/MVN_Hadoop/src/main/java/come/amend/hadoop/MapReduce/exercise
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
I can see these possible errors.
In hadoop jar exercise.jar exercise /data/input/inputfile.txt /data/output, please specify the fully qualified name of the exercise class, i.e. org.name.package.exercise if it is in a package. To cross-check, open the jar file and look at where exercise.class is located inside it.
Also, Hadoop doesn't expect jars to be included within the jars, since the Hadoop classpath is set globally.
NEW:
See, the following path is something weird: "/home/hdadmin/workspace/MVN_Hadoop/src/main/java/come/amend/hadoop/MapReduce/exercise"
If you are running from your jar, the class cannot be given as such a specific filesystem path; it could only be "come/amend/hadoop/MapReduce/exercise", i.e. the package-qualified class name, not a path into the source tree.
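For instance, if the class lives in the package com.amend.hadoop.MapReduce (the package name here is inferred from the paths in the question, so treat it as an assumption), the job is launched with the dotted class name rather than a filesystem path:

# pass the fully qualified main class, not a path to the source tree
hadoop jar exercise.jar com.amend.hadoop.MapReduce.exercise /data/input/inputfile.txt /data/output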
I am running a Sqoop installation script on AWS EMR 4.2.0, following this documentation.
After creating the cluster, I submitted my Sqoop script as an argument (at Steps) with s3://elasticmapreduce/libs/script-runner/script-runner.jar / command-runner.jar as the jar file, but I am getting the error below. Can you please help me understand the cause of the problem?
Error:
Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Cannot run program "s3://bmsgcm/spark/install-sqoop.sh" (in directory "."): error=2, No such file or directory
at com.amazonaws.emr.command.runner.ProcessRunner.exec(ProcessRunner.java:139)
at com.amazonaws.emr.command.runner.CommandRunner.main(CommandRunner.java:13)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Cannot run program "s3://bmsgcm/spark/install-sqoop.sh" (in directory "."): error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
at com.amazonaws.emr.command.runner.ProcessRunner.exec(ProcessRunner.java:92)
... 7 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
... 8 more
command-runner.jar can only read local files. You can add a bootstrap script that copies the files from S3 to the local file system.
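A minimal sketch of such a bootstrap script (the S3 path is the one from the error message; the /home/hadoop destination is an assumption about the usual EMR home directory):

#!/bin/bash
# bootstrap action: copy the Sqoop install script from S3 to the local file system,
# so a later command-runner.jar step can run it as a local command
aws s3 cp s3://bmsgcm/spark/install-sqoop.sh /home/hadoop/install-sqoop.sh
chmod +x /home/hadoop/install-sqoop.sh

The step itself would then invoke the local copy, e.g. bash /home/hadoop/install-sqoop.sh, via command-runner.jar.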
Piggybox is correct. Unlike script-runner.jar, which was used on the 2.x and 3.x EMR AMIs, command-runner.jar can only run local commands. A bootstrap script is the best way to do this.
For instance, if you have a few Spark drivers on S3 and a shell script (also on S3) that copies them to the master node for later use with spark-submit in a job flow step, then you might have had a step like this:
Steps=[
    {
        'Name': 'Install My Spark Drivers',
        'ActionOnFailure': 'TERMINATE_JOB_FLOW',
        'HadoopJarStep': {
            'Jar': 's3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar',
            'Args': [
                's3://my-bucket/spark-driver-install.sh',
            ]
        }
    },
    ...other steps...
]
Which, as you've experienced, will fail on EMR 4.x if you just swap in command-runner.jar for script-runner.jar there.
Instead, make a bootstrap action to call the script, like:
BootstrapActions=[
    {
        'Name': 'Install My Spark Drivers',
        'ScriptBootstrapAction': {
            'Path': 's3://my-bucket/spark-driver-install.sh',
            'Args': []
        }
    }
]
The above example is expressed as boto3 run_job_flow kwargs. It's not immediately obvious to me how to accomplish the same thing in the web console, though.
For executing a script you can use script-runner. I was also facing the same issue: my script had ^M (carriage return) characters, which were causing it. Removing those worked.
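A quick way to strip those Windows line endings before uploading the script (the file name is taken from the question; dos2unix may need to be installed first):

# remove trailing ^M (carriage return) characters in place
sed -i 's/\r$//' install-sqoop.sh
# or, if dos2unix is available
dos2unix install-sqoop.sh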
I have installed Hadoop successfully and now I want to run Wordcount.jar. As shown below, my source path is /user/amir/dft/pg5000.txt and the destination path to save the results is /user/amir/dft/output.txt.
I downloaded the .jar file from this url.
Now I'm facing this error message when I run the command below. I followed the instructions found at this url, and now my problem is at the "Run the MapReduce job" step. How can I overcome it?
amir#amir-Aspire-5820TG:/usr/local/hadoop$ bin/hadoop jar /usr/local/hadoop/wordcount.jar wordcount /user/amir/dft/pg5000.txt /user/amir/dft/output.txt
Exception in thread "main" java.lang.ClassNotFoundException: wordcount
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.hadoop.util.RunJar.main(RunJar.java:205)
amir#amir-Aspire-5820TG:/usr/local/hadoop$
It means you have a typo or something wrong with the main class you're specifying. Do you mean org.apache.hadoop.examples.WordCount instead of wordcount?
You don't need to download a new .jar file. A wordcount jar is already included among the Hadoop examples. Just use the command:
bin/hadoop jar hadoop*examples*.jar wordcount /user/amir/dft /user/amir/dft-output
The input and output paths should be directories on HDFS, not files. This will run the wordcount program on all the files uploaded to HDFS under the /user/amir/dft/ path (including your pg5000.txt file).
EDIT: If you want to run this specific jar that you have downloaded, though, follow #samthebest's answer (keeping in mind that the input&output paths are directories).
EDIT2: Following the comments on this answer, it seems that the Hadoop version used is newer than the one described in the tutorial, so the .jar for the wordcount program is located at hadoop_root/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar, as mentioned in this post.
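On such a 2.x layout the command looks roughly like this (the exact examples jar version depends on your installation, so treat the file name as an assumption):

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /user/amir/dft /user/amir/dft-output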
I am trying to run MapReduce on Pentaho 5. For Pentaho 5, the Pentaho applications come pre-configured for Apache Hadoop 0.20.2, and the documentation says no further configuration is required for this version. I installed Hadoop 0.20.2 on Windows using Cygwin and everything works fine. I ran a simple job in Pentaho which copies a file into HDFS; it finished successfully and the file was copied into HDFS. But as soon as I run a MapReduce job, Pentaho says the job finished, yet the MapReduce task failed, the result is missing from the output directory on HDFS, and the log file says:
Error: java.lang.ClassNotFoundException: org.pentaho.di.trans.step.RowListener
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:807)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:833)
at org.apache.hadoop.mapred.JobConf.getMapRunnerClass(JobConf.java:790)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170
Please help me out.
Maybe this is a little bit old, but I thought it might help someone.
This can be caused by:
For Hadoop versions > 0.20: check that you set up your environment correctly; see Pentaho Support: Creating a New Hadoop Configuration
Check on HDFS whether you have an /opt/pentaho/mapreduce/ folder and check the folder permissions on HDFS for /opt/pentaho (did you find the kettle-*.jar files in the lib folder?). See the sketch after this list.
Check the classpath separator ("," on Windows, ":" on Linux). In order to change it, edit spoon.sh (or spoon.bat) and modify the OPT variable like this: OPT="$OPT -Dhadoop.cluster.path.separator=,"
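A minimal sketch of the HDFS checks from the second point above (paths as given there; adjust them to your installation):

# verify that the Pentaho Kettle jars were staged to HDFS
hadoop fs -ls /opt/pentaho/mapreduce/
# check the permissions of /opt/pentaho (shown in the listing of /opt)
hadoop fs -ls /opt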