Error using sphinx4 jars without Maven - java

I have a problem with the Sphinx4 API and I can't figure out why it doesn't work.
I am trying to write a small class that captures a user's voice and writes what they say to a file.
1) I created a new Java project in Eclipse.
2) I created the class TranscriberDemo.
3) I created a folder "file".
4) I copied the folder "en-us" and the files "cmudict-en-us.dict", "en-us.lm.dmp", "10001-90210-01803.wav" into the folder "file".
5) I don't use Maven, so I just included the jar files "sphinx4-core-1.0-SNAPSHOT.jar" and "sphinx4-data-1.0-SNAPSHOT.jar".
You can download them here:
core: https://1fichier.com/?f3y6vqupdr
data: https://1fichier.com/?lpzz8jyerv
I know that the source code is available
here: https://github.com/erka/sphinx-java-api
or here: http://sourceforge.net/projects/cmusphinx/files/sphinx4
But I don't use Maven, so I can't compile them.
My class:
import java.io.InputStream;
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;
import edu.cmu.sphinx.result.WordResult;
public class TranscriberDemo
{
    public static void main(String[] args) throws Exception
    {
        System.out.println("Loading models...");
        Configuration configuration = new Configuration();
        // Load model from the jar
        configuration.setAcousticModelPath("file:en-us");
        configuration.setDictionaryPath("file:cmudict-en-us.dict");
        configuration.setLanguageModelPath("file:en-us.lm.dmp");
        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        InputStream stream = TranscriberDemo.class.getResourceAsStream("file:10001-90210-01803.wav");
        stream.skip(44);
        // Simple recognition with generic model
        recognizer.startRecognition(stream);
        SpeechResult result;
        while ((result = recognizer.getResult()) != null)
        {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
            System.out.println("List of recognized words and their times:");
            for (WordResult r : result.getWords())
            {
                System.out.println(r);
            }
            System.out.println("Best 3 hypothesis:");
            for (String s : result.getNbest(3))
                System.out.println(s);
        }
        recognizer.stopRecognition();
    }
}
My log:
Loading models...
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/base/Function
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:191)
at edu.cmu.sphinx.util.props.ConfigurationManager.getPropertySheet(ConfigurationManager.java:91)
at edu.cmu.sphinx.util.props.ConfigurationManagerUtils.listAllsPropNames(ConfigurationManagerUtils.java:556)
at edu.cmu.sphinx.util.props.ConfigurationManagerUtils.setProperty(ConfigurationManagerUtils.java:609)
at edu.cmu.sphinx.api.Context.setLocalProperty(Context.java:198)
at edu.cmu.sphinx.api.Context.setAcousticModel(Context.java:88)
at edu.cmu.sphinx.api.Context.<init>(Context.java:61)
at edu.cmu.sphinx.api.Context.<init>(Context.java:44)
at edu.cmu.sphinx.api.AbstractSpeechRecognizer.<init>(AbstractSpeechRecognizer.java:37)
at edu.cmu.sphinx.api.StreamSpeechRecognizer.<init>(StreamSpeechRecognizer.java:35)
at TranscriberDemo.main(TranscriberDemo.java:27)
Caused by: java.lang.ClassNotFoundException: com.google.common.base.Function
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 12 more
Thanks for your help =)

There are multiple issues with your code and your setup:
3) I created a folder "file".
Not needed.
4) I copied the folder "en-us" and the files "cmudict-en-us.dict", "en-us.lm.dmp", "10001-90210-01803.wav" into the folder "file".
Not needed; the models are already part of the sphinx4-data package.
5) I don't use Maven, so I just included the jar files "sphinx4-core-1.0-SNAPSHOT.jar" and "sphinx4-data-1.0-SNAPSHOT.jar".
This is very wrong, because you took outdated jars from an unofficial location. The right place to download the jars is listed in the tutorial, http://oss.sonatype.org:
https://oss.sonatype.org/service/local/repositories/snapshots/content/edu/cmu/sphinx/sphinx4-core/1.0-SNAPSHOT/sphinx4-core-1.0-20150223.210646-7.jar
https://oss.sonatype.org/service/local/repositories/snapshots/content/edu/cmu/sphinx/sphinx4-data/1.0-SNAPSHOT/sphinx4-data-1.0-20150223.210601-7.jar
Jars taken from some random website might have a virus or rootkit in them.
here: https://github.com/erka/sphinx-java-api
This is the wrong link too. The correct link is http://github.com/cmusphinx/sphinx4
InputStream stream = TranscriberDemo.class.getResourceAsStream("file:10001-90210-01803.wav");
Here you pass a file: URL to getResourceAsStream, which expects the name of a classpath resource, so the lookup does not find your file. If you want to create an InputStream from a file, do it like this:
InputStream stream = new FileInputStream(new File("10001-90210-01803.wav"));
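To make the difference concrete, here is a minimal sketch that uses only the JDK. The class name StreamSourceDemo and the file names are made up for illustration. It shows that getResourceAsStream looks a name up on the classpath and returns null when nothing matches, while FileInputStream opens a path on disk.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;

public class StreamSourceDemo {

    /** Opens a file from the file system; throws FileNotFoundException if it is missing. */
    public static InputStream openFromDisk(File file) throws IOException {
        return new FileInputStream(file);
    }

    /** Looks a name up on the classpath; returns null when no such resource exists. */
    public static InputStream openFromClasspath(String resourceName) {
        return StreamSourceDemo.class.getResourceAsStream(resourceName);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for the WAV file: a temporary file with four bytes.
        File tmp = File.createTempFile("sample", ".wav");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), new byte[] {1, 2, 3, 4});

        try (InputStream in = openFromDisk(tmp)) {
            System.out.println("first byte read from disk: " + in.read());
        }

        // The "file:" prefix is not treated as a URL here; it is just part of the resource name.
        InputStream missing = openFromClasspath("file:no-such-audio.wav");
        System.out.println("classpath lookup returned null: " + (missing == null));
    }
}
```

Checking the result of getResourceAsStream for null before calling methods such as skip on it avoids a NullPointerException further down.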
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/base/Function
This error occurs because you took a jar from another place, and that jar needs additional dependencies. When you see NoClassDefFoundError, it means you need to add another jar to your classpath. With the official sphinx4 jars you should not see this error.
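A quick way to confirm which classes are missing before starting the recognizer is to probe for them by name. The following is a minimal sketch using only the JDK; the class name ClasspathCheck and the list of names to probe are illustrative assumptions, with com.google.common.base.Function taken from the stack trace (that class ships in Guava).

```java
public class ClasspathCheck {

    /** Returns true if the named class can be found by this class's class loader. */
    public static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The first name is the class reported as missing in the stack trace.
        String[] required = {
            "com.google.common.base.Function",
            "java.lang.String"
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

Any name reported as MISSING points to a jar that still has to be added to the classpath.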

Solved.
In fact, it was a silly mistake...
Thank you @Nikolay for your answer. I have already accepted it, but I summarize the process here:
1) Download the sphinx4-core and sphinx4-data jars from https://oss.sonatype.org/#nexus-search;quick~sphinx4.
2) Include them in your project.
3) Test your code.
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;
public class SpeechToText
{
    public static void main(String[] args) throws Exception
    {
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.dmp");
        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
        recognizer.startRecognition(true);
        SpeechResult result;
        while ((result = recognizer.getResult()) != null)
        {
            System.out.println(result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}
And that is all!
If you need the source code of Sphinx4: https://github.com/cmusphinx/sphinx4

Related

java.lang.NoClassDefFoundError when trying to load class from JAR

I am working on a project that is supposed to parse text from PDF files.
Because it has multiple dependencies, I decided to build a combined JAR with all the dependencies and the classes.
However, when I build the JAR including dependencies via IntelliJ IDEA, the program throws NoClassDefFoundError, even though the JAR file is added properly and I can import the class (please refer to the screenshot).
At first, I thought the jar wasn't on the classpath. However, even if I add -cp TessaractPDF.jar through VM Options, the class still isn't found.
I think it is worth mentioning that everything works smoothly if I build the JAR without dependencies and add the dependencies manually.
What should I do?
Exception in thread "main" java.lang.NoClassDefFoundError: me/afifaniks/parsers/TessPDFParser
at Test.main(Test.java:20)
Caused by: java.lang.ClassNotFoundException: me.afifaniks.parsers.TessPDFParser
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 1 more
Code Snippet:
import me.afifaniks.parsers.TessPDFParser;
import java.io.IOException;
import java.util.HashMap;
public class Test {
    public static void main(String[] args) throws IOException {
        System.out.println(System.getProperty("java.classpath"));
        HashMap<String, Object> arguments = new HashMap<>();
        arguments.put("imageMode", "binary");
        arguments.put("toFile", false);
        arguments.put("tessDataPath", "/home/afif/Desktop/PDFParser/tessdata");
        TessPDFParser pdfParser = new TessPDFParser("hiers15.pdf", arguments);
        String text = (String) pdfParser.convert();
        System.out.println(text);
    }
}
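One way to narrow a problem like this down is to print what the JVM actually has on its classpath at run time. The sketch below is only a diagnostic aid, not a fix. The class name ClasspathPrinter is made up for illustration. It relies on the standard system property java.class.path; System.getProperty returns null for a key that has not been set.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ClasspathPrinter {

    /** Splits a classpath string into its entries using the platform separator. */
    public static List<String> entries(String classpath) {
        List<String> result = new ArrayList<>();
        if (classpath == null) {
            return result;
        }
        for (String entry : classpath.split(File.pathSeparator)) {
            if (!entry.isEmpty()) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "java.class.path" is the standard key for the application classpath.
        for (String entry : entries(System.getProperty("java.class.path"))) {
            File f = new File(entry);
            System.out.println((f.exists() ? "ok      " : "missing ") + f.getAbsolutePath());
        }
    }
}
```

If the combined JAR does not appear in this list, or appears as missing, the run configuration is not putting it on the classpath.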

cmu sphinx4 java - Runtime exception caused by FileNotFoundException

I have recently made a Java project with Sphinx4. I found this code online, and I slimmed it down to this to test if Sphinx4 was working:
public class App
{
    private static final String ACOUSTIC_MODEL =
        "resource:/edu/cmu/sphinx/models/en-us/en-us";
    private static final String DICTIONARY_PATH =
        "resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict";
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath(ACOUSTIC_MODEL);
        configuration.setDictionaryPath(DICTIONARY_PATH);
        configuration.setGrammarName("dialog");
        LiveSpeechRecognizer jsgfRecognizer =
            new LiveSpeechRecognizer(configuration);
        jsgfRecognizer.startRecognition(true);
        while (true) {
            String utterance = jsgfRecognizer.getResult().getHypothesis();
            if (utterance.startsWith("hello")) {
                System.out.println("Hello back!");
            }
            else if (utterance.startsWith("exit")) {
                break;
            }
        }
        jsgfRecognizer.stopRecognition();
    }
}
However, it gave me this error:
Exception in thread "main" java.lang.RuntimeException: Allocation of search manager resources failed
at edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager.allocate(WordPruningBreadthFirstSearchManager.java:247)
at edu.cmu.sphinx.decoder.AbstractDecoder.allocate(AbstractDecoder.java:103)
at edu.cmu.sphinx.recognizer.Recognizer.allocate(Recognizer.java:164)
at edu.cmu.sphinx.api.LiveSpeechRecognizer.startRecognition(LiveSpeechRecognizer.java:47)
at com.weebly.controllingyourcomputer.bartimaeus.App.main(App.java:27)
Caused by: java.io.FileNotFoundException:
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
at java.net.URL.openStream(URL.java:1038)
at edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel.open(SimpleNGramModel.java:403)
at edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel.load(SimpleNGramModel.java:277)
at edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel.allocate(SimpleNGramModel.java:114)
at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.allocate(LexTreeLinguist.java:334)
at edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager.allocate(WordPruningBreadthFirstSearchManager.java:243)
... 4 more
I thought it might be something about it not being able to find the paths for ACOUSTIC_MODEL or DICTIONARY_PATH, so I changed the resource: strings to things like %HOME%\\Downloads\\sphinx4-5prealpha-src\\sphinx4-5prealpha-src\\sphinx4-data\\src\\main\\resources\\edu\\cmu\\sphinx\\models\\en-us or paths with forward slashes or with C:\Users\Username\... but none of the paths worked. I know the paths exist because I copy and pasted them from the properties window of the actual resources.
So my question is: is it some of the code that I deleted from the original source code that is causing this error, is it something wrong with the paths, or is it entirely different?
EDIT
By the way, I am using Maven to build my project. I added the dependencies specified on the Sphinx4 website to my pom.xml, but it didn't work (it didn't recognize imports such as edu.cmu.sphinx.xxx), so I downloaded the JARs from the website they said to download them from and added them to my project's "Libraries" in my Java Build Path in Eclipse.
is it some of the code that I deleted from the original source code that is causing this error
Yes, you deleted too much.
To recognize with a grammar, you need to make three calls:
configuration.setGrammarPath(GRAMMAR_PATH);
configuration.setGrammarName(GRAMMAR_NAME);
configuration.setUseGrammar(true);
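For these three calls to work, a grammar file has to exist at the configured location. The following sketch writes a minimal JSGF grammar named dialog using only the JDK. Several things here are assumptions for illustration: the rule contents (hello and exit, taken from the question's loop), the temporary directory, the file name dialog.gram, and the commented wiring of GRAMMAR_PATH to a directory URL. Check them against the Sphinx4 version you use.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class GrammarFileSketch {

    /** Returns the text of a minimal JSGF grammar named "dialog". */
    public static String dialogGrammar() {
        return String.join("\n",
            "#JSGF V1.0;",
            "grammar dialog;",
            "public <command> = hello | exit;",
            "");
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical location: a fresh temporary directory.
        Path dir = Files.createTempDirectory("grammars");
        Path file = dir.resolve("dialog.gram");
        Files.write(file, dialogGrammar().getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + file);

        // Assumed wiring, mirroring the three calls above:
        //   configuration.setGrammarPath(dir.toUri().toString());
        //   configuration.setGrammarName("dialog");
        //   configuration.setUseGrammar(true);
    }
}
```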

Executing Sample Flink Program in Local

I am trying to execute a sample program in Apache Flink in local mode.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
public class WordCountExample {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> text = env.fromElements(
            "Who's there?",
            "I think I hear them. Stand, ho! Who's there?");
        //DataSet<String> text1 = env.readTextFile(args[0]);
        DataSet<Tuple2<String, Integer>> wordCounts = text
            .flatMap(new LineSplitter())
            .groupBy(0)
            .sum(1);
        wordCounts.print();
        env.execute();
        env.execute("Word Count Example");
    }
    public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
}
It gives me this exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/InputFormat
at WordCountExample.main(WordCountExample.java:10)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.InputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 1 more
What am I doing wrong?
I have also used the correct jars:
flink-java-0.9.0-milestone-1.jar
flink-clients-0.9.0-milestone-1.jar
flink-core-0.9.0-milestone-1.jar
Adding the three Flink Jar files as dependencies in your project is not enough because they have other transitive dependencies, for example on Hadoop.
The easiest way to get a working setup to develop (and locally execute) Flink programs is to follow the quickstart guide which uses a Maven archetype to configure a Maven project. This Maven project can be imported into your IDE.
NoClassDefFoundError extends LinkageError
Thrown if the Java Virtual Machine or a ClassLoader instance tries to
load in the definition of a class (as part of a normal method call or
as part of creating a new instance using the new expression) and no
definition of the class could be found. The searched-for class
definition existed when the currently executing class was compiled,
but the definition can no longer be found.
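The class hierarchy matters in practice: NoClassDefFoundError is an Error, so a catch (Exception e) block does not catch it, whereas ClassNotFoundException is a checked exception. The small sketch below illustrates this with the JDK only; the class name LinkageErrorDemo and the method classify are made up for illustration.

```java
public class LinkageErrorDemo {

    /** Describes how a throwable relates to class-loading failures. */
    public static String classify(Throwable t) {
        if (t instanceof NoClassDefFoundError) {
            return "NoClassDefFoundError (a LinkageError, i.e. an Error)";
        }
        if (t instanceof ClassNotFoundException) {
            return "ClassNotFoundException (a checked Exception)";
        }
        return t.getClass().getName();
    }

    public static void main(String[] args) {
        Throwable[] samples = {
            new NoClassDefFoundError("org/apache/hadoop/mapreduce/InputFormat"),
            new ClassNotFoundException("org.apache.hadoop.mapreduce.InputFormat")
        };
        for (Throwable t : samples) {
            System.out.println(classify(t)
                + ", caught by catch (Exception e): " + (t instanceof Exception));
        }
    }
}
```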
Your code or jar depends on Hadoop. Find the jar that contains org.apache.hadoop.mapreduce.InputFormat, download it, and add it to your classpath.
Firstly, the Flink jar files you have included in your project are not enough; include all the jar files present in the lib folder under Flink's source folder.
Secondly, the lines env.execute(); and env.execute("Word Count Example"); are not required, since you are only printing your dataset to the console and not writing the output to a file (.txt, .csv, etc.). It is better to remove these lines; in my experience they sometimes throw errors when they are included without being needed.
Thirdly, when exporting the jar for your Java project from your IDE, don't forget to select your 'Main' class.
Hopefully, after making the above changes, your code works.

java.lang.NoClassDefFoundError (Java, Eclipse, Fuse-JNA, Ubuntu)

Via Eclipse, I am trying to run the built-in file system example (HelloFS.java) of fuse-jna, but it gives me java.lang.NoClassDefFoundError.
My source project is in /home/syed/workspace/HelloFS
fuse-jna class files are in home/syed/Downloads/fuse-jna-master/build/classes/main/net/fusejna
In Eclipse, I added the class folder via the build path and also the JRE path in the environment file. I attached a snapshot below.
Please help me run this example in eclipse.
error:
Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jna/Structure
at net.fusejna.FuseFilesystem.mount(FuseFilesystem.java:545)
at net.fusejna.FuseFilesystem.mount(FuseFilesystem.java:550)
at HelloFS.main(HelloFS.java:22)
Caused by: java.lang.ClassNotFoundException: com.sun.jna.Structure
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 3 more
Here is the code of the built-in example file system (there are no red underlines, which I think means that the Eclipse build path is set correctly):
import java.io.File;
import java.nio.ByteBuffer;
import net.fusejna.DirectoryFiller;
import net.fusejna.ErrorCodes;
import net.fusejna.FuseException;
import net.fusejna.StructFuseFileInfo.FileInfoWrapper;
import net.fusejna.StructStat.StatWrapper;
import net.fusejna.types.TypeMode.NodeType;
import net.fusejna.util.FuseFilesystemAdapterFull;
public class HelloFS extends FuseFilesystemAdapterFull
{
    public static void main(String args[]) throws FuseException
    {
        /*if (args.length != 1) {
            System.err.println("Usage: HelloFS <mountpoint>");
            System.exit(1);
        }*/
        new HelloFS().log(true).mount("./testfs1");
    }
    private final String filename = "/hello.txt";
    private final String contents = "Hello World!\n";
    @Override
    public int getattr(final String path, final StatWrapper stat)
    {
        if (path.equals(File.separator)) { // Root directory
            stat.setMode(NodeType.DIRECTORY);
            return 0;
        }
        if (path.equals(filename)) { // hello.txt
            stat.setMode(NodeType.FILE).size(contents.length());
            return 0;
        }
        return -ErrorCodes.ENOENT();
    }
    @Override
    public int read(final String path, final ByteBuffer buffer, final long size, final long offset, final FileInfoWrapper info)
    {
        // Compute substring that we are being asked to read
        final String s = contents.substring((int) offset,
            (int) Math.max(offset, Math.min(contents.length() - offset, offset + size)));
        buffer.put(s.getBytes());
        return s.getBytes().length;
    }
    @Override
    public int readdir(final String path, final DirectoryFiller filler)
    {
        filler.add(filename);
        return 0;
    }
}
These are the contents of the environment file:
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
JAVA_HOME="/usr/lib/jvm/java-6-openjdk-i386"
This is the fuse-jna classes path.
I added the /main folder.
========================================================
@Viktor K. Thanks for the help.
The above-mentioned error was fixed by downloading com.sun.jna » jna and adding it to the referenced libraries,
but now it shows me a new error:
Dec 28, 2013 1:18:25 PM HelloFS getName
INFO: Method succeeded. Result: null
Dec 28, 2013 1:18:25 PM HelloFS getOptions
INFO: Method succeeded. Result: null
Exception in thread "main" java.lang.NoSuchMethodError: com.sun.jna.Platform.getOSType()I
at net.fusejna.Platform.init(Platform.java:39)
at net.fusejna.Platform.fuse(Platform.java:26)
at net.fusejna.FuseJna.init(FuseJna.java:113)
at net.fusejna.FuseJna.mount(FuseJna.java:172)
at net.fusejna.FuseFilesystem.mount(FuseFilesystem.java:545)
at net.fusejna.FuseFilesystem.mount(FuseFilesystem.java:550)
at HelloFS.main(HelloFS.java:22)
=======================================================
Hmmm
The one that I downloaded was not compatible, I think.
In the temp folder of fuse-jna,
/home/syed/Downloads/fuse-jna-master/build/tmp/expandedArchives/jna-3.5.2.jar_r4n26u14up0smlb84ivcvfnke/
there were the JNA 3.5.2 classes. I imported them into the library, and now it works fine.
My problem is solved. Thanks a lot.
The problem is not in the Fuse-JNA library. Fuse-JNA obviously depends on the JNA library (which can be found in the public Maven repository, http://mvnrepository.com/artifact/com.sun.jna/jna). You need to add this library as a dependency in your project. You can see that there is no com.sun.jna package among your project's referenced libraries.
In general, if you want to use package A (in your case Fuse-JNA) and package A depends on package B (in your case JNA), you have to add package B to your project yourself as a dependency. It is usually very hard to find out all the required dependencies of the packages you want to use, which is why many projects use Maven (or an alternative such as Gradle). Check this if you want to learn more: Why maven? What are the benefits?. I strongly suggest using a dependency resolution tool (like Maven) over manual dependency resolution.
Another approach is to download a fuse jar with all dependencies, if you believe that it is the only library you'll need. However, adding a jar with bundled dependencies can lead to a big disaster if you later add other dependencies: it could cause a dependency conflict, which is a hard problem to track down.
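When a version conflict like the NoSuchMethodError above is suspected, it helps to see which jar a class was actually loaded from. The sketch below uses only the JDK; the class name JarLocator and the "<JDK runtime>" marker are made up for illustration. Run it with a class name such as com.sun.jna.Platform as the argument, with the same classpath as the failing program.

```java
import java.security.CodeSource;

public class JarLocator {

    /** Returns where a class was loaded from, or a marker when no location is recorded. */
    public static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader (e.g. java.lang.String) have no code source.
        if (source == null || source.getLocation() == null) {
            return "<JDK runtime>";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name to see which jar or folder supplies it.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " loaded from " + locationOf(Class.forName(name)));
    }
}
```

If the printed location is not the jar version the library was built against, the mismatch explains missing methods at run time.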

parsing json input in hadoop java

My input data is in HDFS. I am simply trying to do a word count, but with a slight difference.
The data is in JSON format.
So each line of the data looks like this:
{"author":"foo", "text": "hello"}
{"author":"foo123", "text": "hello world"}
{"author":"foo234", "text": "hello this world"}
I only want to count the words in the "text" part.
How do I do this?
I tried the following variant so far:
public static class TokenCounterMapper
        extends Mapper<Object, Text, Text, IntWritable> {
    private static final Log log = LogFactory.getLog(TokenCounterMapper.class);
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            JSONObject jsn = new JSONObject(value.toString());
            //StringTokenizer itr = new StringTokenizer(value.toString());
            String text = (String) jsn.get("text");
            log.info("Logging data");
            log.info(text);
            StringTokenizer itr = new StringTokenizer(text);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        } catch (JSONException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
But I am getting this error:
Error: java.lang.ClassNotFoundException: org.json.JSONException
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
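The counting logic of the mapper can be exercised locally, without a cluster and without the JSON library, which helps separate logic errors from classpath errors like the one above. The sketch below is a simplification and not a replacement for a real JSON parser: it pulls the "text" value out with a regular expression and assumes the value contains no escaped quotes. The class name TextFieldWordCount is made up for illustration.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextFieldWordCount {

    // Simplification: matches "text": "..." and assumes the value has no escaped quotes.
    private static final Pattern TEXT_FIELD =
        Pattern.compile("\"text\"\\s*:\\s*\"([^\"]*)\"");

    /** Counts the words found in the "text" field of each input line. */
    public static Map<String, Integer> count(String... jsonLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : jsonLines) {
            Matcher m = TEXT_FIELD.matcher(line);
            if (!m.find()) {
                continue; // no "text" field on this line
            }
            // Same tokenizing approach as the mapper: split on whitespace.
            StringTokenizer itr = new StringTokenizer(m.group(1));
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The three sample lines from the question.
        System.out.println(count(
            "{\"author\":\"foo\", \"text\": \"hello\"}",
            "{\"author\":\"foo123\", \"text\": \"hello world\"}",
            "{\"author\":\"foo234\", \"text\": \"hello this world\"}"));
        // prints {hello=3, this=1, world=2}
    }
}
```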
It seems you forgot to embed the JSON library in your Hadoop job jar.
You can have a look here to see how to build your job with the library:
http://tikalk.com/build-your-first-hadoop-project-maven
There are several ways to use external jars with your MapReduce code:
1) Include the referenced JAR in the lib subdirectory of the submittable JAR. The job will unpack the JAR from this lib subdirectory into the jobcache on the respective TaskTracker nodes and point your tasks to this directory to make the JAR available to your code. If the JARs are small, change often, and are job-specific, this is the preferred method. This is what @clement suggested in his answer.
2) Install the JAR on the cluster nodes. The easiest way is to place the JAR into the $HADOOP_HOME/lib directory, as everything in this directory is included when a Hadoop daemon starts. Note that the daemons need to be restarted for this to take effect.
3) TaskTrackers will be using the external JAR, so you can provide it by modifying the HADOOP_TASKTRACKER_OPTS option in the hadoop-env.sh configuration file so that it points to the jar. The jar needs to be present at the same path on all the nodes where a TaskTracker runs.
4) Include the JAR in the “-libjars” command line option of the hadoop jar … command. The jar will be placed in the distributed cache and will be made available to all of the job's task attempts. Your MapReduce code must use GenericOptionsParser. For more details read this blog post.
Comparison:
#1 is a legacy method, but it is discouraged because it has a large negative performance cost.
#2 and #3 are good for private clusters, but they are poor practice in general, as you cannot expect end users to do that.
#4 is the most recommended option.
(Read the main post from Cloudera.)
