I'm trying to run a Java application from a YARN application (more precisely: from the ApplicationMaster in the YARN app). All examples I have found deal with bash scripts that are run.
My problem seems to be that I distribute the JAR file incorrectly to the nodes in my cluster. I specify the JAR as a local resource in the YARN client:
// Locate the JAR in HDFS and read its metadata.
Path jarPath2 = new Path("/hdfs/yarn1/08_PrimeCalculator.jar");
jarPath2 = fs.makeQualified(jarPath2);
FileStatus jarStat2 = null;
try {
    jarStat2 = fs.getFileStatus(jarPath2);
    log.log(Level.INFO, "JAR path in HDFS is " + jarStat2.getPath());
} catch (IOException e) {
    e.printStackTrace();
}

// Register the JAR as a local resource for the containers.
LocalResource packageResource = Records.newRecord(LocalResource.class);
packageResource.setResource(ConverterUtils.getYarnUrlFromPath(jarPath2));
packageResource.setSize(jarStat2.getLen());
packageResource.setTimestamp(jarStat2.getModificationTime());
packageResource.setType(LocalResourceType.ARCHIVE);
packageResource.setVisibility(LocalResourceVisibility.PUBLIC);

Map<String, LocalResource> res = new HashMap<String, LocalResource>();
res.put("package", packageResource);
So my JAR is supposed to be distributed to the ApplicationMaster and unpacked, since I specify the resource type as ARCHIVE. On the AM I try to call a class from the JAR like this:
String command = "java -cp './package/*' de.jofre.prime.PrimeCalculator";
The Hadoop logs tell me when running the application: "Could not find or load main class de.jofre.prime.PrimeCalculator". The class exists at exactly the path that is shown in the error message.
Any ideas what I am doing wrong here?
I found out how to start a Java process from an ApplicationMaster. In fact, my problem was caused by the command used to start the process, even though it is the officially documented approach provided by the Apache Hadoop project.
What I did now was to specify the packageResource to be a file, not an archive:
packageResource.setType(LocalResourceType.FILE);
Now the NodeManager does not extract the resource but leaves it as a file, in my case as a JAR.
To start the process I call:
java -jar primecalculator.jar
To start a JAR without specifying a main class on the command line, you have to specify the main class in the JAR's MANIFEST file (manually or by letting Maven do it for you).
To sum it up: I did NOT add the resource as an archive but as a file, and I did not use -cp to add the symlink folder that Hadoop creates for the extracted archive. I simply started the JAR via the -jar parameter, and that's it.
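For reference, here is a rough sketch of how the relevant client-side pieces could look after this change. It continues the resource setup from the question above; the resource name primecalculator.jar, the ContainerLaunchContext setup and the log redirection are assumptions on my side, not the exact code I used:

// Sketch only: register the JAR as a FILE (not an ARCHIVE) and start it with "java -jar".
// Additionally needs: org.apache.hadoop.yarn.api.records.ContainerLaunchContext,
// org.apache.hadoop.yarn.api.ApplicationConstants, java.util.Collections
packageResource.setType(LocalResourceType.FILE);

Map<String, LocalResource> res = new HashMap<String, LocalResource>();
// The key is the file name the NodeManager creates in the container's working directory.
res.put("primecalculator.jar", packageResource);

ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
ctx.setLocalResources(res);
// Launch the JAR directly; the main class comes from its MANIFEST.
ctx.setCommands(Collections.singletonList(
        "java -jar primecalculator.jar"
        + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
        + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));

Note that the map key is the link name the NodeManager creates in the container's working directory, so it has to match the file name used in the command.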
Hope it helps you guys!
Related
I have two machines named: ubuntu1 and ubuntu2.
On ubuntu1 I started the master of a Spark standalone cluster, and on ubuntu2 I started a worker (slave).
I am trying to execute the wordCount example available on GitHub.
When I submit the application, the worker sends an error message:
java.io.FileNotFoundException: File file:/home/ubuntu1/demo/test.txt does not exist.
My command line is
./spark-submit --master spark://ubuntu1-VirtualBox:7077 --deploy-mode cluster --class br.com.wordCount.App -v --name "Word Count" /home/ubuntu1/demo/wordCount.jar /home/ubuntu1/demo/test.txt
Does the file test.txt only need to be on one machine?
Note: the master and the worker are on different machines.
Thank you
I got the same problem while loading a JSON file. I realized that by default Windows saves the file as a text file regardless of the name. Identify the actual file format and then you can load it easily.
Example: suppose you saved the file as test.JSON, but by default Windows appends .txt to it.
Check that and try running it again.
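If you want to check this from code, here is a small sketch that just prints the real file names in a folder (the folder path C:/data is only an example), so a hidden .txt extension such as test.JSON.txt becomes visible:

import java.io.File;

public class ListRealNames {
    public static void main(String[] args) {
        File dir = new File("C:/data");  // example folder; adjust to your own
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                // Prints the full name, extension included, e.g. "test.JSON.txt"
                System.out.println(f.getName());
            }
        }
    }
}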
I hope your problem will get resolved with this idea.
Thank you.
You should put your file on HDFS by going to the folder and typing:
hdfs dfs -put <file>
Otherwise, every node has to have access to it, which means the same folder path must exist on each machine.
Don't forget to change file:/ to hdfs:/ after you do that
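If you prefer to do the same thing from Java, here is a rough sketch of the equivalent of hdfs dfs -put using the Hadoop FileSystem API (the target directory /demo on HDFS is just an example I picked):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFileToHdfs {
    public static void main(String[] args) throws Exception {
        // Uses the cluster configuration found on the classpath (core-site.xml etc.)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Copy the local file into HDFS so every worker can read it
        fs.copyFromLocalFile(new Path("/home/ubuntu1/demo/test.txt"),
                             new Path("/demo/test.txt"));
        fs.close();
    }
}

After that, pass the hdfs:/ path (instead of the file:/ path) to spark-submit.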
I can get the working directory of the current Java program using this code:
Path path = Paths.get(*ClassName*.class.getProtectionDomain().getCodeSource().getLocation().toURI());
I can also get the command-line parameters (but there is no directory in the output) of running Java processes using this command: wmic process get CommandLine where name='java.exe' /value
Is it possible to get the working directory of another Java process (preferably programmatically)? Can it perhaps be solved with some of the jdk/bin utilities?
You can get this information via the Attach API. To use it, you have to add the tools.jar of your jdk to your class path. Then, the following code will print the current working directories of all recognized JVM processes:
// Requires tools.jar on the classpath; imports: com.sun.tools.attach.VirtualMachine,
// com.sun.tools.attach.VirtualMachineDescriptor, com.sun.tools.attach.AttachNotSupportedException,
// java.io.Closeable, java.io.IOException
for(VirtualMachineDescriptor d: VirtualMachine.list()) {
    System.out.println(d.id()+"\t"+d.displayName());
    try {
        VirtualMachine vm = VirtualMachine.attach(d);
        try(Closeable c = vm::detach) {  // make sure we detach again
            // "user.dir" of the target JVM is its current working directory
            System.out.println("\tcurrent dir: "+vm.getSystemProperties().get("user.dir"));
        }
    }
    catch(AttachNotSupportedException|IOException ex) {
        System.out.println("\t"+ex);
    }
}
I added KafkaLog4JAppender functionality to my MR job.
Locally the job runs and sends the formatted logs into my Kafka cluster.
When I try to run it from the YARN server, using:
jar [jar-name].jar [DriverClass].class [job-params] -Dlog4j.configuration=log4j.xml -libjars
I get the following exception:
log4j:ERROR Could not create an Appender. Reported error follows.
java.lang.ClassNotFoundException: kafka.producer.KafkaLog4jAppender
The KafkaLog4JAppender class is on the path. Running
jar tvf [my-jar].jar | grep KafkaLog4J
finds the class.
I'm kinda lost and would appreciate any helpful input.
Thanks in advance!
If it works in local mode but not in YARN/distributed mode, it could be a problem of the jar not being distributed properly. You might want to check "Using third-party jars and files in your MapReduce application (distributed cache)" for details on how to distribute the jar containing KafkaLog4jAppender.class.
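As a rough sketch of that approach (the HDFS path of the appender jar below is just an example, adjust it to wherever you actually upload it), the driver could add the jar to the task classpath before submitting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class DriverWithAppenderJar {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "mr-job-with-kafka-appender");
        // Ships the jar through the distributed cache and puts it on the task classpath
        job.addFileToClassPath(new Path("/libs/kafka-log4j-appender.jar"));
        // ... set mapper, reducer, input/output paths, then submit as usual
    }
}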
I have been breaking my head for two days trying to fix the file permissions for my tomcat7 server. I have a library class (a .jar file included in myapp/WEB-INF) which needs to run a shell script. The library is written by me and works fine within NetBeans, i.e. no hassle in creating, reading and deleting files. That is because NetBeans runs the program as blumonkey (my username on my Ubuntu system). But when I import this into Tomcat and run it, Tomcat "executes" the command, produces no definite output, tries to check for a file (which will be generated when the script succeeds) and throws a FileNotFoundException.
More Details as follows:
Tomcat7 was installed using apt-get and has its data in two locations: /var/lib/tomcat7 with the conf and webapps folders, and /usr/share/tomcat7 with the bin and lib folders.
The user uploads a .zip file which is stored to /home/blumonkey/data. The rest of the program works on the documents stored there. All new folders/files uploaded by Tomcat have, obviously, tomcat7 as the owner.
I have tried things like changing the ownership to blumonkey and adding tomcat7 to the blumonkey user group, but none of these methods worked (somewhere around here I probably messed up by changing permissions carelessly :/ ). Apparently tomcat7 is unable to work on the files it owns (how can this be?).
The script works when I run it in the terminal, but it doesn't work when I do sudo -u tomcat7 script.sh, i.e. run it as tomcat7. It just exits with no message. I doubt that this is what is happening, as I have tried to debug by redirecting the errors and outputs in ProcessBuilder, but they came back empty.
Any help regarding how to fix the issue and get the script running would be greatly appreciated. Please comment if you need any more info.
The code for script execution
private static void RunShellCommandFromJava(String command, String fn, String arg1, String arg2) throws Exception
{
    try
    {
        System.out.println(System.getProperty("user.name"));
        ProcessBuilder pbuilder = new ProcessBuilder("/bin/bash", command, fn, arg1, arg2);
        System.out.println(pbuilder.command());
        pbuilder.redirectErrorStream(true);
        Process p = pbuilder.start();
        p.waitFor();
    }
    catch(Exception ie)
    {
        throw ie;
    }
}
The command which needs to be executed
"/bin/bash /abs/path/to/script.sh /abs/path/to/doc/in/data-folder maxpages=30 maxsearches=3"
PS: I have followed this question but it didn't help. I also tried other options like Runtime.exec(), bash, /bin/bash/ and /bin/bash/ -c; some of them don't work at all, others give no results.
Try using Runtime and check standard error to find out what the problem was (probably permissions or paths):
// Imports needed: java.io.BufferedReader, java.io.InputStreamReader
// run command
String[] fixCmd = new String[] { "/bin/bash", "/abs/path/to/script.sh", "/abs/path/to/doc/in/data-folder", "maxpages=30", "maxsearches=3" };
Process start = Runtime.getRuntime().exec(fixCmd);
// monitor standard error to find out what's wrong
BufferedReader r = new BufferedReader(new InputStreamReader(start.getErrorStream()));
String line = null;
while ((line = r.readLine()) != null) {
    System.out.println(line);
}
// wait for the script to finish and check its exit code
System.out.println("exit code: " + start.waitFor());
I am trying to add external jars to the Hadoop classpath, but no luck so far.
I have the following setup
$ hadoop version
Hadoop 2.0.6-alpha
Subversion https://git-wip-us.apache.org/repos/asf/bigtop.git -r ca4c88898f95aaab3fd85b5e9c194ffd647c2109
Compiled by jenkins on 2013-10-31T07:55Z
From source with checksum 95e88b2a9589fa69d6d5c1dbd48d4e
This command was run using /usr/lib/hadoop/hadoop-common-2.0.6-alpha.jar
Classpath
$ echo $HADOOP_CLASSPATH
/home/tom/workspace/libs/opencsv-2.3.jar
I am able to see that the above HADOOP_CLASSPATH has been picked up by hadoop:
$ hadoop classpath
/etc/hadoop/conf:/usr/lib/hadoop/lib/:/usr/lib/hadoop/.//:/home/tom/workspace/libs/opencsv-2.3.jar:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/:/usr/lib/hadoop-hdfs/.//:/usr/lib/hadoop-yarn/lib/:/usr/lib/hadoop-yarn/.//:/usr/lib/hadoop-mapreduce/lib/:/usr/lib/hadoop-mapreduce/.//
Command
$ sudo hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/root/1987.csv /user/root/result
I tried with the -libjars option as well:
$ sudo hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/root/1987.csv /user/root/result -libjars /home/tom/workspace/libs/opencsv-2.3.jar
The stack trace:
14/11/04 16:43:23 INFO mapreduce.Job: Running job: job_1415115532989_0001
14/11/04 16:43:55 INFO mapreduce.Job: Job job_1415115532989_0001 running in uber mode : false
14/11/04 16:43:56 INFO mapreduce.Job: map 0% reduce 0%
14/11/04 16:45:27 INFO mapreduce.Job: map 50% reduce 0%
14/11/04 16:45:27 INFO mapreduce.Job: Task Id : attempt_1415115532989_0001_m_000001_0, Status : FAILED
Error: java.lang.ClassNotFoundException: au.com.bytecode.opencsv.CSVParser
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at FlightsByCarrierMapper.map(FlightsByCarrierMapper.java:19)
at FlightsByCarrierMapper.map(FlightsByCarrierMapper.java:10)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:757)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153)
Any help is highly appreciated.
Your external jar is missing on the nodes running the map tasks. You have to add it to the distributed cache to make it available. Try:
DistributedCache.addFileToClassPath(new Path("pathToJar"), conf);
Not sure in which version DistributedCache was deprecated, but from Hadoop 2.2.0 onward you can use:
job.addFileToClassPath(new Path("pathToJar"));
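To put those calls in context, here is a rough sketch of a driver that registers the opencsv jar this way (the HDFS path of the jar is just an example, and the rest of the job setup is only indicated):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class FlightsByCarrierDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Pre-2.2 style: register the jar (already uploaded to HDFS) before creating the Job
        DistributedCache.addFileToClassPath(new Path("/libs/opencsv-2.3.jar"), conf);
        Job job = Job.getInstance(conf, "FlightsByCarrier");
        // On Hadoop 2.2.0+ the same effect: job.addFileToClassPath(new Path("/libs/opencsv-2.3.jar"));
        // ... configure mapper, reducer, input/output paths, then submit
    }
}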
If you are adding the external JAR to the Hadoop classpath, then it's better to copy your JAR to one of the existing directories that Hadoop is looking at. On the command line, run "hadoop classpath", find a suitable folder, and copy your jar file to that location; Hadoop will pick up the dependencies from there. This won't work with Cloudera etc. as you may not have read/write rights to copy files to the Hadoop classpath folders.
It looks like you tried the -libjars option as well; did you edit your driver class to implement the Tool interface? First make sure that you edit your driver class as shown below:
public class myDriverClass extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new myDriverClass(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();
        Job job = new Job(conf, "My Job");
        ...
        ...
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
Now edit your "hadoop jar" command as shown below:
hadoop jar YourApplication.jar [myDriverClass] args -libjars path/to/jar/file
Now let's understand what happens underneath. Basically, we handle the new command-line arguments by implementing the Tool interface. ToolRunner is used to run classes implementing the Tool interface. It works in conjunction with GenericOptionsParser to parse the generic Hadoop command-line arguments and modifies the Configuration of the Tool.
Within our main() we call ToolRunner.run(new Configuration(), new myDriverClass(), args). This runs the given Tool via Tool.run(String[]), after parsing the given generic arguments. It uses the given Configuration, or builds one if it is null, and then sets the Tool's configuration to the possibly modified version of the conf.
Now, within the run method, when we call getConf() we get the modified version of the Configuration. So make sure that you have the line below in your code. If you implement everything else but still use Configuration conf = new Configuration(), nothing will work.
Configuration conf = getConf();
I tried setting the opencsv jar in the Hadoop classpath but it didn't work. We need to explicitly copy the jar into the classpath for this to work. That is what worked for me.
Below are the steps I followed:
I have done this on an HDP cluster. I copied my opencsv jar into the HBase libs and exported it before running my jar.
Copy external JARs to the HDP libs:
To run the opencsv jar:
1. Copy the opencsv jar into the directories /usr/hdp/2.2.9.1-11/hbase/lib/ and /usr/hdp/2.2.9.1-11/hadoop-yarn/lib
sudo cp /home/sshuser/Amedisys/lib/opencsv-3.7.jar /usr/hdp/2.2.9.1-11/hbase/lib/
2. Give the file permissions using
sudo chmod 777 opencsv-3.7.jar
3. List the files:
ls -lrt
4. Export the Hadoop classpath so that it includes the HBase classpath.
5. Now run your jar. It will pick up the opencsv jar and execute properly.
I found another workaround by implementing ToolRunner as below. With this approach, Hadoop accepts command-line options, and we can avoid hard-coding the addition of files to the DistributedCache.
public class FlightsByCarrier extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();

        // Create a JobConf using the processed conf
        JobConf job = new JobConf(conf, FlightsByCarrier.class);

        // Process custom command-line options
        Path in = new Path(args[1]);
        Path out = new Path(args[2]);

        // Specify various job-specific parameters
        job.setJobName("my-app");
        FileInputFormat.setInputPaths(job, in);     // old mapred API helpers
        FileOutputFormat.setOutputPath(job, out);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Submit the job, then poll for progress until the job is complete
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Let ToolRunner handle generic command-line options
        int res = ToolRunner.run(new Configuration(), new FlightsByCarrier(), args);
        System.exit(res);
    }
}
I found a very easy solution to the problem:
Log in as root:
cd /usr/lib
find . -name "opencsv.jar"
Pick up the location of the file. In my case I found it under /usr/lib/hive/lib/opencsv*.jar
Now submit the command
hadoop classpath
The result shows the directories where Hadoop searches for jar files.
Pick one directory and copy opencsv*.jar to that directory.
In my case it worked.