Hadoop Hive UDF with external library - java

I'm trying to write a UDF for Hadoop Hive that parses user agents. The following code works fine on my local machine, but on Hadoop I'm getting:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String MyUDF.evaluate(java.lang.String) throws org.apache.hadoop.hive.ql.metadata.HiveException on object MyUDF@64ca8bfb of class MyUDF with arguments {All Occupations:java.lang.String} of size 1
Code:
import java.io.IOException;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.*;
import com.decibel.uasparser.OnlineUpdater;
import com.decibel.uasparser.UASparser;
import com.decibel.uasparser.UserAgentInfo;
public class MyUDF extends UDF {
public String evaluate(String i) {
UASparser parser = null;
parser = new UASparser();
String key = "";
OnlineUpdater update = new OnlineUpdater(parser, key);
UserAgentInfo info = null;
info = parser.parse(i);
return info.getDeviceType();
}
}
Facts that come to my mind I should mention:
I'm building the jar in Eclipse with "Export runnable JAR file" and the "Extract required libraries into generated JAR" option
I'm uploading this "fat jar" file with Hue
Minimum working example I managed to run:
public String evaluate(String i) {
return "hello" + i.toString();
}
I guess the problem lies somewhere around that library (downloaded from https://udger.com) I'm using, but I have no idea where.
Any suggestions?
Thanks, Michal

It could be a few things. Best thing is to check the logs, but here's a list of a few quick things you can check in a minute.
jar does not contain all dependencies. I am not sure how eclipse builds a runnable jar, but it may not include all dependencies. You can do
jar tf your-udf-jar.jar
to see what was included. You should see stuff from com.decibel.uasparser. If not, you have to build the jar with the appropriate dependencies (usually you do that using maven).
Different version of the JVM. If you compile with JDK 8 and the cluster runs JDK 7, it will also fail.
Hive version. Sometimes the Hive APIs change slightly, enough to be incompatible. Probably not the case here, but make sure to compile the UDF against the same versions of Hadoop and Hive that you have in the cluster.
You should always check whether info is null after the call to parse().
It looks like the library uses a key, meaning that it actually gets its data from an online service (udger.com), so it may not work without an actual key. Even more important, the library updates online, and evaluate() constructs a new OnlineUpdater for every record. Judging from the code, that means one update thread per record. You should do that only once, in the constructor, like this:
public class MyUDF extends UDF {
UASparser parser = new UASparser();
public MyUDF() {
super();
String key = "PUT YOUR KEY HERE";
// update only once, when the UDF is instantiated
OnlineUpdater update = new OnlineUpdater(parser, key);
}
public String evaluate(String i) {
UserAgentInfo info = parser.parse(i);
if(info!=null) return info.getDeviceType();
// you want it to return null if it's unparseable
// otherwise one bad record will stop your processing
// with an exception
else return null;
}
}
But to know for sure, you have to look at the logs: the YARN logs, and also the Hive logs on the machine you're submitting the job from (probably in /var/log/hive, but it depends on your installation).
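The `jar tf` check from the list above can also be done from Java, which is handy if you want to verify the fat jar as part of a build. This is only a minimal sketch: the class name JarContents and its method are made up for illustration, and it checks nothing beyond the presence of .class entries under a package.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.jar.JarFile;

/** Hypothetical helper: checks whether a jar bundles classes from a given package. */
final class JarContents {

    /**
     * Returns true if the jar has at least one .class entry under the given
     * package, e.g. "com.decibel.uasparser".
     */
    static boolean containsPackage(Path jar, String packageName) throws IOException {
        // Jar entries use '/' separators, so convert the package name first.
        String prefix = packageName.replace('.', '/') + "/";
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            return jarFile.stream()
                    .anyMatch(e -> e.getName().startsWith(prefix)
                            && e.getName().endsWith(".class"));
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the UDF jar, args[1]: package to look for
        System.out.println(containsPackage(Path.of(args[0]), args[1]));
    }
}
```

If this prints false for com.decibel.uasparser, the export did not bundle the library and the UDF cannot work on the cluster regardless of the other points.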

A problem like this can probably be solved with the following steps:
Override the method UDF.getRequiredJars() so that it returns a list of HDFS file paths, determined by where you put the xxx_lib folder (see the next step) in HDFS. Note that the list must contain each jar's exact, full HDFS path string, such as hdfs://yourcluster/some_path/xxx_lib/some.jar
Export your UDF code with the "Runnable JAR file" export wizard (choose "Copy required libraries into a sub-folder next to the generated JAR"). This step produces xxx.jar and a lib folder xxx_lib next to xxx.jar.
Put xxx.jar and the xxx_lib folder into HDFS at the locations your code from the first step expects.
Create the UDF using: add jar ${the-xxx.jar-hdfs-path}; create function your-function as ${qualified name of udf class};
Try it. I tested this and it works.

Related

java.lang.ClassNotFoundException when running program on spark cluster

I have a Spark Scala program which loads a jar I wrote in Java. From that jar a static function is called, which tries to read a serialized object from a file (Pattern.class), but throws a java.lang.ClassNotFoundException.
Running the spark program locally works, but on the cluster workers it doesn't. It's especially weird because before I try to read from the file, I instantiate a Pattern object and there are no problems.
I am sure that the Pattern objects I wrote in the file are the same as the Pattern objects I am trying to read.
I've checked the jar in the slave machine and the Pattern class is there.
Does anyone have any idea what the problem might be? I can add more detail if needed.
This is the Pattern class
public class Pattern implements Serializable {
private static final long serialVersionUID = 588249593084959064L;
public static enum RelationPatternType {NONE, LEFT, RIGHT, BOTH};
RelationPatternType type;
String entity;
String pattern;
List<Token> tokens;
Relation relation = null;
public Pattern(RelationPatternType type, String entity, List<Token> tokens, Relation relation) {
this.type = type;
this.entity = entity;
this.tokens = tokens;
this.relation = relation;
if (this.tokens != null)
this.pattern = StringUtils.join(" ", this.tokens.toString());
}
}
I am reading the file from S3 the following way:
AmazonS3 s3Client = new AmazonS3Client(credentials);
S3Object confidentPatternsObject = s3Client.getObject(new GetObjectRequest("xxx","confidentPatterns"));
objectData = confidentPatternsObject.getObjectContent();
ois = new ObjectInputStream(objectData);
confidentPatterns = (Map<Pattern, Tuple2<Integer, Integer>>) ois.readObject();
LE: I checked the classpath at runtime and the path to the jar was not there. I added it for the executors but I still have the same problem. I don't think that was it, as I have the Pattern class inside the jar that is calling the readObject function.
I would suggest adding a method like this to print the classpath resources before the call, to make sure everything is fine from the caller's point of view:
public static void printClassPathResources() {
// Note: this cast only works on Java 8 and earlier; from Java 9 on, the
// system class loader is no longer a URLClassLoader.
final ClassLoader cl = ClassLoader.getSystemClassLoader();
final URL[] urls = ((URLClassLoader) cl).getURLs();
LOG.info("Print All Class path resources under currently running class");
for (final URL url : urls) {
LOG.info(url.getFile());
}
}
This is a sample configuration for Spark 1.5:
--conf "spark.driver.extraLibraryPath=$HADOOP_HOME/*:$HBASE_HOME/*:$HADOOP_HOME/lib/*:$HBASE_HOME/lib/htrace-core-3.1.0-incubating.jar:$HDFS_PATH/*:$SOLR_HOME/*:$SOLR_HOME/lib/*" \
--conf "spark.executor.extraLibraryPath=$HADOOP_HOME/*" \
--conf "spark.executor.extraClassPath=$(echo /your directory of jars/*.jar | tr ' ' ',')"
As described in this troubleshooting guide, under "Class Not Found: Classpath Issues":
Another common issue is seeing class not defined when compiling Spark programs. This is a slightly confusing topic because Spark is actually running several JVMs when it executes your process, and the path must be correct for each of them. Usually this comes down to correctly passing around dependencies to the executors. Make sure that when running you include a fat jar containing all of your dependencies (I recommend using sbt assembly) in the SparkConf object used to make your Spark Context. You should end up writing a line like this in your Spark application:
val conf = new SparkConf().setAppName(appName).setJars(Seq(System.getProperty("user.dir") + "/target/scala-2.10/sparktest.jar"))
This should fix the vast majority of class not found problems. Another option is to place your dependencies on the default classpath on all of the worker nodes in the cluster. This way you won’t have to pass around a large jar.
The only other major issue with class not found issues stems from different versions of the libraries in use. For example if you don’t use identical versions of the common libraries in your application and in the spark server you will end up with classpath issues. This can occur when you compile against one version of a library (like Spark 1.1.0) and then attempt to run against a cluster with a different or out of date version (like Spark 0.9.2). Make sure that you are matching your library versions to whatever is being loaded onto executor classpaths. A common example of this would be compiling against an alpha build of the Spark Cassandra Connector then attempting to run using classpath references to an older version.
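One further possibility, which the question does not confirm: when a class can be instantiated normally but readObject() still throws ClassNotFoundException, the ObjectInputStream may be resolving classes through a different class loader than the one that loaded the application jar. A common workaround is to subclass ObjectInputStream and resolve classes through the thread's context class loader. The sketch below assumes that the context class loader on the executor can see the application classes. The class name is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;

/** Hypothetical helper: an ObjectInputStream that prefers the context class loader. */
final class ContextClassLoaderObjectInputStream extends ObjectInputStream {

    ContextClassLoaderObjectInputStream(InputStream in) throws IOException {
        super(in);
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc)
            throws IOException, ClassNotFoundException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            return super.resolveClass(desc);
        }
        try {
            // Look the class up without initializing it.
            return Class.forName(desc.getName(), false, loader);
        } catch (ClassNotFoundException e) {
            // Fall back to the default lookup, which also handles primitive types.
            return super.resolveClass(desc);
        }
    }

    /** Serializes a value and reads it back through the stream above. */
    static Object roundTrip(Object value) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in = new ContextClassLoaderObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }
}
```

In the code from the question this would mean wrapping objectData in the custom stream instead of a plain ObjectInputStream. If the exception persists, the classpath checks above are still the first thing to verify.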

Query exist-db from Java

I want to query eXist-db from Java. I know there are samples, but where can I get the necessary packages to run the examples?
In the samples:
import javax.xml.transform.OutputKeys;
import org.exist.storage.serializers.EXistOutputKeys;
import org.exist.xmldb.EXistResource;
import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.modules.XMLResource;
Where can I get these?
And what is the standard connection string for eXist-db (port number etc.)?
And yes, I have tried to read the eXist-db documentation, but it is not really understandable for beginners; it is confusing.
All I want to do is write a Java class in Eclipse that can connect to an eXist-db instance and query an XML document.
Your question is badly written, and I think you are really not explaining what you are trying to do very well.
If you want the JAR files as dependencies directly for some project then you can download eXist and get them from there. Already covered several times here, which JAR files you need as dependencies is documented on the eXist website and links to that documentation have already been posted in this thread.
I wanted to add, that if you did want a series of simple Java examples that use Maven to resolve the dependencies (which takes away the hard work), then when we wrote the eXist book we provided just that in the Integration Chapter. It shows you how to use each of eXist's different APIs from Java for storing/querying/updating etc. You can find the code from that book chapter here: https://github.com/eXist-book/book-code/tree/master/chapters/integration. Included are the Maven project files to resolve all the dependencies and build and run the examples.
If the code is not enough for you, you might also want to consider purchasing the book and reading the Integration Chapter carefully, that should answer all of your questions.
I ended up with a Maven project and imported some missing jars (like ws.commons etc.) by manually installing them into Maven.
I copied the missing jars from the eXist-db installation path on my local system.
Then I got it to work.
From http://exist-db.org/exist/apps/doc/devguide_xmldb.xml:
There are several XML:DB examples provided in eXist's samples directory. To start an example, use the start.jar jar file and pass the name of the example class as the first parameter, for instance:
java -jar start.jar org.exist.examples.xmldb.Retrieve [- other options]
Example: Retrieving a Document with XML:DB
import org.xmldb.api.base.*;
import org.xmldb.api.modules.*;
import org.xmldb.api.*;
import javax.xml.transform.OutputKeys;
import org.exist.xmldb.EXistResource;
public class RetrieveExample {
private static String URI = "xmldb:exist://localhost:8080/exist/xmlrpc";
/**
* args[0] Should be the name of the collection to access
* args[1] Should be the name of the resource to read from the collection
*/
public static void main(String args[]) throws Exception {
final String driver = "org.exist.xmldb.DatabaseImpl";
// initialize database driver
Class cl = Class.forName(driver);
Database database = (Database) cl.newInstance();
database.setProperty("create-database", "true");
DatabaseManager.registerDatabase(database);
Collection col = null;
XMLResource res = null;
try {
// get the collection
col = DatabaseManager.getCollection(URI + args[0]);
col.setProperty(OutputKeys.INDENT, "no");
res = (XMLResource)col.getResource(args[1]);
if(res == null) {
System.out.println("document not found!");
} else {
System.out.println(res.getContent());
}
} finally {
//dont forget to clean up!
if(res != null) {
try { ((EXistResource)res).freeResources(); } catch(XMLDBException xe) {xe.printStackTrace();}
}
if(col != null) {
try { col.close(); } catch(XMLDBException xe) {xe.printStackTrace();}
}
}
}
}
On the page http://exist-db.org/exist/apps/doc/deployment.xml#D2.2.6 a list of dependencies is included; unfortunately there is no link to this page on http://exist-db.org/exist/apps/doc/devguide_xmldb.xml (should be added);
The latest xmldb.jar documentation can be found on http://xmldb.exist-db.org/
All the jar files can be retrieved by installing eXist-db from the installer jar; the files are all in EXIST_HOME/lib/core
If you work with a maven project, try adding this to your pom.xml
<dependency>
<groupId>xmldb</groupId>
<artifactId>xmldb-api</artifactId>
<version>20021118</version>
</dependency>
Be aware that the release date is 2002.
Otherwise you can query exist-db via XML-RPC
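Another route that avoids the client jars altogether is eXist's REST interface, which accepts an XPath/XQuery expression in the _query parameter. The sketch below makes several assumptions that are not from this thread: the server runs on localhost:8080 with the REST servlet at /exist/rest, a collection /db/books exists, and the data is readable without authentication. The class name is made up for illustration, and Java 11 or later is required for java.net.http.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: runs a query against eXist-db over its REST interface. */
final class ExistRestQuery {

    /** Builds the query URI; baseUrl and collection depend on the local install. */
    static URI queryUri(String baseUrl, String collection, String xpath, int howMany) {
        String encoded = URLEncoder.encode(xpath, StandardCharsets.UTF_8);
        return URI.create(baseUrl + collection + "?_query=" + encoded + "&_howmany=" + howMany);
    }

    /** Sends the query and returns the raw XML response body. */
    static String run(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Assumed endpoint and collection; adjust both to your installation.
        URI uri = queryUri("http://localhost:8080/exist/rest", "/db/books", "//title", 10);
        System.out.println(run(uri));
    }
}
```

This trades the typed XML:DB API for plain HTTP, so the result comes back as an XML string that you parse yourself.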

Create Function Command for Derby Database

I am having trouble using the CREATE FUNCTION command in a Derby database.
To start with I tried
CREATE FUNCTION TO_DEGREES(RADIANS DOUBLE) RETURNS DOUBLE
PARAMETER STYLE JAVA NO SQL LANGUAGE JAVA
EXTERNAL NAME 'java.lang.Math.toDegrees'
and then
SELECT TO_DEGREES(3.142), BILLNO FROM SALEBILL
This works absolutely fine.
Now I tried making my own function like this :
package SQLUtils;
public final class TestClass
{
public TestClass()
{
}
public static int addNos(int val1, int val2)
{
return(val1+val2);
}
}
followed by
CREATE FUNCTION addno(no1 int, no2 int) RETURNS int
PARAMETER STYLE JAVA NO SQL LANGUAGE JAVA
EXTERNAL NAME 'SQLUtils.TestClass.addNos'
and then
SELECT addno(3,4), BILLNO FROM SALEBILL
This gives an Exception
Error code -1, SQL state 42X51: The class 'SQLUtils.TestClass' does not exist or is inaccessible. This can happen if the class is not public.
Error code 99999, SQL state XJ001: Java exception: 'SQLUtils.TestClass: java.lang.ClassNotFoundException'.
Line 6, column 1
I have made a jar file of the project containing the above Class. I may be wrong but the conclusion that I can draw from this is that this jar file needs to be in some classpath. But in which classpath and how to add it to a classpath, I am not able to understand.
I tried copying the jar file to jdk\lib folder, jre\lib folder, jdk\jre\lib folder but to no avail.
Can someone please point me in the right direction ?
I am using NetBeans IDE 7.1.2, jdk 1.7.0_09, Derby version 10.8.1.2 in Network mode. The applications and data are on a Server. I access them from Netbeans installed on client computer.
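A note on where the class has to live: Derby resolves EXTERNAL NAME inside the JVM that runs the database engine. In network mode that is the Network Server's JVM on the server machine, not the NetBeans client, and the jdk\lib and jre\lib folders are not on the application classpath. The jar therefore has to be on the classpath the server is started with. Derby also documents storing jars in the database itself with SQLJ.INSTALL_JAR and the derby.database.classpath property. The sketch below is a hypothetical helper, not part of Derby. It checks whether the JVM it runs in can see a public static method on a public class, which is what Derby needs to find.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

/** Hypothetical helper: checks that a class and static method are visible to this JVM. */
final class DerbyFunctionCheck {

    /**
     * Returns true if className can be loaded and has a public static method
     * with the given name and parameter types on a public class.
     */
    static boolean isResolvable(String className, String methodName, Class<?>... paramTypes) {
        try {
            Class<?> cls = Class.forName(className);
            Method m = cls.getMethod(methodName, paramTypes); // finds public methods only
            return Modifier.isPublic(cls.getModifiers()) && Modifier.isStatic(m.getModifiers());
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // java.lang.Math is always visible, which is why TO_DEGREES worked.
        System.out.println(isResolvable("java.lang.Math", "toDegrees", double.class));
        // This one depends on whether the jar is on this JVM's classpath.
        System.out.println(isResolvable("SQLUtils.TestClass", "addNos", int.class, int.class));
    }
}
```

Running this with the same classpath the Derby server uses shows whether the server can load SQLUtils.TestClass at all.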

Calling Java from MATLAB?

I want Matlab program to call a java file, preferably with an example.
There are three cases to consider.
Java built-in libraries.
That is, anything described here. These items can simply be called directly. For example:
map = java.util.HashMap;
map.put(1,10);
map.put(2,30);
map.get(1) %returns 10
The only complication is the mapping Matlab performs between Matlab data types and Java data types. These mappings are described here (Matlab to Java) and here (Java to Matlab). (tl; dr: usually the mappings are as you would expect)
Precompiled *.jar files
You first need to add these to Matlab's java class path. You can do this dynamically (that is, per-Matlab session, with no required Matlab state), as follows:
javaaddpath('c:\full\path\to\compiledjarfile.jar')
You can also add these statically by editing the classpath.txt file. For more information use docsearch java class path.
Precompiled *.class files.
These are similar to *.jar files, except you need to add the directory containing the class files, rather than the class files themselves. For example:
javaaddpath('c:\full\path\to\directory\containing\class\files\')
%NOT THIS: javaaddpath('c:\full\path\to\directory\containing\class\files\classname.class')
Ok, I'll try to give a mini-example here. Either use the java functions right from the Matlab window as zellus suggests, or, if need permits, create your own java class. Here's an example:
package testMatlabInterface;
public class TestFunction
{
private double value;
public TestFunction()
{
value = 0;
}
public double Add(double v)
{
value += v;
return value;
}
}
Then turn it into a jar file. Assuming you put the file in a folder called testMatlabInterface, run this command at the command line:
jar cvf testMatlab.jar testMatlabInterface
Then, in Matlab, navigate to the directory where your testMatlab.jar file is located and run the command, import testMatlabInterface.* to import all the classes in the testMatlabInterface package. Then you may use the class like so:
>> methodsview testMatlabInterface.TestFunction
>> me = testMatlabInterface.TestFunction()
me =
testMatlabInterface.TestFunction@7e413c
>> me.Add(10)
ans =
10
>> me.Add(10)
ans =
20
>> me.Add(10)
ans =
30
Let me know if I can be of further assistance.

can I load user packages into eclipse to run at start up and how?

I am new to java and to the eclipse IDE.
I am running Eclipse
Eclipse SDK
Version: 3.7.1
Build id: M20110909-1335
On a windows Vista machine.
I am trying to learn from the book Thinking in Java vol4.
The author uses his own packages to reduce typing. However, the author did not use Eclipse, and this is where the problem comes in.
This is an example of the code in the book.
import java.util.*;
import static net.mindview.util.print.*;
public class HelloWorld {
public static void main(String[] args) {
System.out.println("hello world");
print("this does not work");
}
}
These are the contents of Print.java:
//: net/mindview/util/Print.java
// Print methods that can be used without
// qualifiers, using Java SE5 static imports:
package net.mindview.util;
import java.io.*;
public class Print {
// Print with a newline:
public static void print(Object obj) {
System.out.println(obj);
}
// Print a newline by itself:
public static void print() {
System.out.println();
}
// Print with no line break:
public static void printnb(Object obj) {
System.out.print(obj);
}
// The new Java SE5 printf() (from C):
public static PrintStream
printf(String format, Object... args) {
return System.out.printf(format, args);
}
} ///:~
The error I get most often is on the statement
import static net.mindview.util.print.*;
On this statement the Eclipse IDE says it cannot resolve net,
and also on
print("this does not work");
the Eclipse IDE says that the method print() does not exist for the class HelloWorld.
I have been trying to get these to work, but with only limited success. The author uses another 32 of these packages through the rest of the book.
I have tried to add the directory to the classpath, but that seems to work only if you are using the JDK compiler. I have tried to add them as libraries, and I have tried importing them into a package in a source file in the project. I have tried a few other things but can't remember them all now.
I have been able to make one of the files work: the Print.java file I gave the listing for in this message. I did that by creating a new source folder, then making a new package in that folder, then importing the Print.java file into the package.
But the next time I tried the same thing, it did not work for me.
What I need is a way to have Eclipse load all these .java files at startup, so that when I need them for the exercises in the book they will be there and work for me, or just an easy way to make them work every time.
I know I am not the only one who has had this problem; I have seen other questions about it in Google searches, and they were also asking about the Thinking in Java book.
I have searched this site and others and am just not having any luck.
Any help or suggestions are welcome and very much appreciated.
thank you
OK, I have tried to get this working as you said. I started a new project and removed static from the import statement. I then created a new source folder and a new package in that source folder. Then I imported from the file system and selected the net.mindview.util folder.
Now the import statement no longer gives me an error, but the print statement does; the only way to make the print statement work is to use its fully qualified name. Here is the code:
import net.mindview.util.*;
public class Hello2 {
public static void main(String[] args) {
Hello2 test = new Hello2();
System.out.println();
print("this dooes not work");
net.mindview.util.Print.print("this stinks");
}
}
The Error on the print statement is:
The method print(String) is undefined for the type Hello2
and if I try to run it the error I get is:
Exception in thread "main" java.lang.Error: Unresolved compilation problem:
The method print(String) is undefined for the type Hello2
at Hello2.main(Hello2.java:6)
The statement net.mindview.util.Print.print("this stinks") is the fully qualified print statement, and it does not throw an error, but it totally defeats the purpose of the Print.java file.
If you have any questions, please ask; I'll get back to you as soon as I can.
I've had similar issues. I solved it by following the steps below:
Click File->New->Java Project. Fill in UtilBuild for the Project Name. Choose the option "Use project folder as root" and click 'Finish'.
Right-click on UtilBuild in the PackageExplorer window and click New->package. For the Package Name, fill in net.mindview.util
Navigate within the unzipped Thinking In Java (TIJ) folder to TIJ->net\mindview\util. Here you will find all the source code (.java) files for util.
Select all the files in the net\mindview\util folder and drag them to the net.mindview.util package under UtilBuild in Eclipse. Choose the 'Copy Files' option and hit 'OK'.
You will probably already have the 'Build Automatically' option checked. If not, go to Project and click 'Build Automatically'. This will create the .class files from the .java source files.
In Eclipse, right-click on the project you were working on (the one where you couldn't get that blasted print() method to work!) Click Properties and Java Build Path->Libraries. Click 'Add Class Folder...' check the box for UtilBuild (the default location for the .class files).
I think the confusion here arises due to CLASSPATH. If you use Eclipse to build and run your code then Eclipse manages your CLASSPATH. (You don't have to manually edit CLASSPATH in the 'Environment Variables' part of your computer properties, and doing so changes nothing as far as Eclipse Build and Run are concerned.)
In order to call code that exists outside your current project (I will name this 'outside code' for convenience) you need to satisfy three things:
A. You need to have the .class files for that code (as .class files or inside a JAR)
B. You need to indicate in your source code where to look for the 'outside code'
C. You need to indicate where to start looking for the 'outside code'
In order to satisfy these requirements, in this example we:
A. Build the project UtilBuild which creates the .class files we need.
B. Add the statement import static net.mindview.util.Print.*; in our code
C. Add the Class Folder library in Eclipse (Java Build Path->Libraries).
You can investigate the effect of Step C by examining the .classpath file that lives directly in your project folder. If you open it in notepad you will see a line similar to the following:
<classpathentry kind="lib" path="/UtilBuild"/>
You should combine this with your import statement to understand where the compiler will look for the .class file. Combining path="/UtilBuild" and import static net.mindview.util.Print.*; tells us that the compiler will look for the class file in:
UtilBuild/net/mindview/util
and that it will import every static member of the Print class compiled from Print.java (Print.*).
NOTE:
There is no problem with the keyword static in the statement
import static net.mindview.util.Print.*;
static here just means that you don't have to specify the class name Print, just the methods that you want to call. If we omit the keyword static, we import the class itself rather than its static members, and then we need to qualify the print() method with the class it belongs to:
import net.mindview.util.Print;
//...
Print.print("Hello");
which is slightly more verbose than what is achieved with the static import.
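The difference between the two import styles can be tried out without the book's code. The sketch below uses JDK classes as stand-ins for net.mindview.util.Print, on the assumption that the book's classes are not yet on your build path. The class and method names are made up for illustration.

```java
import static java.lang.Math.abs; // static import: the method name abs is usable on its own
import java.util.Objects;         // regular import: only the class name Objects is usable

/** Hypothetical demo class contrasting static and regular imports. */
final class ImportStyles {

    /** Unqualified call, possible only because of the static import above. */
    static int viaStaticImport(int x) {
        return abs(x);
    }

    /** The same method called through its class name; no static import needed. */
    static int viaClassName(int x) {
        return Math.abs(x);
    }

    /** With a regular import the method must still be qualified by its class. */
    static String viaRegularImport(Object o) {
        return Objects.toString(o);
    }

    public static void main(String[] args) {
        System.out.println(viaStaticImport(-3));
        System.out.println(viaClassName(-3));
        System.out.println(viaRegularImport(42));
    }
}
```

Note that names in a static import are case-sensitive, so the class has to be written exactly as it is declared (Print, not print).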
OPINION:
I think most people new to Java will use Eclipse, at least initially. The Thinking in Java book seems to assume you will do things via the command line (hence its guidance to edit environment variables in order to update CLASSPATH). This, combined with using the util folder code from very early in the book, is I think a source of confusion for new learners of the language. I would love to see all the source code organised into an Eclipse project and available for download. Short of that, it would be a nice touch to include the .class files in just the 'net/mindview/util' folder so that things would be a little easier.
You should import the package static net.mindview.util, not static net.mindview.util.Print,
and you should extend the class Print to use its methods.
You should remove the static keyword from your import declaration, so that this: import static net.mindview.util.print.*; becomes this: import net.mindview.util.print.*;
If that also does not work, I am assuming you did the following:
Create your own project;
Start copying code directly from the book.
The problem seems to be that this: package net.mindview.util; must match the folder structure in your src folder. So, if in your src folder you create a new package named net.mindview.util and place your Print class in it, you should be able to get it working.
For future reference, you should always make sure that the package declaration at the top of your Java class matches the package in which the class resides.
EDIT:
I have seen your edit, and the problem seems to have a simple solution. You declare a static method named print(). In Java, static methods are accessed as ClassName.methodName() unless they have been brought into scope with a static import. This: print("this dooes not work"); will not work because you do not have a method named print which takes a String argument in your Hello2 class, and the plain import net.mindview.util.*; imports only class names. In Java, when you write something like methodName(arg1...), the compiler looks for methods with that signature (method name + parameters) in the class in which you are making the call and in any classes that your calling class extends.
However, as you correctly noted, this will work net.mindview.util.Print.print("this stinks");. This is because you are accessing the static method in the proper way, meaning ClassName.methodName();.
So in short, to solve your problem, you need to either:
Create a method named print which takes a string argument in your Hello2 class;
Call your print method like so: Print.print("this stinks");
Either of these two solutions should work for you.
In my case I downloaded and unzipped the file TIJ4Example-master.zip into the Eclipse workspace folder. The three packages net.mindview.atunit, net.mindview.simple and net.mindview.util sit inside the project,
and the Java programs run with no problems (for example /TIJ4Example/src/exercises/E07_CoinFlipping.java).
