How to use Sqoop in a Java program?

I know how to use Sqoop through the command line,
but I don't know how to invoke a Sqoop command from a Java program.
Can anyone share some sample code?

You can run Sqoop from inside your Java code by including the Sqoop jar in your classpath and calling the Sqoop.runTool() method. You have to create the required parameters for Sqoop programmatically, as if they were command-line arguments (e.g. --connect etc.).
Please pay attention to the following:
Make sure that the Sqoop tool name (e.g. import, export, etc.) is the first parameter.
Pay attention to classpath ordering - the execution might fail because Sqoop requires version X of a library while you use a different version. Ensure that the libraries Sqoop requires are not overshadowed by your own dependencies. I've encountered such a problem with commons-io (Sqoop requires v1.4) and got a NoSuchMethodError since I was using commons-io v1.2.
Each argument needs to be in a separate array element. For example, "--connect jdbc:mysql:..." should be passed as two separate elements in the array, not one.
The Sqoop parser knows how to accept double-quoted parameters, so use double quotes if you need to (I suggest always). The only exception is the --fields-terminated-by parameter, which expects a single character, so don't double-quote it.
I'd suggest splitting the command-line argument creation logic from the actual execution so your logic can be tested properly without actually running the tool.
It is better to use the --hadoop-home parameter in order to avoid depending on the environment.
The advantage of Sqoop.runTool() over Sqoop.main() is that runTool() returns the exit code of the execution.
Hope that helps.
final int ret = Sqoop.runTool(new String[] { ... });
if (ret != 0) {
    throw new RuntimeException("Sqoop failed - return code " + Integer.toString(ret));
}
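For illustration, a minimal sketch of building the argument array along the lines described above (the connection string, credentials, table name, and target directory are placeholders; depending on your Sqoop version the Sqoop class lives in org.apache.sqoop or com.cloudera.sqoop):
import org.apache.sqoop.Sqoop;

public class SqoopToolRunner {
    public static void main(String[] args) {
        // Tool name first, then each flag and its value as separate array elements
        String[] sqoopArgs = new String[] {
                "import",
                "--connect", "jdbc:mysql://HOSTNAME:3306/DATABASE_NAME",
                "--username", "USERNAME",
                "--password", "PASSWORD",
                "--table", "TABLE_NAME",
                "--target-dir", "/user/hadoop/TABLE_NAME"
        };
        final int ret = Sqoop.runTool(sqoopArgs);
        if (ret != 0) {
            throw new RuntimeException("Sqoop failed - return code " + ret);
        }
    }
}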
RL

Below is sample code for using Sqoop in a Java program to import data from MySQL to HDFS/HBase. Make sure you have the Sqoop jar in your classpath:
import com.cloudera.sqoop.SqoopOptions;
import com.cloudera.sqoop.tool.ImportTool;

SqoopOptions options = new SqoopOptions();
options.setConnectString("jdbc:mysql://HOSTNAME:PORT/DATABASE_NAME");
//options.setTableName("TABLE_NAME");
//options.setWhereClause("id>10"); // this where clause works when importing a whole table, i.e. when setTableName() is used
options.setUsername("USERNAME");
options.setPassword("PASSWORD");
//options.setDirectMode(true); // make sure direct mode is off when importing data to HBase
options.setNumMappers(8); // default value is 4
options.setSqlQuery("SELECT * FROM user_logs WHERE $CONDITIONS limit 10");
options.setSplitByCol("log_id");

// HBase options
options.setHBaseTable("HBASE_TABLE_NAME");
options.setHBaseColFamily("colFamily");
options.setCreateHBaseTable(true); // create the HBase table if it does not exist
options.setHBaseRowKeyColumn("log_id");

int ret = new ImportTool().run(options);
As Harel suggested, you can use the return value of the run() method for error handling. Hope this helps.

There is a trick which worked out pretty well for me: you can execute the Sqoop command directly over SSH. All you have to use is an SSH Java library.
This keeps your Java program independent of the Sqoop installation: you just need an SSH library on the client side and Sqoop installed on the remote system where you want to perform the import. Connect to that system via SSH and execute the command, which imports the data from MySQL into Hive.
Follow these steps:
Download the sshxcute Java library: https://code.google.com/p/sshxcute/
Add it to the build path of your Java project, which contains the following Java code:
import net.neoremind.sshxcute.core.SSHExec;
import net.neoremind.sshxcute.core.ConnBean;
import net.neoremind.sshxcute.task.CustomTask;
import net.neoremind.sshxcute.task.impl.ExecCommand;

public class TestSSH {

    public static void main(String[] args) throws Exception {
        // Initialize a ConnBean object; the parameter list is IP, username, password
        ConnBean cb = new ConnBean("192.168.56.102", "root", "hadoop");
        // Pass the ConnBean instance to the static SSHExec.getInstance(ConnBean) method to retrieve a singleton SSHExec instance
        SSHExec ssh = SSHExec.getInstance(cb);
        // Connect to the server
        ssh.connect();
        // Print the client IP from which you connected to the SSH server on the Horton sandbox
        CustomTask sampleTask1 = new ExecCommand("echo $SSH_CLIENT");
        System.out.println(ssh.exec(sampleTask1));
        // Run the Sqoop import on the remote machine
        CustomTask sampleTask2 = new ExecCommand("sqoop import --connect jdbc:mysql://192.168.56.101:3316/mysql_db_name --username=mysql_user --password=mysql_pwd --table mysql_table_name --hive-import -m 1 -- --schema default");
        ssh.exec(sampleTask2);
        ssh.disconnect();
    }
}

If you know the location of the executable and the command-line arguments, you can use a ProcessBuilder. This can then be run as a separate Process that Java can monitor for completion and exit code.
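As a rough sketch of this approach (the path to the sqoop executable, the connection details, and the working directory below are placeholders, not the actual layout of any particular installation):
import java.io.File;
import java.io.IOException;

public class SqoopProcessRunner {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Each argument is a separate element, the same rule as for Sqoop.runTool()
        ProcessBuilder pb = new ProcessBuilder(
                "/usr/local/sqoop/bin/sqoop", "import",
                "--connect", "jdbc:mysql://HOSTNAME:3306/DATABASE_NAME",
                "--username", "USERNAME",
                "--password", "PASSWORD",
                "--table", "TABLE_NAME");
        pb.directory(new File("/tmp"));   // working directory for the child process (placeholder)
        pb.inheritIO();                   // forward the child's output to this JVM's console
        Process process = pb.start();
        int exitCode = process.waitFor(); // block until Sqoop finishes, then read the return code
        if (exitCode != 0) {
            throw new RuntimeException("Sqoop failed - exit code " + exitCode);
        }
    }
}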

Please follow the code given by Vikas; it worked for me. Include these jar files in the classpath and import these packages:
import com.cloudera.sqoop.SqoopOptions;
import com.cloudera.sqoop.tool.ImportTool;
Referenced libraries:
sqoop-1.4.4.jar - /sqoop
ojdbc6.jar - /sqoop/lib (for Oracle)
commons-logging-1.1.1.jar - hadoop/lib
hadoop-core-1.2.1.jar - /hadoop
commons-cli-1.2.jar - hadoop/lib
commons-io-2.1.jar - hadoop/lib
commons-configuration-1.6.jar - hadoop/lib
commons-lang-2.4.jar - hadoop/lib
jackson-core-asl-1.8.8.jar - hadoop/lib
jackson-mapper-asl-1.8.8.jar - hadoop/lib
commons-httpclient-3.0.1.jar - hadoop/lib
JRE system library:
1. resources.jar - jdk/jre/lib
2. rt.jar - jdk/jre/lib
3. jsse.jar - jdk/jre/lib
4. jce.jar - jdk/jre/lib
5. charsets.jar - jdk/jre/lib
6. jfr.jar - jdk/jre/lib
7. dnsns.jar - jdk/jre/lib/ext
8. sunec.jar - jdk/jre/lib/ext
9. zipfs.jar - jdk/jre/lib/ext
10. sunpkcs11.jar - jdk/jre/lib/ext
11. localedata.jar - jdk/jre/lib/ext
12. sunjce_provider.jar - jdk/jre/lib/ext
Sometimes you get an error if your Eclipse project is using JDK 1.6 and the libraries you add were built with JDK 1.7; in that case, configure the JRE while creating the project in Eclipse.
Vikas, if I want to put the imported files into Hive, should I use options.parameter("--hive-import")?

Related

sun.misc.InvalidJarIndexException: Invalid index when importing from com.* package in Jython standalone

I'm getting an InvalidJarIndexException when trying to use the Jython standalone JAR inside my application, and I'm unable to figure out what I'm doing wrong.
As soon as I attempt to execute a Python script with an import statement for any Java class from a package starting with "com.", e.g.: "com.foo.Bar", the following exception is thrown (truncated):
Traceback (most recent call last):
File "<string>", line 1, in <module>
sun.misc.InvalidJarIndexException: Invalid index
at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1152)
at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1062)
at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1032)
at sun.misc.URLClassPath.findResource(URLClassPath.java:225)
at java.net.URLClassLoader$2.run(URLClassLoader.java:572)
at java.net.URLClassLoader$2.run(URLClassLoader.java:570)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findResource(URLClassLoader.java:569)
at java.lang.ClassLoader.getResource(ClassLoader.java:1089)
at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:233)
at org.python.core.ClasspathPyImporter.tryClassLoader(ClasspathPyImporter.java:221)
at org.python.core.ClasspathPyImporter.makeEntry(ClasspathPyImporter.java:208)
at org.python.core.ClasspathPyImporter.makeEntry(ClasspathPyImporter.java:18)
at org.python.core.util.importer.getModuleInfo(importer.java:174)
at org.python.core.util.importer.importer_find_module(importer.java:98)
at org.python.core.ClasspathPyImporter.ClasspathPyImporter_find_module(ClasspathPyImporter.java:134)
at org.python.core.ClasspathPyImporter$ClasspathPyImporter_find_module_exposer.__call__(Unknown Source)
at org.python.core.PyBuiltinMethodNarrow.__call__(PyBuiltinMethodNarrow.java:48)
at org.python.core.imp.find_module(imp.java:761)
at org.python.core.imp.import_next(imp.java:1158)
at org.python.core.imp.import_module_level(imp.java:1350)
at org.python.core.imp.importName(imp.java:1528)
at org.python.core.ImportFunction.__call__(__builtin__.java:1285)
at org.python.core.PyObject.__call__(PyObject.java:433)
at org.python.core.__builtin__.__import__(__builtin__.java:1232)
at org.python.core.imp.importOneAs(imp.java:1564)
at org.python.pycode._pyx0.f$0(<string>:1)
at org.python.pycode._pyx0.call_function(<string>)
at org.python.core.PyTableCode.call(PyTableCode.java:173)
at org.python.core.PyCode.call(PyCode.java:18)
at org.python.core.Py.runCode(Py.java:1687)
at org.python.core.Py.exec(Py.java:1731)
at org.python.util.PythonInterpreter.exec(PythonInterpreter.java:268)
at com.so.Script.execute(Script.java:20)
Here's all I'm doing in my code (I'm actually invoking this via a Swing action on a JMenuItem calling new Script().execute(), which is most likely irrelevant):
package com.so;

import org.python.core.PyDictionary;
import org.python.core.PySystemState;
import org.python.util.PythonInterpreter;

public class Script {

    public Script() {
    }

    public void execute() {
        PyDictionary table = new PyDictionary();
        PySystemState state = new PySystemState();
        PythonInterpreter interp = new PythonInterpreter(table, state);
        String script;
        script = "" +
                "import com.foo.Bar as Bar\n" +
                "";
        interp.exec(script);
    }
}
It doesn't even matter that there is no such package/class in my classpath. What baffles me the most is that when I, thinking this had to be classpath related, created a separate mock project with the exact same classpath (same JAR files from the same locations on disk), that project works just fine and executes the actual script.
What could I be doing wrong here?
This happens with Java 1.8u241 (x64) and both jython-standalone-2.7.2.jar and an earlier 2.7.1 version. The ClassLoader in the stack trace is attempting to resolve "com".
I found the culprit entirely by accident.
My broken project used an older (ancient) version of the Glazed Lists library, namely glazedlists-1.8.0_java15.jar. This JAR seems to be directly incompatible with jython-standalone-2.7.2.jar. As soon as you put them on the same classpath and attempt to execute a Python script that imports any Java package starting with "com.", you end up with an InvalidJarIndexException. Updating to a newer version of said JAR resolved the issue.
Therefore, if you encounter a similar exception when running Jython, I suggest updating all of your project's dependencies to their latest versions. This would probably be the solution even if you are not using Jython at all.

Matlab installation (LD_LIBRARY_PATH) messes up other library files

I am trying to install Matlab on a Linux machine, but setting LD_LIBRARY_PATH (as the installation requires) breaks other library files. I am not a Linux expert, but I have tried several things and cannot get it working correctly. I have even contacted Matlab support, got the issue escalated to the dev team, and was basically told "haha sucks to suck". I have seen a few other people online who have had the same issue, but either their questions were never answered or they had a slightly different problem and their solution didn't apply to me.
Installing on a VM running Ubuntu:
I set LD_LIBRARY_PATH as the instructions say, then it breaks network files. I can ping google.com, but I cannot nslookup google.com or visit it in a browser. Nslookup provides this error:
nslookup: /usr/local/MATLAB/MATLAB_Runtime/v90/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by /usr/lib/libdns.so.100)
03-Feb-2016 11:32:22.361 ENGINE_by_id failed (crypto failure)
03-Feb-2016 11:32:22.362 error:25070067:DSO support routines:DSO_load:could not load the shared library:dso_lib.c:244:
03-Feb-2016 11:32:22.363 error:260B6084:engine routines:DYNAMIC_LOAD:dso not found:eng_dyn.c:447:
03-Feb-2016 11:32:22.363 error:2606A074:engine routines:ENGINE_by_id:no such engine:eng_list.c:418:id=gost
(null): dst_lib_init: crypto failure
The installation worked though (I can run my Java programs that reference compiled Matlab functions). Unsetting LD_LIBRARY_PATH fixes the network files but then I can't run programs anymore.
Installing on EC2 instance:
On an EC2 instance it does not break the network files (nslookup is fine). Instead it messes up Python library files. Trying to use any aws cli command, I get the error:
File "/usr/bin/aws", line 19, in <module>
import awscli.clidriver
File "/usr/lib/python2.7/dist-packages/awscli/clidriver.py", line 16, in <module>
import botocore.session
File "/usr/lib/python2.7/dist-packages/botocore/session.py", line 25, in <module>
import botocore.config
File "/usr/lib/python2.7/dist-packages/botocore/config.py", line 18, in <module>
from botocore.compat import six
File "/usr/lib/python2.7/dist-packages/botocore/compat.py", line 139, in <module>
import xml.etree.cElementTree
File "/usr/lib64/python2.7/xml/etree/cElementTree.py", line 3, in <module>
from _elementtree import *
ImportError: PyCapsule_Import could not import module "pyexpat"
Printing sys.path in Python shows lib-dynload is already there though, so that doesn't seem to be the problem.
And when trying to run the program, I get:
Exception in thread "main" java.lang.LinkageError: libXt.so.6: cannot open shared object file: No such file or directory
at com.mathworks.toolbox.javabuilder.internal.DynamicLibraryUtils.dlopen(Native Method)
at com.mathworks.toolbox.javabuilder.internal.DynamicLibraryUtils.loadLibraryAndBindNativeMethods(DynamicLibraryUtils.java:134)
at com.mathworks.toolbox.javabuilder.internal.MWMCR.<clinit>(MWMCR.java:1529)
at VectorAddExample.VectorAddExampleMCRFactory.newInstance(VectorAddExampleMCRFactory.java:48)
at VectorAddExample.VectorAddExampleMCRFactory.newInstance(VectorAddExampleMCRFactory.java:59)
at VectorAddExample.VectorAddClass.<init>(VectorAddClass.java:62)
at com.mypackage.Example.main(Example.java:13)
I'm at a brick wall and really have no clue how to proceed.
Maybe something else already needs LD_LIBRARY_PATH set in order to work. Make sure you prepend to it rather than overwrite it:
export LD_LIBRARY_PATH=new/path:$LD_LIBRARY_PATH
Edit:
OK, if LD_LIBRARY_PATH was initially empty, this suggests that Matlab comes with shared libraries that are incompatible with your system ones:
nslookup: /usr/local/MATLAB/MATLAB_Runtime/v90/bin/glnxa64/libcrypto.so.1.0.0: no version information available (required by /usr/lib/libdns.so.100)
suggests that /usr/lib/libdns.so.100 needs libcrypto.so.1.0.0, which is now being resolved to the one that comes with MATLAB, which is incompatible.
You can check the dependencies of a shared library with
ldd /usr/lib/libcrypto.so.1.0.0
and hopefully you can find a configuration that keeps both MATLAB and your system happy. Unfortunately, this may involve a lot of trial and error.
If there is no such configuration, you can try setting LD_LIBRARY_PATH only when you run MATLAB:
LD_LIBRARY_PATH=$MATLAB_LD_LIBRARY_PATH matlab
Edit 2:
Well, for the Python issue, it seems to boil down to pyexpat, which is a wrapper around the standard expat XML parser. Try doing (name guessed since I don't have a Linux right now):
ldd /usr/local/lib/python2.7/site-packages/libpyexpat.so
and see what that depends on. Probably, it will be libexpat.so, which is now being resolved to MATLAB's version.
try the following command:
export LD_LIBRARY_PATH=/usr/local/MATLAB/MATLAB_Runtime/v90/runtime/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v90/bin/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v90/sys/os/glnxa64:$LD_LIBRARY_PATH
Perhaps not helpful for the OP, but if you are generating a Python package with MATLAB, you can modify the generated __init__.py file that MATLAB creates for your package.
Specifically, the generated __init__.py file contains the following line (as of MATLAB 2017a):
PLATFORM_DICT = {'Windows': ['PATH','dll',''], 'Linux': ['LD_LIBRARY_PATH','so','libmw'], 'Darwin': ['DYLD_LIBRARY_PATH','dylib','libmw']}
On the Linux platform, you could simply replace LD_LIBRARY_PATH with something else, such as MCR_LIBRARY_PATH, to avoid mucking with your shared libs.
sed -i -e 's/LD_LIBRARY_PATH/MCR_LIBRARY_PATH/g' /MY/PACKAGE/BUILD/PATH/__init__.py
Then obviously export MCR_LIBRARY_PATH before using python.

Java - com.cloudera.sqoop vs. org.apache.sqoop which to import from sqoop jar?

I am confused about which library to import (com.cloudera.sqoop or org.apache.sqoop). With sqoop-1.4.4-hadoop200.jar included, I get this error in Eclipse:
The method run(com.cloudera.sqoop.SqoopOptions) in the type ImportTool is not applicable for the arguments (org.apache.sqoop.SqoopOptions)
for these two lines (the option parameters are added between these two lines):
SqoopOptions options = new SqoopOptions();
int ret = new ImportTool().run(options);
If I choose the Cloudera classes, the method is marked deprecated, but if I choose the Apache ones, the run method doesn't accept the options argument. Here are the screenshots.
This is also related to a question I asked earlier (Java - MySQL to Hive Import where MySQL Running on Windows and Hive Running on CentOS (Horton Sandbox)).
There don't seem to be many differences between the two SqoopOptions implementations. You can view the diff here:
http://www.diffchecker.com/n342v2f6
I would suggest using the Cloudera SqoopOptions class with the Apache ImportTool found in the org.apache.sqoop.tool package, because it accepts that type and has most of the options available.
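For illustration, a minimal sketch of that combination, assuming Sqoop 1.4.x where the org.apache.sqoop ImportTool still accepts the com.cloudera.sqoop.SqoopOptions type (the connection details are placeholders):
import com.cloudera.sqoop.SqoopOptions;
import org.apache.sqoop.tool.ImportTool;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Cloudera-namespaced options object passed to the Apache-namespaced tool
        SqoopOptions options = new SqoopOptions();
        options.setConnectString("jdbc:mysql://HOSTNAME:3306/DATABASE_NAME"); // placeholder
        options.setUsername("USERNAME");
        options.setPassword("PASSWORD");
        options.setTableName("TABLE_NAME");
        int ret = new ImportTool().run(options);
        if (ret != 0) {
            throw new RuntimeException("Sqoop import failed - return code " + ret);
        }
    }
}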

How to install apache mahout on mac?

I am new to Mahout. I want to install it and try it out. So far I have Maven3 and Java 1.6 installed and configured on my Mac. My question is:
Do I have to install Hadoop first, before installing Mahout?
Some tutorials include installing Hadoop and some don't, which confuses me. I know Mahout is built on top of Hadoop, but not all of Mahout depends on Hadoop.
Can someone provide some useful detailed resources about installation?
http://chimpler.wordpress.com/2013/02/20/playing-with-the-mahout-recommendation-engine-on-a-hadoop-cluster/
http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/
These two links helped me get up and running on OS X. It's not strictly necessary to use Hadoop with Mahout; however, it would almost certainly be useful to gain experience with both as you go if you are planning to use them in a scalable system.
Giving another answer to this question now that it's two years later and I finally got an itemsimilarity command to run on a mac after a lot of cursing and some blood spilled... Hope this saves someone some time and misery. Except my coworkers! Your weakness disgusts me! Anyway...
First for the "do I need $FINICKY_BIG_DATA_PLATFORM" question, see:
http://mahout.apache.org/users/basics/algorithms.html
Hadoop and/or Spark are not hard requirements; some algorithms run on a single machine. But the algorithm you are interested in may only run on Hadoop and/or Spark. The docs on recommendations also steer you pretty strongly toward the Spark-based algorithms. They also encourage you to use the black-box command-line commands, which can have different arguments between the single-machine and Spark versions (itemsimilarity, for example). So you don't NEED it, but you'll probably still need it.
I tried brew installs of hadoop, apache-spark and mahout. If you use the absolute latest versions (mahout 0.11.0, apache-spark 1.4.1, hadoop 2.7.1), you may have some of these problems:
" Got error Cannot find Spark class path. Is 'SPARK_HOME' set? " To fix this, not only do you need to have that environment variable set (mine is set to "/usr/local/Cellar/apache-spark/1.4.1/libexec"), you also need the apparently now deprecated compute-classpath.sh script in ${SPARK_HOME}/bin/ . I had a 1.2.0 spark installation handy, so I lifted one from there.
Bonus gotcha, in that 1.2.0 install there are two compute-classpath.sh scripts, one is just a one-liner invoking the other. You will probably be happier if you copy over the "real" one, so use less to check.
" java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path " To fix this, the Internet will tell you to get a copy of libsnappyjava.jnilib , put it in /usr/lib/java and rename it libsnappyjava.dylib . I did "brew install snappy," which installed version 1.1.3 and included symlinks named libsnappy.dylib and libsnappy.jnilib. Note that these are just symlinks and that the names aren't quite right... So after copying and renaming the main lib file I at least got a new error, which brings us to...
" Exception in thread "main" java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I " The Internet was less forthcoming with suggestions. I did see one post saying that version 1.0.xxx didn't have whatever magic pony code but version 1.1.1.3 did. I went to http://central.maven.org/maven2/org/xerial/snappy/snappy-java/ , downloaded snappy-java-1.1.1.3.jar and dropped that as-is into /usr/lib/java , no name changes. This made the snappy errors go away and I could run a "mahout spark-itemsimilarity" command to completion, YMMV, this advice is provided as-is with no warranty.
Please note that snappy error induced despair may drive you to download the spark .tgz and build it from scratch. The build process will take up ~2 hours of your life that you will never get back and you will still get snappy errors at the end. Ultimately I could run the same command with this hand-built version as with the brew installed version, the snappy jar ended up being the main thing.
You don't need Hadoop at all to try out Mahout. Below is sample code which takes a model as input from a file and prints recommendations.
package com.ml.recommend;

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class App {

    public static void main(String[] args) throws IOException, TasteException {
        // Load the user-item preference data from a file
        DataModel model = new FileDataModel(new File("data.txt"));
        // Compute user-user similarity and a neighborhood of the 3 nearest users
        UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(3, userSimilarity, model);
        // Build a user-based recommender and wrap it in a cache
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
        Recommender cachingRecommender = new CachingRecommender(recommender);
        // Recommend 10 items for the given user ID
        List<RecommendedItem> recommendations = cachingRecommender.recommend(1000000000000006075L, 10);
        System.out.println(recommendations);
    }
}

IntelliJ Plugin - Run Console Command

I am new to plugin development for IntelliJ and would like to know how I can execute a command-line command from within my plugin.
I would like to call, for instance, the command "gulp" in the current project's root directory.
I already tried using
Runtime.getRuntime().exec(commands);
with commands like "cd C:\Users\User\MyProject" and "gulp", but it does not seem to work that way, and I wonder if the plugin API provides an easier method.
I know it's a bit late (a year later), but I was recently working on an IntelliJ plugin, had the same issue, and this is what I used; it works pretty well.
First, we need to create a list of commands that we need to execute:
ArrayList<String> cmds = new ArrayList<>();
cmds.add("./gradlew");
Then
GeneralCommandLine generalCommandLine = new GeneralCommandLine(cmds);
generalCommandLine.setCharset(Charset.forName("UTF-8"));
generalCommandLine.setWorkDirectory(project.getBasePath());
ProcessHandler processHandler = new OSProcessHandler(generalCommandLine);
processHandler.startNotify();
Note that generalCommandLine.setWorkDirectory is set to the project directory, which is equivalent to the terminal command cd path/to/dir/.
The Runtime class provides an exec(String[], String[], File) method where the last argument is the working directory of the subprocess being launched.
The plugin API provides the OSProcessHandler class (as well as other classes like ProcessAdapter), which can help manage the subprocess, handle its output, etc.
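A minimal sketch of the Runtime.exec variant, for illustration; the gulp command and project path are placeholders, and passing null for the environment makes the subprocess inherit the parent's environment:
import java.io.File;
import java.io.IOException;

public class RunGulp {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Command to run; assumes gulp is on the PATH (on Windows you may need "gulp.cmd")
        String[] command = { "gulp" };
        // Second argument (environment) is null so the child inherits the current environment;
        // the third argument is the working directory of the subprocess
        Process process = Runtime.getRuntime().exec(
                command, null, new File("C:\\Users\\User\\MyProject"));
        int exitCode = process.waitFor();
        System.out.println("gulp exited with code " + exitCode);
    }
}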
Both
ProcessOutput result1 = ExecUtil.execAndGetOutput(generalCommandLine);
(where result1.getStdout() and result1.getStderr() give you the output) and
ScriptRunnerUtil.getProcessOutput(generalCommandLine, ScriptRunnerUtil.STDOUT_OUTPUT_KEY_FILTER, timeout);
work pretty well, and they are built into IntelliJ:
import com.intellij.execution.process.ScriptRunnerUtil;
import com.intellij.execution.util.ExecUtil;
