How to add jars on the Hive shell in a Java application - java

I am trying to add jars on the Hive shell. I am aware of the global option on the server, but my requirement is to add them per session on the Hive shell.
I have used the FileSystem class for the hdfs dfs commands, to add the jars to the HDFS file system.
This is what I have tried (a code sketch of these steps follows the list):
Created a folder /tmp on HDFS.
Added the file to the HDFS file system using the FileSystem.copyFromLocalFile method (equivalent to hdfs dfs -put myjar.jar /tmp).
Set permissions on the file on the fs file system.
Checked that the jar was loaded to HDFS using the getFileSystem method.
Listed files on the fs FileSystem using listFiles to confirm the jars are there.
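For reference, a minimal sketch of those upload steps; the local and HDFS paths and the permission bits are illustrative, and conf must point at the cluster file system (as discussed in the answer below):
// Sketch only; classes come from org.apache.hadoop.fs and org.apache.hadoop.fs.permission.
Configuration conf = new Configuration();                 // must point at the cluster, see the fix below
FileSystem fs = FileSystem.newInstance(conf);

Path hdfsDir = new Path("/tmp");
if (!fs.exists(hdfsDir)) {
    fs.mkdirs(hdfsDir);                                   // create the /tmp folder on HDFS
}

Path localJar = new Path("file:///home/user/myjar.jar");  // hypothetical local path
Path hdfsJar = new Path("/tmp/myjar.jar");
fs.copyFromLocalFile(false, true, localJar, hdfsJar);     // equivalent to: hdfs dfs -put myjar.jar /tmp

fs.setPermission(hdfsJar, new FsPermission((short) 0755)); // example permission bits

// confirm the jar is there
RemoteIterator<LocatedFileStatus> it = fs.listFiles(hdfsDir, false);
while (it.hasNext()) {
    System.out.println(it.next().getPath());
}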
This works and I have the jars loaded to HDFS, but I cannot add the jars to the Hive session.
When I try to add a jar in the Hive shell, I do the following:
statement = setStmt(createStatement(getConnection()));
query = "add jar " + path;
statement.execute(query);
I am getting this error (for an example path of /tmp/myjar.jar):
Error while processing statement: /tmp/myjar.jar does not exist
Other permutations of the path, such as
query = "add jar hdfs://<host>:<port>" + path;
query = "add jar <host>:<port>" + path;
result in an error as well.
The command to list jars works (but returns no results):
query = "list jars";
ResultSet rs = statement.executeQuery(query);

I managed to solve this issue.
The process failed because of the configuration of the FileSystem object, which is where we upload the jars before adding them to the session.
This is how you initialize the FileSystem:
FileSystem fs = FileSystem.newInstance(conf);
The conf object should have the properties of the Hive server. In order for the process to work, I needed to set the following parameter on the Configuration object:
conf.set("fs.defaultFS", hdfsDstStr);

Related

Copy file from HDFS to local directory from within Java Code when run using spark-submit in cluster mode

I am working on a Java program where some code generates a file and stores it at an HDFS path. Then I need to bring that file to the local machine storage/NAS and store it there. I am using the code below for this:
Configuration hadoopConf = new Configuration();
FileSystem hdfs = FileSystem.get(hadoopConf);
Path srcPath = new Path("/some/hdfs/path/someFile.csv");
Path destPath = new Path("file:///data/output/files/");
hdfs.copyToLocalFile(false, srcPath, destPath, false);
This gives me the error below:
java.io.IOException: Mkdirs failed to create file:/data/output (exists=false, cwd=file:/data7/yarn/some/other/path)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:926)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:907)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:368)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
...
Below is the command used to run the Java application:
spark-submit --master yarn --deploy-mode cluster ..............
I am new to Spark/Hadoop, but from a couple of other questions on SO and the web, it seems that since this runs in cluster mode, any machine can act as the driver, and FileSystem.copyToLocalFile will point to whichever machine is acting as the driver.
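For illustration only (not a full answer), a small check along the lines of that reasoning, reusing hdfs and srcPath from the snippet above, makes the failure mode explicit on whichever node runs the driver:
// Illustrative: in cluster mode this runs on whichever node hosts the driver, so the
// local destination (here the NAS path from above) must exist and be writable there.
java.io.File localDir = new java.io.File("/data/output/files");
if (!localDir.exists() && !localDir.mkdirs()) {
    throw new java.io.IOException("Cannot create " + localDir + " on the node running the driver");
}
hdfs.copyToLocalFile(false, srcPath, new Path("file://" + localDir.getAbsolutePath()), false);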
Any suggestions on how I can bring that CSV file to the local machine would be appreciated.

how to run hadoop with external jar?

I'm going to run my MapReduce project on Hadoop 2.9.0. I'm using the xml-rpc package in my project as follows:
import org.apache.xmlrpc.*;
I put the additional jars into the lib folder, and when I run my project jar in Hadoop, it shows this error:
Error: java.lang.ClassNotFoundException: org.apache.xmlrpc.XmlRpcClient
I executed this command:
bin/hadoop jar MRV.jar SumMR /user/hadoop/input /user/hadoop/output -libjars lib/xmlrpc-2.0.1.jar: lib/commons-codec-1.10.jar
How can I execute this command without the ClassNotFoundException error?
private static void addJarToDistributedCache(Class classToAdd, Configuration conf)
        throws IOException {
    // Retrieve the jar file that contains classToAdd
    String jar = classToAdd.getProtectionDomain()
            .getCodeSource().getLocation().getPath();
    File jarFile = new File(jar);

    // Declare the new HDFS location
    Path hdfsJar = new Path("/user/hadoopi/lib/" + jarFile.getName());

    // Mount HDFS
    FileSystem hdfs = FileSystem.get(conf);

    // Copy (overwrite) the jar file to HDFS
    hdfs.copyFromLocalFile(false, true, new Path(jar), hdfsJar);

    // Add the jar to the distributed classpath
    DistributedCache.addFileToClassPath(hdfsJar, conf);
}
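For context, a hedged sketch of how such a helper might be called from the driver before job submission; the call site and job setup are illustrative, while SumMR and the xmlrpc/commons-codec classes come from the question:
// Illustrative driver usage: ship the third-party jars (resolved from classes that live
// in them, so those jars must be on the driver's classpath) before submitting the job.
Configuration conf = new Configuration();
addJarToDistributedCache(org.apache.xmlrpc.XmlRpcClient.class, conf);
addJarToDistributedCache(org.apache.commons.codec.binary.Base64.class, conf);

Job job = Job.getInstance(conf, "SumMR");
job.setJarByClass(SumMR.class);
// ... configure mapper/reducer and input/output paths as usual, then:
// job.waitForCompletion(true);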

Backup and restore of Hsqldb database in java code

I am new to the HSQLDB database. I want to know how to take a backup and restore an HSQLDB database through Java code.
Use the BACKUP DATABASE TO command.
Here is a link to the documentation:
HSQLDB System Management Documentation
I haven't tested this, but I imagine it's something along the lines of:
String backup = "BACKUP DATABASE TO " + "'" + filePath + "' BLOCKING";
PreparedStatement preparedStatement = connection.prepareStatement(backup);
preparedStatement.execute();
You'll want to wrap it in a try-catch block of course.
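For example, a minimal sketch with the statement closed automatically and errors handled; filePath and connection are the same assumed variables as above, and per the documentation below the path should be a directory ending with '/':
String backup = "BACKUP DATABASE TO '" + filePath + "' BLOCKING";
try (PreparedStatement preparedStatement = connection.prepareStatement(backup)) {
    preparedStatement.execute();
} catch (SQLException e) {
    // log or rethrow as appropriate
    e.printStackTrace();
}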
As far as restoring the db goes, I think you have to perform that while the database is offline using the DbBackupMain application. So you would issue this command at the command line:
java -cp hsqldb.jar org.hsqldb.lib.tar.DbBackupMain --extract tardir/backup.tar dbdir
Each HyperSQL database is called a catalog. There are three types of catalog depending on how the data is stored.
Types of catalog data :
mem: stored entirely in RAM - without any persistence beyond the JVM process's life
file: stored in filesystem files
res: stored in a Java resource, such as a Jar and always read-only
To back up a running catalog, obtain a JDBC connection and issue a BACKUP DATABASE command in SQL. In its most simple form, the command format below will backup the database as a single .tar.gz file to the given directory.
BACKUP DATABASE TO <directory name> BLOCKING [ AS FILES ]
The directory name must end with a slash to distinguish it as a directory, and the whole string must be in single quotes like so: 'subdir/nesteddir/'.
To back up an offline catalog, the catalog must be in a shut-down state. You will run a Java command like:
java -cp hsqldb.jar org.hsqldb.lib.tar.DbBackupMain --save tardir/backup.tar dbdir/dbname
In this example, the database is named dbname and is in the dbdir directory. The backup is saved to a file named backup.tar in the tardir directory.
Here, tardir/backup.tar is the path to the *.tar or *.tar.gz file to be created in your file system, and dbdir/dbname is the path to the catalog file base name.
You use DbBackup on your operating system command line to restore a catalog from a backup.
java -cp hsqldb.jar org.hsqldb.lib.tar.DbBackupMain --extract tardir/backup.tar dbdir
where tardir/backup.tar is a file path to the *.tar or *.tar.gz file to be read, and dbdir is the target directory to extract the catalog files into. Note that dbdir specifies a directory path, without the catalog file base name. The files will be created with the names stored in the tar file.
For more details, refer to the HSQLDB System Management documentation.
So in Java + Spring + JdbcTemplate:
Backup (On-line):
@Autowired
public JdbcTemplate jdbcTemplate;

public void mainBackupAndRestore() throws IOException {
    ...
    jdbcTemplate.execute("BACKUP DATABASE TO '" + sourceFile.getAbsolutePath() + "' BLOCKING");
}
This will save the .properties, .script and .lobs files to a tar in sourceFile.getAbsolutePath().
Restore:
DbBackupMain.main(new String[] { "--extract", baseDir.getAbsolutePath(),
System.getProperty("user.home") + "/restoreFolder" });
This will get the files from baseDir.getAbsolutePath() and put them in userHome/restoreFolder, where you can check whether the restore is OK.
The .lobs file contains lob/blob data; the .script file contains the executed queries.

File from a hadoop distributed cache is presented as directory

When using the DistributedCache in Hadoop, I manage to push the files from HDFS in the driver class like this:
FileSystem fileSystem = FileSystem.get(getConf());
DistributedCache.createSymlink(conf);
DistributedCache.addCacheFile(fileSystem.getUri().resolve("/dumps" + "#" + "file.txt"), job.getConfiguration());
Then, to read the file, in the setup() of Mapper I do:
Path localPaths[] = context.getLocalCacheFiles();
The file is located in the cache, under a path /tmp/solr-map-reduce/yarn-local-dirs/usercache/user/appcache/application_1398146231614_0045/container_1398146231614_0045_01_000004/file.txt. But when I read it, I get IOException: file is a directory.
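For reference, a minimal sketch (not taken from the question) of the kind of setup() read where that IOException would surface, assuming the mapper reads the first cached path:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Path[] localPaths = context.getLocalCacheFiles();
    if (localPaths != null && localPaths.length > 0) {
        // This is where "file is a directory" is thrown when the cached URI
        // resolves to a directory rather than the file itself.
        try (BufferedReader reader = new BufferedReader(new FileReader(localPaths[0].toString()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // process line
            }
        }
    }
}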
How can one go about solving this?

Start a java application from Hadoop YARN

I'm trying to run a Java application from a YARN application (in detail: from the ApplicationMaster in the YARN app). All examples I found deal with bash scripts that are run.
My problem seems to be that I distribute the JAR file incorrectly to the nodes in my cluster. I specify the JAR as a local resource in the YARN client.
Path jarPath2 = new Path("/hdfs/yarn1/08_PrimeCalculator.jar");
jarPath2 = fs.makeQualified(jarPath2);

FileStatus jarStat2 = null;
try {
    jarStat2 = fs.getFileStatus(jarPath2);
    log.log(Level.INFO, "JAR path in HDFS is " + jarStat2.getPath());
} catch (IOException e) {
    e.printStackTrace();
}

LocalResource packageResource = Records.newRecord(LocalResource.class);
packageResource.setResource(ConverterUtils.getYarnUrlFromPath(jarPath2));
packageResource.setSize(jarStat2.getLen());
packageResource.setTimestamp(jarStat2.getModificationTime());
packageResource.setType(LocalResourceType.ARCHIVE);
packageResource.setVisibility(LocalResourceVisibility.PUBLIC);

Map<String, LocalResource> res = new HashMap<String, LocalResource>();
res.put("package", packageResource);
So my JAR is supposed to be distributed to the ApplicationMaster and be unpacked since I specify the ResourceType to be an ARCHIVE. On the AM I try to call a class from the JAR like this:
String command = "java -cp './package/*' de.jofre.prime.PrimeCalculator";
When running the application, the Hadoop logs tell me: "Could not find or load main class de.jofre.prime.PrimeCalculator". The class exists at exactly the path shown in the error message.
Any ideas what I am doing wrong here?
I found out how to start a Java process from an ApplicationMaster. In fact, my problem was caused by the command used to start the process, even though it is the officially documented way provided by the Apache Hadoop project.
What I did now was to specify the packageResource to be a file, not an archive:
packageResource.setType(LocalResourceType.FILE);
Now the NodeManager does not extract the resource but leaves it as a file, in my case a JAR.
To start the process I call:
java -jar primecalculator.jar
To start a JAR without specifying a main class on the command line, you have to specify the main class in the MANIFEST file (manually, or let Maven do it for you).
To sum it up: I did NOT add the resource as an archive but as a file, and I did not use the -cp option to add the symlink folder that Hadoop creates for the extracted archive. I simply started the JAR via the -jar parameter, and that's it.
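As a hedged sketch, the relevant changes to the snippet above would look roughly like this (names and paths as in the question; the map key becomes the file name the jar is localized under in the container):
// Register the jar as a plain FILE so the NodeManager localizes it without unpacking.
LocalResource packageResource = Records.newRecord(LocalResource.class);
packageResource.setResource(ConverterUtils.getYarnUrlFromPath(jarPath2));
packageResource.setSize(jarStat2.getLen());
packageResource.setTimestamp(jarStat2.getModificationTime());
packageResource.setType(LocalResourceType.FILE);        // FILE instead of ARCHIVE
packageResource.setVisibility(LocalResourceVisibility.PUBLIC);

Map<String, LocalResource> res = new HashMap<String, LocalResource>();
res.put("primecalculator.jar", packageResource);        // localized file name in the container

// Container launch command: the main class comes from the jar's MANIFEST.
String command = "java -jar primecalculator.jar";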
Hope it helps you guys!
