Hadoop Java - copy files from a Windows share folder server to HDFS

I want to upload multiple files from a Windows share folder server (e.g. //server_name/folder/)
to my HDFS using Java.
These are the methods I have tried:
org.apache.hadoop.fs.FileUtil with the input path set to //server_name/folder/
It fails with java.io.FileNotFoundException: File //server_name/folder/ does not exist
FileSystem.copyFromLocalFile (I think this copies from the local Hadoop server to the HDFS server)
IOUtils.copyBytes: same as FileUtil >> file does not exist
A simple File.renameTo: same as FileUtil >> file does not exist
String source_path = "\\\\server_name\\folder\\xxx.txt";
String hdfs_path = "hdfs://HADOOP_SERVER_NAME:Port/myfile/xxx.txt";
File srcFile = new File(source_path);
File dstFile = new File(hdfs_path);
srcFile.renameTo(dstFile);
Do I need to set up FTP, or should I use FTPFileSystem?
Or does anyone have a better solution or sample code?
Thank you.

FileSystem has a copyFromLocalFile method:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://abc:9000");
FileSystem fs = FileSystem.get(configuration);
// copy from the local file system into HDFS
fs.copyFromLocalFile(new Path("/source/directory/"),
        new Path("/user/hadoop/dir"));

Related

Java JAR file runs on local machine but missing file on others

The JAR file contains the ffmpeg.exe file and it runs normally on my machine without any problems. However, if I try to run it on another computer, the stack trace shows java.io.IOException: Cannot run program "ffmpeg.exe": CreateProcess error=2, The system cannot find the file specified. The way I use it is:
FFMpeg ffmpeg = new FFMpeg("ffmpeg.exe"); // the file is in the res folder
...
// FFMpeg class
public FFMpeg(String ffmpegEXE) {
    this.ffmpegEXE = ffmpegEXE;
}
The quick fix is to put ffmpeg.exe in the same folder as your .jar file.
If you want to read the file from the resources folder instead, change the code to:
URL resource = Test.class.getResource("ffmpeg.exe");
String filepath = Paths.get(resource.toURI()).toFile().getAbsolutePath();
FFMpeg ffmpeg = new FFMpeg(filepath);
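Note that Paths.get(resource.toURI()) only works while the resource is a plain file on disk (for example when running from the IDE); once ffmpeg.exe is packed inside the JAR, the URI points into the archive and the operating system cannot execute it from there. A possible workaround, not part of the original answer and only a sketch, is to copy the bundled resource to a temporary file and pass that path to FFMpeg:
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// copy the bundled ffmpeg.exe out of the JAR so the OS can execute it
Path tmp = Files.createTempFile("ffmpeg", ".exe");
try (InputStream in = Test.class.getResourceAsStream("ffmpeg.exe")) {
    Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
}
tmp.toFile().deleteOnExit();
FFMpeg ffmpeg = new FFMpeg(tmp.toAbsolutePath().toString());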

Transfer file from SFTP to ADLS

We are currently exploring the sshj library to download a file from an SFTP path into ADLS. We are using the example as a reference.
We have already configured the ADLS Gen2 storage in Databricks to be accessed as an abfss URL.
We are using Scala within Databricks.
How should we pass the abfss path as a FileSystemFile object in the get step?
sftp.get("test_file", new FileSystemFile("abfss://<container_name>@<storage_account>.dfs.core.windows.net/<path>"));
Is the destination supposed to be a file path only, or a file path with the file name?
Use streams. First obtain an InputStream for the source SFTP file:
RemoteFile f = sftp.open(sftpPath);
InputStream is = f.new RemoteFileInputStream(0);
(How to read from the remote file into a Stream?)
Then obtain an OutputStream for the destination file on ADLS:
OutputStream os = adlsStoreClient.createFile(adlsPath, IfExists.OVERWRITE);
(How to upload and download a file from my locale to azure adls using java sdk?)
And copy from the first to the other:
is.transferTo(os);
(Easy way to write contents of a Java InputStream to an OutputStream)
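Put together, a minimal sketch of the whole copy in Java (sftp, sftpPath, adlsStoreClient and adlsPath are assumed to be already created and authenticated as in the snippets above; the ADLStoreClient API is from the Azure Data Lake Store SDK referenced there):
import java.io.InputStream;
import java.io.OutputStream;
import com.microsoft.azure.datalake.store.IfExists;
import net.schmizz.sshj.sftp.RemoteFile;

// open the source file on the SFTP server
RemoteFile f = sftp.open(sftpPath);
try (InputStream is = f.new RemoteFileInputStream(0);
     OutputStream os = adlsStoreClient.createFile(adlsPath, IfExists.OVERWRITE)) {
    // copy the bytes across; transferTo is Java 9+, on older JVMs use a byte[] buffer loop
    is.transferTo(os);
} finally {
    f.close(); // release the SFTP file handle as well
}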

Can't access HDFS via Java API (Cloudera-CDH4.4.0)

I'm trying to access my HDFS using Java code but I can't get it working... after 2 days of struggling I think it's time to ask for help.
This is my code:
Configuration conf = new Configuration();
conf.addResource(new Path("/HADOOP_HOME/conf/core-site.xml"));
conf.addResource(new Path("/HADOOP_HOME/conf/hdfs-site.xml"));
FileSystem hdfs = FileSystem.get(conf);
boolean success = hdfs.mkdirs(new Path("/user/cloudera/testdirectory"));
System.out.println(success);
I got this code from here and here.
Unfortunately the hdfs object is just a LocalFileSystem object, so something must be wrong. It looks like this is exactly what Rejeev wrote on his website:
[...] If you do not assign the configurations to conf object (using hadoop xml file) your HDFS operation will be performed on the local file system and not on the HDFS. [...]
With absolute paths I get the same result.
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
This is the library I'm currently using:
hadoop-core-2.0.0-mr1-cdh4.4.0.jar
I heard that hadoop-core was split into multiple libs so I also tried the following libs:
hadoop-common-2.0.0-alpha.jar
hadoop-mapreduce-client-core-2.0.2-alpha.jar
I'm using Cloudera CDH4.4.0, so Hadoop is already installed. Via the console everything works fine.
For example:
hadoop fs -mkdir testdirectory
So everything should be set up correctly by default.
I hope that you guys can help me... this stuff is driving me nuts! It's extremely frustrating to fail with such a simple task.
Many thanks in advance for any help.
Try this:
conf.set("fs.defaultFS", "file:///");
conf.set("mapreduce.framework.name", "local");
1) You don't need conf.addResource unless you are overriding configuration variables.
2) I hope you are creating a JAR file and running it from the command line, not from Eclipse.
If you execute it in Eclipse, it will run against the local file system.
3) I ran the code below and it worked.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Hmkdirs {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        boolean success = fs.mkdirs(new Path("/user/cloudera/testdirectory1"));
        System.out.println(success);
    }
}
4) To execute it, you need to create a JAR file. You can do that either from Eclipse or from the command prompt,
and then run the JAR file.
Sample command-prompt JAR build:
javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar:/usr/local/hadoop/lib/commons-cli-1.2.jar -d classes WordCount.java && jar -cvf WordCount.jar -C classes/ .
JAR file execution via hadoop at the command prompt:
hadoop jar hadoopfile.jar hadoop.sample.fileaccess.Hmkdirs
hadoop.sample.fileaccess is the package in which my class Hmkdirs exists. If your class is in the default package, you don't have to specify it; just the class name is fine.
Update: You can execute from Eclipse and still access HDFS; check the code below.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HmkdirsFromEclipse {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.addResource("/etc/hadoop/conf/core-site.xml");
        conf.addResource("/etc/hadoop/conf/hdfs-site.xml");
        conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020/");
        conf.set("hadoop.job.ugi", "cloudera");
        conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        FileSystem fs = FileSystem.get(conf);
        boolean success = fs.mkdirs(new Path("/user/cloudera/testdirectory9"));
        System.out.println(success);
    }
}
This is indeed a tricky bit of configuration, but this is essentially what you need to do:
Configuration conf = new Configuration();
conf.addResource("/etc/hadoop/conf/core-site.xml");
conf.addResource("/etc/hadoop/conf/hdfs-site.xml");
conf.set("fs.defaultFS", hdfs://[your namenode]);
conf.set("hadoop.job.ugi", [your user]
conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
Make sure you have hadoop-hdfs on your classpath, too.
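A quick sanity check (just a sketch, not part of the original answer) is to print which FileSystem implementation the factory actually returned; if it is a LocalFileSystem, the XML resources were never picked up:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// the Path overload reads the files from the local disk,
// while the String overload looks the name up on the classpath
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

FileSystem fs = FileSystem.get(conf);
// expect org.apache.hadoop.hdfs.DistributedFileSystem and an hdfs:// URI here
System.out.println(fs.getClass().getName());
System.out.println(fs.getUri());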

Compare a remote file with a local file in java

I'm testing a web app with Selenium and I run the test on a remote XP VM. In my test, I download a file and I want to compare it with a local one. Can I compare these two files?
This is part of the code:
//the local file path
File file1 = new File("C:/export_transaction.csv");
//the remote file path
File file2 = new File("C:/Documents and Settings/Export_Transaction.csv");
Properties prop1 = new Properties();
Properties prop2 = new Properties();
prop1.load(new FileReader(file1));
prop2.load(new FileReader(file2));
Assert.assertEquals(prop1, prop2);
The problem is that I can't access the remote file path. Is there a way to specify the path of the remote file?
Please help!

File from a hadoop distributed cache is presented as directory

When using the DistributedCache in Hadoop, I manage to push the files from HDFS in the driver class like this:
FileSystem fileSystem = FileSystem.get(getConf());
DistributedCache.createSymlink(conf);
DistributedCache.addCacheFile(fileSystem.getUri().resolve("/dumps" + "#" + "file.txt"), job.getConfiguration());
Then, to read the file, in the Mapper's setup() I do:
Path localPaths[] = context.getLocalCacheFiles();
The file is located in the cache, under the path /tmp/solr-map-reduce/yarn-local-dirs/usercache/user/appcache/application_1398146231614_0045/container_1398146231614_0045_01_000004/file.txt. But when I read it, I get an IOException: file is a directory.
How can one go about solving this?
