I have a Java code sample that uploads a file to S3:
File f = new File("/home/myuser/test");
TransferManager transferManager = new TransferManager(credentials);
MultipleFileUpload upload = transferManager.uploadDirectory("mybucket","test_folder",f,true);
I would actually like to upload from HDFS to S3. I don't want to do anything complicated, so I was wondering if I can use the code that I already have. So is there a way to transform a Hadoop FileSystem object to a File object? Something like this:
FileSystem fs = ... // file system from hdfs path
File f = fs.toFile()
Thanks,
Serban
There is no way around downloading the HDFS file to your local file system if you want to use the File class, because File can only represent a local file on your disk. However, you can use Hadoop's own Path and FileSystem classes to obtain an input stream to your file on HDFS:
Configuration conf = new Configuration();
// load the Hadoop config files
conf.addResource(new Path("HADOOP_DIR/conf/core-site.xml"));
conf.addResource(new Path("HADOOP_DIR/conf/hdfs-site.xml"));

Path path = new Path("hdfs:///home/myuser/test");
FileSystem fs = path.getFileSystem(conf);
FSDataInputStream inputStream = fs.open(path);
// do whatever you want with the stream
fs.close();
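If the goal is still the S3 upload from the first snippet, you can feed that stream straight into the TransferManager without ever materializing a local File. Here is a minimal sketch, assuming the AWS SDK v1 TransferManager and reusing the credentials, bucket, and key names from the question (one HDFS file per S3 object):

public static void uploadHdfsFileToS3(TransferManager transferManager,
                                      Configuration conf) throws Exception {
    Path path = new Path("hdfs:///home/myuser/test");
    FileSystem fs = path.getFileSystem(conf);

    ObjectMetadata metadata = new ObjectMetadata();
    // Setting the content length up front lets the SDK stream the upload
    // instead of buffering the whole file in memory to determine its size.
    metadata.setContentLength(fs.getFileStatus(path).getLen());

    try (FSDataInputStream in = fs.open(path)) {
        // Bucket and key below are just the names from the original question.
        Upload upload = transferManager.upload("mybucket", "test_folder/test", in, metadata);
        upload.waitForCompletion();
    }
}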
Related
So after 36 hours of experimenting with this and that, I have finally managed to get a cluster up and running, but now I am confused about how I can write files to it using Java. A tutorial said this program should be used, but I don't understand it at all and it doesn't work either.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileWriteToHDFS {

    public static void main(String[] args) throws Exception {
        // Source file in the local file system
        String localSrc = args[0];
        // Destination file in HDFS
        String dst = args[1];

        // Input stream for the local file to be written to HDFS
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        // Get the configuration of the Hadoop system
        Configuration conf = new Configuration();
        System.out.println("Connecting to -- " + conf.get("fs.defaultFS"));

        // Destination file in HDFS
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst));

        // Copy the file from local to HDFS
        IOUtils.copyBytes(in, out, 4096, true);
        System.out.println(dst + " copied to HDFS");
    }
}
My confusion is: how does this piece of code identify the specifics of my cluster? How will it know where the masternode is and where the slavenodes are?
Furthermore, when I run this code and provide a local file as the source and leave the destination blank (or provide only a file name), the program writes the file back to my local storage and not to the location that I defined as storage space for my namenode and datanodes. Should I be providing this path manually? How does this work? Please suggest a blog that can help me understand it better, or a minimal working example.
First off, you'll need to add some Hadoop libraries to your classpath. Without those, no, that code won't work.
How will it know where the masternode is and where the slavenodes are?
From the new Configuration(); and subsequent conf.get("fs.defaultFS").
It reads the core-site.xml found under the HADOOP_CONF_DIR environment variable and returns the address of the namenode. The client only needs to talk to the namenode to receive the locations of the datanodes, to and from which the file blocks are actually written and read.
the program writes the file back to my local storage
It's not clear where you've configured the filesystem, but the default is file://, your local disk. You change this in core-site.xml. If you follow the Hadoop documentation, the pseudo-distributed cluster setup mentions this.
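If you'd rather not depend on core-site.xml being picked up from the classpath, you can also set the address in code. A minimal sketch; the NameNode host and port below are placeholders for your own cluster:

Configuration conf = new Configuration();
// Point the client at the cluster explicitly instead of relying on core-site.xml.
conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

FileSystem fs = FileSystem.get(conf);
System.out.println("Connected to -- " + conf.get("fs.defaultFS"));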
It's also not clear why you need your own Java code when hdfs dfs -put will do the same thing.
I'm developing a small program that uploads files to and downloads files from my Box account.
I looked at the docs about uploading files and I found this code:
BoxFolder rootFolder = BoxFolder.getRootFolder(api);
FileInputStream stream = new FileInputStream("My File.txt");
rootFolder.uploadFile(stream, "My File.txt");
stream.close();
I don't really understand how it works. Where can I put the path to the file I want to upload? Or should I use different code?
The constructor for FileInputStream takes in a path to a file. The example in the documentation is uploading a file with the path "./My File.txt" relative to the current directory.
To make it a bit clearer, here's an example using a full path:
BoxFolder rootFolder = BoxFolder.getRootFolder(api);
FileInputStream stream = new FileInputStream("/path/to/My File.txt");
rootFolder.uploadFile(stream, "My File.txt");
stream.close();
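If you want the stream to be closed even when the upload fails, the same call works with try-with-resources (the path here is again just a placeholder):

BoxFolder rootFolder = BoxFolder.getRootFolder(api);
try (FileInputStream stream = new FileInputStream("/path/to/My File.txt")) {
    // The second argument is only the name the file will get in Box,
    // independent of the local path.
    rootFolder.uploadFile(stream, "My File.txt");
}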
I have a Hadoop job that outputs many part files to a folder in HDFS.
For example:
/output/s3/2014-09-10/part...
What is the best way, using the S3 Java API, to upload those parts as a single file in S3?
For example:
s3://jobBucket/output-file-2014-09-10.csv
As a possible solution, there is the option to merge the parts and write the result to a single HDFS file, but that would create double I/O.
Using a single reducer is not an option either.
Thanks,
Try the FileUtil#copyMerge method; it allows you to copy data between two file systems. I also found the S3DistCp tool, which can copy data from HDFS to Amazon S3. You can specify the --groupBy option with a regex (e.g. (.*)) to merge the files.
Snippet for a Spark process:
void sparkProcess() {
    SparkConf sparkConf = new SparkConf().setAppName("name");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    Configuration hadoopConf = sc.hadoopConfiguration();
    hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKey);
    hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretKey);

    String folderPath = "s3://bucket/output/folder";
    String mergedFilePath = "s3://bucket/output/result.txt";
    BatchFileUtil.copyMerge(hadoopConf, folderPath, mergedFilePath);
}

public static boolean copyMerge(Configuration hadoopConfig, String srcPath, String dstPath)
        throws IOException, URISyntaxException {
    FileSystem hdfs = FileSystem.get(new URI(srcPath), hadoopConfig);
    return FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath),
            false, hadoopConfig, null);
}
Use the Java HDFS API to read the files, then use standard Java stream handling to turn them into an InputStream, and then use a PutObjectRequest:
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/PutObjectRequest.html
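A rough sketch of that approach, assuming the AWS SDK v1 AmazonS3 client; the bucket, key, and folder names are just the ones from the question, and the part streams are chained with a SequenceInputStream so S3 sees one continuous object:

public static void uploadMerged(AmazonS3 s3, Configuration conf) throws IOException {
    Path dir = new Path("hdfs:///output/s3/2014-09-10");
    FileSystem fs = dir.getFileSystem(conf);

    long totalLength = 0;
    Vector<InputStream> streams = new Vector<>();
    for (FileStatus status : fs.listStatus(dir)) {
        if (status.isFile() && status.getPath().getName().startsWith("part")) {
            totalLength += status.getLen();
            streams.add(fs.open(status.getPath()));
        }
    }

    // Chain the part streams so they are read back-to-back as a single object.
    try (InputStream merged = new SequenceInputStream(streams.elements())) {
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(totalLength);
        s3.putObject(new PutObjectRequest("jobBucket",
                "output-file-2014-09-10.csv", merged, metadata));
    }
}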
See also
https://stackoverflow.com/a/11116119/1586965
I have created a Java project in which I am also using a properties file, created inside a Java package named abcdef.
So the package name is abcdef, and it contains a class named abc.java and a properties file named drg.properties. From the class abc.java I refer to that properties file like this:
abc tt = new abc();
URL url = tt.getClass().getResource("./drg.properties");
File file = new File(url.getPath());
FileInputStream fileInput = new FileInputStream(file);
Now the file is found and my program runs successfully, but when I build it into an executable JAR, the properties file is no longer found.
Please advise what went wrong with how I am loading the properties file.
Use
tt.getClass().getResourceAsStream("./drg.properties");
to access the properties file inside a JAR. You will get an InputStream as the returned object.
-------------------------------------------------
Here is an example of loading the InputStream into a Properties object:
InputStream in = tt.getClass().getResourceAsStream("./drg.properties");
Properties properties = new Properties();
properties.load(in); // Loads content into properties object
in.close();
In your case, you can directly use the InputStream instead of a FileInputStream.
When you access "jarred" resource you can't access it directly as you access a resource on your HDD with new File() (because resource doesn't live uncompressed on your drive) but you have to access resource (stored in your application jar) using Class.getResourceAsStream()
The code will look like this (using the Java 7 try-with-resources feature):
Properties p = new Properties();
try (InputStream is = tt.getClass().getResourceAsStream("./drg.properties")) {
    p.load(is); // Loads content into p object
}
I have a question regarding the Java File class. When I create a File instance, for example,
File aFile = new File(path);
Where is the instance aFile stored on the computer? Is it stored in the JVM? I mean, is there a temp file stored on the local disk?
If I have an InputStream instance and write it to a file using an OutputStream, for example
File aFile = new File("test.txt");
OutputStream anOutputStream = new FileOutputStream(aFile);
byte[] aBuffer = new byte[1024];
int iLength;
while( ( iLength = anInputStream.read( aBuffer ) ) > 0)
{
anOutputStream.write( aBuffer, 0, iLength);
}
Now where is the file test.txt stored?
Thanks in advance!
A File object isn't a real file at all - it's really just a filename/location, and methods which hook into the file system to check whether or not the file really exists etc. There's no content directly associated with the File instance - it's not like it's a virtual in-memory file, for example. The instance itself is just an object in memory like any other object.
Creating a File instance on its own does nothing to the file system.
When you create a FileOutputStream, however, that does affect whatever file system you're writing to. The File instance is relatively irrelevant though - you'd get the same effect from:
OutputStream anOutputStream = new FileOutputStream("test.txt");
It will write the file wherever you specify with the path argument.
In your case, it will be written in the directory from which you run your Java class.
If you specify /test/myproject/myfile.txt
it will go in /test/myproject/myfile.txt
If you don't provide a path, it is stored in the current working directory (i.e. the directory from which the JVM was launched). If you provide a full path, it is stored there.
Regardless, it is always stored in the filesystem, not in JVM memory.
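A small check you can drop into your program to see where a relative path will actually end up (the file name is arbitrary):

File aFile = new File("test.txt");
System.out.println(System.getProperty("user.dir")); // the current working directory
System.out.println(aFile.getAbsolutePath());        // the full path the file would be written to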