How to upload multiple files from HDFS to a single S3 file? - java

I have a Hadoop job that outputs many parts to HDFS, for example into some folder.
For example:
/output/s3/2014-09-10/part...
What is the best way, using the S3 Java API, to upload those parts to a single file in S3?
For example:
s3:/jobBucket/output-file-2014-09-10.csv
As a possible solution, I could merge the parts and write the result to a single HDFS file, but this would double the I/O.
Using a single reducer is not an option either.
Thanks,

Try the FileUtil#copyMerge method; it allows you to copy data between two file systems. I also found the S3DistCp tool, which can copy data from HDFS to Amazon S3. You can specify the --groupBy,(.*) option to merge the files.

Snippet for a Spark process:
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

void sparkProcess() throws IOException, URISyntaxException {
    SparkConf sparkConf = new SparkConf().setAppName("name");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    Configuration hadoopConf = sc.hadoopConfiguration();
    // awsAccessKey and awsSecretKey are assumed to be defined elsewhere
    hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKey);
    hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretKey);
    String folderPath = "s3://bucket/output/folder";
    String mergedFilePath = "s3://bucket/output/result.txt";
    BatchFileUtil.copyMerge(hadoopConf, folderPath, mergedFilePath);
}

// Defined in BatchFileUtil: merges every file under srcPath into the single file at dstPath
public static boolean copyMerge(Configuration hadoopConfig, String srcPath, String dstPath) throws IOException, URISyntaxException {
    FileSystem hdfs = FileSystem.get(new URI(srcPath), hadoopConfig);
    return FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null);
}

Use the Java HDFS API to read the files, wrap the data in an InputStream, and then upload it with a PutObjectRequest:
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/PutObjectRequest.html
See also
https://stackoverflow.com/a/11116119/1586965
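A hedged sketch of that approach, assuming the AWS SDK for Java v1 and Hadoop 2.x (the directory, bucket, key, and part-file filter below are illustrative placeholders, not from the original post): the part files are concatenated into one InputStream and pushed to S3 in a single PutObjectRequest.
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPartsToSingleS3Object {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path partsDir = new Path("hdfs:///output/s3/2014-09-10"); // placeholder parts directory
        FileSystem fs = partsDir.getFileSystem(conf);

        long totalLength = 0;
        List<InputStream> streams = new ArrayList<>();
        for (FileStatus status : fs.listStatus(partsDir, p -> p.getName().startsWith("part-"))) {
            totalLength += status.getLen();
            streams.add(fs.open(status.getPath()));
        }

        // Concatenate all part streams so S3 sees them as one continuous object body.
        InputStream merged = new SequenceInputStream(Collections.enumeration(streams));

        // The SDK needs the content length up front, otherwise it buffers the stream in memory.
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(totalLength);

        // Region and credentials are taken from the default provider chain.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
        s3.putObject(new PutObjectRequest("jobBucket", "output-file-2014-09-10.csv", merged, metadata));
        merged.close();
    }
}
Note that a single PUT is limited in object size, so very large merged outputs would need the multipart upload API instead.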

Related

Use Apache Commons VFS RAM file to avoid using file system with API requiring a file

There is a highly upvoted comment on this post:
how to create new java.io.File in memory?
where Sorin Postelnicu mentions using an Apache Commons VFS RAM file as a way to have an in-memory file to pass to an API that requires a java.io.File (I am paraphrasing... I hope I haven't missed the point).
Based on reading related posts I have come up with this sample code:
import java.io.IOException;
import java.io.OutputStream;
import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.impl.DefaultFileSystemManager;
import org.apache.commons.vfs2.provider.ram.RamFileProvider;
import org.junit.Test;

@Test
public void working() throws IOException {
    DefaultFileSystemManager manager = new DefaultFileSystemManager();
    manager.addProvider("ram", new RamFileProvider());
    manager.init();

    final String rootPath = "ram://virtual";
    manager.createVirtualFileSystem(rootPath);

    // Create an in-memory file and write some content to it
    String hello = "Hello, World!";
    FileObject testFile = manager.resolveFile(rootPath + "/test.txt");
    testFile.createFile();
    OutputStream os = testFile.getContent().getOutputStream();
    os.write(hello.getBytes());
    os.close();

    testFile.close();
    manager.close();
}
So I think I now have an in-memory file called ram://virtual/test.txt with the contents "Hello, World!".
My question is: how could I use this file with an API that requires a java.io.File?
Java's File API always works with the native file system, so there is no way to convert VFS's FileObject to a File without the file being present on the native file system.
But there is a way if your API can also work with an InputStream. Most libraries have overloaded methods that take InputStreams. In that case, the following should work:
InputStream is = testFile.getContent().getInputStream();
SampleApi api = new SampleApi(is); // SampleApi is a hypothetical API that accepts an InputStream
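To make that concrete, here is a hedged sketch (using Commons IO, which the original answer does not mention) of consuming the RAM file through an InputStream-based API:
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;

// Read the in-memory file back through its InputStream.
try (InputStream is = testFile.getContent().getInputStream()) {
    String content = IOUtils.toString(is, StandardCharsets.UTF_8);
    System.out.println(content); // prints "Hello, World!"
}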

How to write to HDFS programmatically?

So after 36 hours of experimenting with this and that, I have finally managed to get a cluster up and running, but now I am confused about how I can write files to it using Java. A tutorial said this program should be used, but I don't understand it at all, and it doesn't work either.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileWriteToHDFS {
    public static void main(String[] args) throws Exception {
        // Source file in the local file system
        String localSrc = args[0];
        // Destination file in HDFS
        String dst = args[1];

        // Input stream for the local file to be written to HDFS
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        // Get the configuration of the Hadoop cluster
        Configuration conf = new Configuration();
        System.out.println("Connecting to -- " + conf.get("fs.defaultFS"));

        // Destination file in HDFS
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst));

        // Copy the file from local storage to HDFS
        IOUtils.copyBytes(in, out, 4096, true);
        System.out.println(dst + " copied to HDFS");
    }
}
My confusion is: how does this piece of code identify the specifics of my cluster? How will it know where the masternode is and where the slavenodes are?
Furthermore, when I run this code, provide some local file as the source, and leave the destination blank (or provide only a file name), the program writes the file back to my local storage and not to the location I defined as the storage space for my namenode and datanodes. Should I be providing this path manually? How does this work? Please suggest a blog that can help me understand it better, or a minimal working example.
First off, you'll need to add some Hadoop libraries to your classpath. Without those, no, that code won't work.
How will it know where the masternode is and where the slavenodes are?
From new Configuration(); and the subsequent conf.get("fs.defaultFS").
It reads core-site.xml from the directory given by the HADOOP_CONF_DIR environment variable, which contains the address of the namenode. The client only needs to talk to the namenode to receive the locations of the datanodes that the file blocks are written to.
the program writes the file back to my local storage
It's not clear where you've configured the filesystem, but the default is file://, i.e. your local disk. You change this in core-site.xml. If you follow the Hadoop documentation, the pseudo-distributed cluster setup covers this.
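If core-site.xml is not on the client's classpath, a minimal hedged workaround (the namenode host and port below are placeholders, not values from this question) is to set fs.defaultFS on the Configuration yourself:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
// Placeholder address: use the namenode host/port from your cluster's core-site.xml.
conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
FileSystem fs = FileSystem.get(conf);
System.out.println("Connected to -- " + conf.get("fs.defaultFS"));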
It's also not clear why you need your own Java code when hdfs dfs -put will do the same thing.

Create File instance with URI pointing to HDFS

Is it possible to create a File instance by passing the URI of my HDFS to the File class's constructor? For example:
val conf = new Configuration()
conf.addResource(hdfsCoreSitePath)
conf.addResource(hdfsHDFSSitePath)
val uri = conf.get("fs.default.name")
val file = new File(uri + pathtothefile)
Then, with the File instance, I wish to access the file listing with the functions provided by the File class, such as file.list(), which returns an array of strings naming the files and directories in the directory denoted by this abstract pathname. I tried the code, but file.list() returns null.
I would rather not use the approach below, because I am trying to write the same codebase for both the normal file system and HDFS to keep the code reusable.
val fileSystem = FileSystem.get(conf)
val status = fileSystem.listStatus(new Path(filepath))
status.map(x => ...
The regular built-in Java/Scala File APIs will not work for HDFS files. The protocol and implementation are too different. You have to use the Hadoop API to access HDFS files as in your second example.
The good news, though, is that the Hadoop API will work for non-HDFS files (regular files). So that code is reusable. Just use a URI like: file:///foo/bar for a local file.
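As a hedged illustration of that reuse (written in Java to match the rest of this page; the namenode address and paths are placeholders):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListAnyFileSystem {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The same listing code handles both schemes; only the URI changes.
        for (String uri : new String[]{"hdfs://namenode-host:9000/data", "file:///tmp/data"}) {
            Path dir = new Path(uri);
            FileSystem fs = dir.getFileSystem(conf);
            for (FileStatus status : fs.listStatus(dir)) {
                System.out.println(status.getPath().getName());
            }
        }
    }
}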
fs.default.name is deprecated. Use fs.defaultFS instead, and make sure this property is available in the core-site.xml file that you are loading with the call below:
conf.addResource(hdfsCoreSitePath)
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/core-default.xml

Print the content of streams (Spark streaming) in Windows system

I just want to print the content of a stream to the console. I wrote the following code, but it does not print anything. Can anyone help me read a text file as a stream in Spark? Is there a problem related to the Windows system?
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public static void main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setAppName("My app")
            .setMaster("local[2]")
            .setSparkHome("C:\\Spark\\spark-1.5.1-bin-hadoop2.6")
            .set("spark.executor.memory", "2g");
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
    JavaDStream<String> dataStream = jssc.textFileStream("C://testStream//copy.csv");
    dataStream.print();
    jssc.start();
    jssc.awaitTermination();
}
UPDATE: The content of copy.csv is
0,0,12,5,0
0,0,12,5,0
0,1,2,0,42
0,0,0,0,264
0,0,12,5,0
textFileStream is for monitoring Hadoop-compatible directories. This operation watches the provided directory, and as you add new files to it, it reads/streams the data from those newly added files.
You cannot read existing text/CSV files with textFileStream; or rather, you do not need streaming at all if you are just reading files.
My suggestion would be to monitor a directory (maybe in HDFS or the local file system), then add files and capture the content of these new files using textFileStream.
In your code, you could replace "C://testStream//copy.csv" with "C://testStream", and once your Spark Streaming job is up and running, add the file copy.csv to the C://testStream folder and watch the output on the Spark console.
OR
you could write another command-line Scala/Java program which reads the files and writes the content over a socket (at a certain port), and then leverage socketTextStream to capture and read the data (a sketch of this is shown after this answer). Once you have read the data, you can apply further transformations or output operations.
You can also think of leveraging Flume.
Refer to the API documentation for more details.
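For the socket-based alternative above, a minimal hedged sketch (the host localhost and port 9999 are assumptions; something like nc -lk 9999 or your own program would feed the socket):
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SocketStreamExample {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("SocketStreamExample").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));

        // Reads lines sent by another process over the socket and prints each batch.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        lines.print();

        jssc.start();
        jssc.awaitTermination();
    }
}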
This worked for me on Windows 7 and Spark 1.6.3 (leaving out the rest of the code; the important part is how to define the folder to monitor):
val ssc = ...
val lines = ssc.textFileStream("file:///D:/tmp/data")
...
print
...
This monitors the directory D:/tmp/data; ssc is my streaming context.
Steps:
Create a file, say 1.txt, in D:/tmp/data
Enter some text
Start the Spark application
Rename the file to data.txt (I believe any arbitrary name will do, as long as it is changed while the directory is being monitored by Spark)
One other thing I noticed is that I had to change the line separator to Unix style (I used Notepad++); otherwise the file wasn't getting picked up.
Try the code below; it works:
JavaDStream<String> dataStream = jssc.textFileStream("file:///C:/testStream/");

Java Hadoop FileSystem object to File object

I have a Java code sample that uploads a file to S3
File f = new File("/home/myuser/test");
TransferManager transferManager = new TransferManager(credentials);
MultipleFileUpload upload = transferManager.uploadDirectory("mybucket","test_folder",f,true);
I would actually like to upload from HDFS to S3. I don't want to do anything complicated, so I was wondering if I can reuse the code that I already have. Is there a way to turn a Hadoop FileSystem object into a File object? Something like this:
FileSystem fs = ... // file system from hdfs path
File f = fs.toFile()
Thanks,
Serban
There is no way around downloading the HDFS file to your local file system if you want to use the File class, because a File can only represent a file on your local disk. However, you can use Hadoop's Path object to obtain an input stream to your file on HDFS:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// Load the Hadoop config files
conf.addResource(new Path("HADOOP_DIR/conf/core-site.xml"));
conf.addResource(new Path("HADOOP_DIR/conf/hdfs-site.xml"));

Path path = new Path("hdfs:///home/myuser/test");
FileSystem fs = path.getFileSystem(conf);
FSDataInputStream inputStream = fs.open(path);
// do whatever you want with the stream
inputStream.close();
fs.close();
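Building on that, here is a hedged sketch (not from the original answer) of feeding the HDFS stream straight into the TransferManager used in the question; the bucket name and key are placeholders, and credentials is assumed to be the same object as in the asker's snippet:
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.Upload;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsToS3 {
    static void upload(AWSCredentials credentials) throws Exception {
        Configuration conf = new Configuration();
        Path src = new Path("hdfs:///home/myuser/test"); // path from the question
        FileSystem fs = src.getFileSystem(conf);

        // The SDK needs the content length up front to avoid buffering the whole stream in memory.
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(fs.getFileStatus(src).getLen());

        TransferManager transferManager = new TransferManager(credentials);
        try (FSDataInputStream in = fs.open(src)) {
            Upload upload = transferManager.upload("mybucket", "test_folder/test", in, metadata);
            upload.waitForCompletion(); // blocks until the transfer finishes
        } finally {
            transferManager.shutdownNow();
        }
    }
}
This avoids any local copy: the HDFS input stream is handed to the TransferManager directly, at the cost of having to know the object length in advance.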
