How to write to HDFS programmatically? - java

So after 36 hours of experimenting with this and that, I have finally managed to get a cluster up and running, but now I am confused about how I can write files to it using Java. A tutorial said this program should be used, but I don't understand it at all and it doesn't work either.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileWriteToHDFS {
    public static void main(String[] args) throws Exception {
        // Source file in the local file system
        String localSrc = args[0];
        // Destination file in HDFS
        String dst = args[1];

        // Input stream for the local file to be written to HDFS
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        // Get configuration of the Hadoop system
        Configuration conf = new Configuration();
        System.out.println("Connecting to -- " + conf.get("fs.defaultFS"));

        // Destination file in HDFS
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst));

        // Copy file from local to HDFS
        IOUtils.copyBytes(in, out, 4096, true);
        System.out.println(dst + " copied to HDFS");
    }
}
My confusion is: how does this piece of code identify the specifics of my cluster? How will it know where the masternode is and where the slavenodes are?
Furthermore, when I run this code and provide some local file as the source and leave the destination blank (or provide only a file name), the program writes the file back to my local storage and not to the location that I defined as storage space for my namenode and datanodes. Should I be providing this path manually? How does this work? Please suggest some blog that can help me understand it better or get me working with the smallest possible example.

First off, you'll need to add some Hadoop libraries to your classpath. Without those, no, that code won't work.
How will it know where the masternode is and where the slavenodes are?
From new Configuration() and the subsequent conf.get("fs.defaultFS").
It reads core-site.xml from the directory pointed to by the HADOOP_CONF_DIR environment variable and returns the address of the namenode. The client only needs to talk to the namenode to receive the locations of the datanodes, to which the file blocks are then written.
the program writes the file back to my local storage
It's not clear where you've configured the filesystem, but the default is file://, your local disk. You change this in core-site.xml. If you follow the Hadoop documentation, the pseudo-distributed cluster setup mentions this.
It's also not very clear why you need your own Java code when simply hdfs dfs -put will do the same thing.
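If you want to check what the client will connect to without relying on core-site.xml being on the classpath, you can also set fs.defaultFS explicitly on the Configuration. This is only a rough sketch; hdfs://namenode-host:8020 is a placeholder for your cluster's actual namenode address.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExplicitNameNode {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address -- use the fs.defaultFS value from your cluster's core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        // With fs.defaultFS set, scheme-less paths resolve against HDFS, not file://
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        System.out.println("/user exists? " + fs.exists(new Path("/user")));
    }
}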

Related

Permission denied accessing HDFS via Hadoop java API

As part of a jar run through hadoop, I want to implement a simple function that (a) creates a file if it doesn't exist, and (b) appends the bytes of a string passed in as a new line to this file.
I wrote the following:
import static org.apache.hadoop.fs.CreateFlag.*;

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.fs.Options.CreateOpts;
import org.apache.hadoop.fs.permission.*;
import org.apache.hadoop.io.IOUtils;

public class FSFacade {
    private static FileContext fc = FileContext.getFileContext();

    public static void appendRawText(Path p, String data) throws IOException {
        InputStream is = new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8));
        FsPermission permissions = new FsPermission(FsAction.ALL, FsAction.ALL, FsAction.ALL);
        OutputStream os = fc.create(p,
                EnumSet.of(CREATE, APPEND),
                CreateOpts.perms(permissions),
                CreateOpts.createParents());
        IOUtils.copyBytes(is, os, new Configuration());
    }
}
This code works fine in Eclipse, but when I try to run it against HDFS via hadoop jar it raises either of the following exceptions:
java.io.FileNotFoundException: /out (Permission denied)
java.io.FileNotFoundException: /results/out (no such file or directory)
I assume the first one is raised because my process doesn't have permissions to write to the root of the HDFS. The second one probably means my code somehow doesn't create the file if it doesn't exist yet.
How can I make sure, programmatically, that my process
(a) has all the appropriate permissions to write to the Path passed in? (I presume that means execute permissions on all folders in the path and write permissions on the last one?)
(b) indeed creates the file if it doesn't exist yet, as I expected EnumSet.of(CREATE, APPEND) to do?
You can use the following command to grant write permission on HDFS:
> hdfs dfs -chmod -R 777 /*
* means the permissions will be applied to all folders
777 enables all permissions (read, write and execute)
Hope it helps !!
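Alternatively, if opening up permissions cluster-wide with chmod 777 is too broad, here is a rough sketch of handling both (a) and (b) from Java, assuming you write below a directory you own. The /results/out path mirrors the one from the question; the permission bits are only an example.
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class AppendWithPerms {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Writing under /results instead of the HDFS root avoids the "permission denied" case
        Path target = new Path("/results/out");
        FsPermission perms = new FsPermission(FsAction.ALL, FsAction.ALL, FsAction.ALL);

        // Make sure the parent directories exist before touching the file
        fs.mkdirs(target.getParent(), perms);

        FSDataOutputStream out;
        if (fs.exists(target)) {
            out = fs.append(target);   // file already there: append
        } else {
            out = fs.create(target);   // otherwise create it first
            fs.setPermission(target, perms);
        }
        out.write("a new line of data\n".getBytes(StandardCharsets.UTF_8));
        out.close();
    }
}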

Reading a temporary properties file stored on Jenkins by the Credential Binding Plugin

I use the Secret File functionality of the Credentials Binding Plugin on Jenkins:
Copies the file given in the credentials to a temporary location, then sets the variable to that location. (The file is deleted when the build completes.)
and I'm trying the following thing:
String propertiesTempFilepath = "/" + System.getenv(envVariable);
InputStream input = getClass().getClassLoader().getResourceAsStream(propertiesTempFilepath);
In the end, input remains null, and I'm left with a NullPointerException when I try to load a Properties object with it. Where does the plugin actually store the properties file and can I access it with the getResourceAsStream method or Java code at all?
(I would actually appreciate any advice on how to load that Properties object from a secret file, but I'm not familiar with writing/running shell scripts, so the main tutorial confused me.)
Solved it myself by using FileInputStream instead:
FileInputStream input = new FileInputStream(new File(propertiesEnvName));
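For completeness, a rough sketch of the full working approach. The environment variable name MY_SECRET_FILE is only a placeholder for whatever variable you configured in the binding; the plugin stores the secret as a plain file on the agent's filesystem, so it has to be read with file I/O rather than getResourceAsStream, which only searches the classpath.
import java.io.FileInputStream;
import java.util.Properties;

public class SecretFileLoader {
    public static void main(String[] args) throws Exception {
        // Placeholder variable name -- use the one configured in the credential binding
        String propertiesTempFilepath = System.getenv("MY_SECRET_FILE");

        Properties props = new Properties();
        try (FileInputStream input = new FileInputStream(propertiesTempFilepath)) {
            props.load(input);
        }
        System.out.println("Loaded " + props.size() + " properties");
    }
}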

Print the content of streams (Spark streaming) in Windows system

I just want to print the content of streams to the console. I wrote the following code but it does not print anything. Can anyone help me read a text file as a stream in Spark? Is there a problem related to the Windows system?
public static void main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setAppName("My app")
            .setMaster("local[2]")
            .setSparkHome("C:\\Spark\\spark-1.5.1-bin-hadoop2.6")
            .set("spark.executor.memory", "2g");
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
    JavaDStream<String> dataStream = jssc.textFileStream("C://testStream//copy.csv");
    dataStream.print();
    jssc.start();
    jssc.awaitTermination();
}
UPDATE: The content of copy.csv is
0,0,12,5,0
0,0,12,5,0
0,1,2,0,42
0,0,0,0,264
0,0,12,5,0
textFileStream is for monitoring Hadoop-compatible directories. This operation will watch the provided directory, and as you add new files to it, it will read/stream the data from the newly added files.
You cannot read existing text/CSV files using textFileStream, or rather, you do not need streaming if you are just reading the files.
My suggestion would be to monitor some directory (maybe on HDFS or the local file system), then add files and capture the content of these new files using textFileStream.
Maybe in your code you can replace "C://testStream//copy.csv" with "C://testStream", and once your Spark Streaming job is up and running, add the file copy.csv to the C://testStream folder and see the output on the Spark console.
OR
Maybe you can write another command-line Scala/Java program which reads the files and throws the content over a socket (at a certain port), and then leverage socketTextStream for capturing and reading the data. Once you have read the data, you can apply further transformations or output operations.
You can also think of leveraging Flume.
Refer to the API documentation for more details.
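For the socket approach mentioned above, here is a rough sketch of the receiving side. The host and port (localhost:9999) are placeholders, and some other process has to be writing lines to that socket for anything to print.
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SocketStreamExample {
    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("Socket stream").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));

        // Placeholder host/port -- a separate program must be writing lines to this socket
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        lines.print();

        jssc.start();
        jssc.awaitTermination();
    }
}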
This worked for me on Windows 7 and Spark 1.6.3 (I removed the rest of the code; the important part is how to define the folder to monitor):
val ssc = ...
val lines = ssc.textFileStream("file:///D:/tmp/data")
...
print
...
This monitors the directory D:/tmp/data; ssc is my streaming context.
Steps:
Create a file say 1.txt in D:/tmp/data
Enter some text
Start the Spark application
Rename the file to data.txt (I believe any arbitrary name will do, as long as it's changed while the directory is monitored by Spark)
One other thing I noticed is that I had to change the line separator to Unix style (used Notepad++) otherwise file wasn't getting picked up.
Try the code below, it works:
JavaDStream<String> dataStream = jssc.textFileStream("file:///C:/testStream/");

Copy file from local

I'm trying to copy a file from local to hdfs in these three ways:
FileSystem fs = FileSystem.get(context.getConfiguration());
LocalFileSystem lfs = fs.getLocal(context.getConfiguration());
lfs.copyFromLocalFile(new Path("file:///pathToFile/file.properties"), new Path("/destPath/"));
fs.copyFromLocalFile(new Path("file:///pathToFile/file.properties"), new Path("/destPath/"));
fs.copyFromLocalFile(new Path("file:///pathToFile/file.properties"), new Path("/destPath/"));
But none of them are working.
I always get a FileNotFound exception for /pathToFile/file.properties, but the file exists on that path on Unix and has read and write permissions for the user that runs the Map/Reduce.
Any ideas what I'm missing here?
The job is running with Oozie
CDH4
Thank you very much for your help.
opalo
Where is this code running?
If this code is running in a map or reduce method (as it seems because you have a Context instance), then you are executing on one of your slave nodes. Can all of your slave nodes see this path or can only the cluster's login node see the file?
If this code is in fact supposed to run in a mapper or reducer, and the file is not local to those machines (and you do not want to put the file(s) into HDFS with an "hdfs dfs -put" command), one option is to ship the file(s) with your job using the Hadoop distributed cache. You can do this programmatically using the DistributedCache class's static method addCacheFile, or through the command line (if your main class implements the Tool interface) by using the -files switch.
Programmatically (copied from the DistributedCache documentation):
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
From the command line, if your main class implements the Tool interface:
hadoop jar Your.jar Package.Path.To.MainClass -files comma,separated,list,of,files program_argument_list_here
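To round this out, here is a rough sketch of how a mapper could then read the cached file on the task node. The lookup.dat name and the mapper's key/value types are only illustrative; it uses the old mapred API to match the JobConf example above.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private String firstLookupLine;

    @Override
    public void configure(JobConf job) {
        try {
            // Files shipped with -files or addCacheFile() appear as local paths here
            Path[] cached = DistributedCache.getLocalCacheFiles(job);
            if (cached != null && cached.length > 0) {
                try (BufferedReader reader = new BufferedReader(
                        new FileReader(cached[0].toString()))) {
                    firstLookupLine = reader.readLine();
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("Could not read cached file", e);
        }
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // Illustrative use of the cached data
        output.collect(new Text(firstLookupLine == null ? "no-lookup" : firstLookupLine), value);
    }
}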

Where is a Java File instance stored?

I have a question regarding the Java File class. When I create a File instance, for example,
File aFile = new File(path);
Where is the instance aFile stored on the computer? Is it stored in the JVM? I mean, is there a temp file stored on the local disk?
If I have an InputStream instance and write it to a file using an OutputStream, for example:
File aFile = new File("test.txt");
OutputStream anOutputStream = new FileOutputStream(aFile);
byte[] aBuffer = new byte[1024];
int iLength;
while ((iLength = anInputStream.read(aBuffer)) > 0) {
    anOutputStream.write(aBuffer, 0, iLength);
}
Now where does the file test.txt store?
Thanks in advance!
A File object isn't a real file at all - it's really just a filename/location, and methods which hook into the file system to check whether or not the file really exists etc. There's no content directly associated with the File instance - it's not like it's a virtual in-memory file, for example. The instance itself is just an object in memory like any other object.
Creating a File instance on its own does nothing to the file system.
When you create a FileOutputStream, however, that does affect whatever file system you're writing to. The File instance is relatively irrelevant though - you'd get the same effect from:
OutputStream anOutputStream = new FileOutputStream("test.txt");
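A small sketch that makes the distinction concrete (demo.txt is just an example name): constructing a File touches nothing on disk, while opening a FileOutputStream actually creates the file.
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class FileVsStream {
    public static void main(String[] args) throws Exception {
        // Just an in-memory path descriptor; nothing is created on disk yet
        File aFile = new File("demo.txt");
        System.out.println("Exists after new File(): " + aFile.exists());

        // Opening the stream actually creates the file in the current working directory
        try (OutputStream out = new FileOutputStream(aFile)) {
            out.write("hello".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Exists after FileOutputStream: " + aFile.exists());
        System.out.println("Absolute path: " + aFile.getAbsolutePath());
    }
}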
It will write the file where you specify it with the path argument.
In your case, it will write it in the directory where you run your Java class.
If you specify /test/myproject/myfile.txt, it will go in /test/myproject/myfile.txt.
If you don't provide a path, it is stored in the current working directory (i.e. the directory the JVM was launched from). If you provide a full path, it is stored there.
Regardless, it is always stored in the filesystem, not in JVM memory.
