I'm trying to copy a file from local to hdfs in these three ways:
FileSystem fs = FileSystem.get(context.getConfiguration());
LocalFileSystem lfs = fs.getLocal(context.getConfiguration());
lfs.copyFromLocalFile(new Path("file:///pathToFile/file.properties"), new Path("/destPath/"));
fs.copyFromLocalFile(new Path("file:///pathToFile/file.properties"), new Path("/destPath/"));
fs.copyFromLocalFile(new Path("file:///pathToFile/file.properties"), new Path("/destPath/"));
But none of them are working.
I always get a FileNotFound exception for /pathToFile/file.properties, but the file exists on that path on Unix and has read and write permissions for the user that runs the Map/Reduce.
Any ideas what I'm missing here?
Job is running with Ozzie
CDH4
Thank you very much for your help.
opalo
Where is this code running?
If this code is running in a map or reduce method (as it seems because you have a Context instance), then you are executing on one of your slave nodes. Can all of your slave nodes see this path or can only the cluster's login node see the file?
If this code is in fact supposed to be running in a mapper or reducer, and the file is not local to these machines (and you do not want to put the file(s) into hdfs with a "hdfs fs -put" command), one option you have is to deploy the file(s) with your job using the hadoop distributed cache. You can do this programmatically using the DistributedCache class's static method addCacheFile, or through the command line if your main class implements the Tool interface by using the -files switch.
Programmatically (Copied from the documentation linked to above):
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), ob);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar", job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz", job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz", job);
From the command line if your main class implements Tool interface:
hadoop jar Your.jar Package.Path.To.MainClass -files comma,seperated,list,of,files program_argument_list_here
Related
So after 36 hours of experimenting with this and that, I have finally managed to get a cluster up and running but now I am confused how I can write files to it using Java? A tutorial said this program should be used but I don't understand it at all and it doesn't work as well.
public class FileWriteToHDFS {
public static void main(String[] args) throws Exception {
//Source file in the local file system
String localSrc = args[0];
//Destination file in HDFS
String dst = args[1];
//Input stream for the file in local file system to be written to HDFS
InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
//Get configuration of Hadoop system
Configuration conf = new Configuration();
System.out.println("Connecting to -- "+conf.get("fs.defaultFS"));
//Destination file in HDFS
FileSystem fs = FileSystem.get(URI.create(dst), conf);
OutputStream out = fs.create(new Path(dst));
//Copy file from local to HDFS
IOUtils.copyBytes(in, out, 4096, true);
System.out.println(dst + " copied to HDFS");
}
}
My confusion is how does this piece of code identify specifics of my cluster? How will it know where the masternode is and where the slavenodes are?
Furthermore when I run this code and provide some local file in source and leave destination blank/or provide a file name only the program writes the file back to my local storage and not the location that I defined as storage space for my namenodes and datanodes. Should I be providing this path manually? How does this work? Please suggest some blog that can help me understand it better or can get working with a smallest example.
First off, you'll need to add some Hadoop libraries to your classpath. Without those, no, that code won't work.
How will it know where the masternode is and where the slavenodes are?
From the new Configuration(); and subsequent conf.get("fs.defaultFS").
It reads the core-site.xml of the HADOOP_CONF_DIR environment variable and returns the address of the namenode. The client only needs to talk to the namenode to receive the locations of the datanodes, from which file blocks are written to
the program writes the file back to my local storage
It's not clear where you've configured the filesystem, but the default is file://, your local disk. You change this in the core-site.xml. If you follow the Hadoop documentation, the pseudo distributed cluster setup mentions this
It's also not very clear why you need your own Java code when simply hdfs dfs -put will do the same thing
I want just to print the content of streams to console. I wrote the following code but it does not print anything. Anyone can help me to read text file as stream in Spark?? Is there a problem related to Windows system?
public static void main(String[] args) throws Exception {
SparkConf sparkConf = new SparkConf().setAppName("My app")
.setMaster("local[2]")
.setSparkHome("C:\\Spark\\spark-1.5.1-bin-hadoop2.6")
.set("spark.executor.memory", "2g");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
JavaDStream<String> dataStream = jssc.textFileStream("C://testStream//copy.csv");
dataStream.print();
jssc.start();
jssc.awaitTermination();
}
UPDATE: The content of copy.csv is
0,0,12,5,0
0,0,12,5,0
0,1,2,0,42
0,0,0,0,264
0,0,12,5,0
textFileStream is for Monitoring the hadoop Compatible Directories. This operation will watch the provided directory and as you add new files in the provided directory it will read/ stream the data from the newly added files.
You cannot read text/ csv files using textFileStream or rather I would say that you do not need streaming in case you are just reading the files.
My Suggestion would be to monitor some directory (may be HDFS or local file system) and then add files and capture the content of these new files using textFileStream.
May be in your code may be you can replace "C://testStream//copy.csv" with C://testStream" and once your Spark Streaming job is up and running then add file copy.csv to C://testStream folder and see the output on Spark Console.
OR
may be you can write another command line Scala/ Java program which read the files and throw the content over the Socket (at a certain PORT#) and next you can leverage socketTextStream for capturing and reading the data. Once you have read the data, you further apply other transformation or output operations.
You can also think of leveraging Flume too
Refer to API Documentation for more details
This worked for me on Windows 7 and Spark 1.6.3: (removing the rest of code, important one is how to define the folder to monitor)
val ssc = ...
val lines = ssc.textFileStream("file:///D:/tmp/data")
...
print
...
This monitors directory D:/tmp/data, ssc is my streaming context
Steps:
Create a file say 1.txt in D:/tmp/data
Enter some text
Start the spart application
Rename the file to data.txt (i believe any arbitrary name will do as long as it's changed while directory is monitored by spark)
One other thing I noticed is that I had to change the line separator to Unix style (used Notepad++) otherwise file wasn't getting picked up.
Try below code, it works:
JavaDStream<String> dataStream = jssc.textFileStream("file:///C:/testStream/");
I'm using Hadoop 2.7.1 and coding in Java. I'm able to run a simple mapreduce program where I provide a folder as input to the MapReduce program.
However I want to use a file (full paths are inside ) as input; this file contains all the other files to be processed by the mapper function.
Below is the file content,
/allfiles.txt
- /tmp/aaa/file1.txt
- /tmp/bbb/file2.txt
- /tmp/ccc/file3.txt
How can I specify the input path to MapReduce program as a file , so that it can start processing each file inside ? thanks.
In your driver class, you can read in the file, and add each line as a file for input:
//Read allfiles.txt and put each line into a List (requires at least Java 1.7)
List<String> files = Files.readAllLines(Paths.get("allfiles.txt"), StandardCharsets.UTF_8);
/Loop through the file names and add them as input
for(String file : files) {
//This Path is org.apache.hadoop.fs.Path
FileInputFormat.addInputPath(conf, new Path(file));
}
This is assuming that your allfiles.txt is local to the node on which your MR job is being run, but it's only a small change if allfiles.txt is actually on the HDFS.
I strongly recommended that you check that each file exists on the HDFS before you add it as input.
Instead of creating a file with path to other files, you could use globs.
In your example, you could have defined your inputs as -input /tmp/*/file?.txt
I am running a java application (app1) that runs multiple instances of another java application (app2) in parallel using ProcessBuilder and Fixed Thread Pool with the following code:
futures = new ArrayList<Future<String>>();
executor = Executors.newFixedThreadPool(10);
for (int i=0; i<cnt;i++)
{
ProcessBuilder builder = new ProcessBuilder();
builder.command("java","app2"); //I set the class path in the real code
builder.directory(new File("/Users/Evan/Documents/workspace/MyProj/ins_"+i));
builder.inheritIO();
ProcessRunner pr=new ProcessRunner (builder);
futures.add(executor.submit(pr));
}
Each run of app2 has different working directory (the value of i in the folder ins_i is different for each run). The code works, but now in app2, I want to read a file from the working directory of the current process (i.e. /Users/Evan/Documents/workspace/MyProj/ins_"+i). So in app2 code I need a statement that let me know the directory of the current process builder (basically the value of i in "/Users/Evan/Documents/workspace/MyProj/ins_"+i).
I tried this statement:
System.getProperty("user.dir")
and it did not work because it retrieved "/Users/Evan/Documents/workspace/MyProj"
Thank you and I appreciate any help
but now in app2, I want to read a file from the working directory of the current process
If you know the filename you want to read from within the working directory of app2, you can just create a File object, passing the file name of the file you wish to read. The parent folder defaults to the current working directory (which is defined by the ProcessBuilder)
File myFile = new File("FileToRead");
//read from myFile using FileInputStream
The issue I'm having is that the hadoop jar command requires an input path, but my MapReduce job gets its input from a database and hence doesn't need/have an input directory. I've set the JobConf inputformat to DBInputFormat, but how do I signify this when jarring my job?
//Here is the command
hadoop jar <my-jar> <hdfs input> <hdfs output>
I have an output folder, but don't need an input folder. Is there a way to circumvent this? Do I need to write a second program that pulls the DB data into a folder and then use that in the MapReduce job?
The hadoop jar command requires no command line arguments, other than maybe the main class. The command line arguments for your map/reduce job will be decided by the program itself. So if it no longer requires an HDFS input path, then you would need to change the code to not require that.
public class MyJob extends Configured implements Tool
{
public void run(String[] args) throws Exception {
// ...
TextInputFormat.setInputPaths(job, new Path(args[0])); // or some other file input format
TextOutputFormat.setOutputPath(job, new Path(args[1]));
}
}
So you would remove the input path statement. There is no magic in JAR'ing the job up, just change the InputFormat (which you said you did) and you should be set.