The issue I'm having is that the hadoop jar command requires an input path, but my MapReduce job gets its input from a database and therefore doesn't need (or have) an input directory. I've set the JobConf's input format to DBInputFormat, but how do I indicate this when submitting my jarred job?
//Here is the command
hadoop jar <my-jar> <hdfs input> <hdfs output>
I have an output folder, but don't need an input folder. Is there a way to circumvent this? Do I need to write a second program that pulls the DB data into a folder and then use that in the MapReduce job?
The hadoop jar command itself requires no command line arguments other than the jar and, optionally, the main class. Any further command line arguments are interpreted by your program, not by Hadoop. So if your job no longer requires an HDFS input path, you need to change your driver code so it stops expecting one.
public class MyJob extends Configured implements Tool
{
    public int run(String[] args) throws Exception {
        // ...
        TextInputFormat.setInputPaths(job, new Path(args[0])); // or some other file input format
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        // ...
    }
}
So you would remove the input path line. There is no magic in jarring the job up; just change the InputFormat (which you said you already did) and you should be set.
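For illustration, here is a minimal driver sketch using the new-API DBInputFormat, where only the output path comes from the command line. The JDBC settings, the my_table table, its columns, and the MyRecord class are all hypothetical placeholders, not something from your setup:
import java.io.*;
import java.sql.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.*;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.*;

public class MyDbJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // JDBC connection details (placeholder values)
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/mydb", "dbuser", "dbpassword");

        Job job = Job.getInstance(conf, "db-input-job");
        job.setJarByClass(MyDbJob.class);
        job.setInputFormatClass(DBInputFormat.class);

        // Table and column names are placeholders
        DBInputFormat.setInput(job, MyRecord.class, "my_table",
                null /* conditions */, "id" /* orderBy */, "id", "payload");

        // Set your mapper, reducer, and output key/value classes here as usual.

        // Only the output path is taken from the command line now
        TextOutputFormat.setOutputPath(job, new Path(args[0]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    // Hypothetical record type; a real one would map your actual table columns
    public static class MyRecord implements Writable, DBWritable {
        long id;
        String payload;
        public void write(DataOutput out) throws IOException { out.writeLong(id); out.writeUTF(payload); }
        public void readFields(DataInput in) throws IOException { id = in.readLong(); payload = in.readUTF(); }
        public void readFields(ResultSet rs) throws SQLException { id = rs.getLong("id"); payload = rs.getString("payload"); }
        public void write(PreparedStatement st) throws SQLException { st.setLong(1, id); st.setString(2, payload); }
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyDbJob(), args));
    }
}
With this driver, the submit command is just hadoop jar <my-jar> <hdfs output>.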
Related
I'm using Hadoop 2.7.1 and coding in Java. I'm able to run a simple MapReduce program where I provide a folder as input.
However, I want to use a file as input; this file contains the full paths of all the other files to be processed by the mapper function.
Below is the file content:
/allfiles.txt
- /tmp/aaa/file1.txt
- /tmp/bbb/file2.txt
- /tmp/ccc/file3.txt
How can I specify the input path to the MapReduce program as a file, so that it can start processing each file listed inside it? Thanks.
In your driver class, you can read in the file, and add each line as a file for input:
//Read allfiles.txt and put each line into a List (requires at least Java 1.7)
List<String> files = Files.readAllLines(Paths.get("allfiles.txt"), StandardCharsets.UTF_8);

//Loop through the file names and add them as input
for (String file : files) {
    //This Path is org.apache.hadoop.fs.Path
    FileInputFormat.addInputPath(conf, new Path(file));
}
This is assuming that your allfiles.txt is local to the node on which your MR job is being run, but it's only a small change if allfiles.txt is actually on the HDFS.
I strongly recommend that you check that each file exists on the HDFS before you add it as input.
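For illustration, here is a minimal sketch of the HDFS variant that also does that existence check. It assumes the list file lives at /allfiles.txt on HDFS and that you have a new-API Job named job (substitute your JobConf/conf for the old API):
// Read the list file from HDFS instead of the local disk
FileSystem fs = FileSystem.get(job.getConfiguration());
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/allfiles.txt")), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        if (line.trim().isEmpty()) {
            continue;
        }
        Path input = new Path(line.trim());
        // Only add files that actually exist on HDFS
        if (fs.exists(input)) {
            FileInputFormat.addInputPath(job, input);
        }
    }
}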
Instead of creating a file with the paths to the other files, you could use globs.
In your example, you could have defined your inputs as -input /tmp/*/file?.txt
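In a Java driver the same idea looks roughly like the line below; FileInputFormat expands glob patterns in input paths itself, and the pattern is just an assumption matching the layout above:
// A glob input path; matches /tmp/aaa/file1.txt, /tmp/bbb/file2.txt, /tmp/ccc/file3.txt
FileInputFormat.addInputPath(job, new Path("/tmp/*/file?.txt"));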
I'm trying to copy a file from local to hdfs in these three ways:
FileSystem fs = FileSystem.get(context.getConfiguration());
LocalFileSystem lfs = fs.getLocal(context.getConfiguration());
lfs.copyFromLocalFile(new Path("file:///pathToFile/file.properties"), new Path("/destPath/"));
fs.copyFromLocalFile(new Path("file:///pathToFile/file.properties"), new Path("/destPath/"));
fs.copyFromLocalFile(new Path("file:///pathToFile/file.properties"), new Path("/destPath/"));
But none of them are working.
I always get a FileNotFound exception for /pathToFile/file.properties, but the file exists on that path on Unix and has read and write permissions for the user that runs the Map/Reduce.
Any ideas what I'm missing here?
The job is running with Oozie on CDH4.
Thank you very much for your help.
opalo
Where is this code running?
If this code is running in a map or reduce method (as it seems because you have a Context instance), then you are executing on one of your slave nodes. Can all of your slave nodes see this path or can only the cluster's login node see the file?
If this code is in fact supposed to run in a mapper or reducer, and the file is not local to those machines (and you do not want to put the file(s) into HDFS with an "hdfs dfs -put" command), one option is to deploy the file(s) with your job using the Hadoop distributed cache. You can do this programmatically using the DistributedCache class's static addCacheFile method, or from the command line, if your main class implements the Tool interface, by using the -files switch.
Programmatically (copied from the DistributedCache documentation):
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
From the command line, if your main class implements the Tool interface:
hadoop jar Your.jar Package.Path.To.MainClass -files comma,separated,list,of,files program_argument_list_here
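To read the cached file back on the task side: the #lookup.dat fragment (and likewise the -files switch) symlinks the file into the task's working directory, so a mapper or reducer can simply open it by name. A minimal sketch, assuming the new-API Mapper and a line-oriented lookup.dat:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // "lookup.dat" is the symlink created in the task's working directory
    try (BufferedReader reader = new BufferedReader(new FileReader("lookup.dat"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // parse each line as needed; the format is application-specific
        }
    }
}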
I have a MapReduce Mapper. This Mapper should use some set of read-only parameters.
Let's imagine that I want to count occurrences of some substrings (titles of something) in input lines.
I have a list of pairs: "some title" => "a regular expression to extract this title from an input line".
These pairs are stored in an ordinary text file.
What is the best way to pass this file to the Mapper?
I have only this idea:
Upload the file with the pairs to HDFS.
Pass the path to the file using -Dpath.to.file.with.properties.
In a static {} block of the mapper, read the file and populate a map of pairs "some title" => "regular expression for the title".
Is this good or bad? Please advise.
You're on track, but I would recommend using the distributed cache. Its purpose is exactly this: passing read-only files to task nodes.
Put the file in HDFS.
Add that file to the distributed cache in the main method of your application.
In the Mapper class, override either the configure or setup method, depending on which version of the API you are using. In that method you can read the file from the distributed cache and store everything in memory (see the sketch below).
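As a rough sketch of steps 2 and 3 with the newer Hadoop 2.x API (job.addCacheFile and context.getCacheFiles rather than the DistributedCache class); the HDFS path /config/titles.txt, the TitleMapper name, and the "title => regex" line format are assumptions:
// Step 2, in the driver: register the HDFS file with the distributed cache
job.addCacheFile(new URI("/config/titles.txt"));

// Step 3, in the mapper: load the patterns once per task
public class TitleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Pattern> titlePatterns = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        URI[] cacheFiles = context.getCacheFiles();
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(cacheFiles[0]))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("=>", 2);   // "some title" => "regex"
                if (parts.length == 2) {
                    titlePatterns.put(parts[0].trim(), Pattern.compile(parts[1].trim()));
                }
            }
        }
    }
}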
Here is a part of my code.
See the script that copies the files to HDFS and launches the MR job. I upload this script to the Hadoop node during the Maven integration-test phase using the Ant scp and ssh targets.
#dummy script for running mr-job
hadoop fs -rm -r /HttpSample/output
hadoop fs -rm -r /HttpSample/metadata.csv
hadoop fs -rm -r /var/log/hadoop-yarn/apps/cloudera/logs
#hadoop hadoop dfs -put /home/cloudera/uploaded_jars/metadata.csv /HttpSample/metadata.csv
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/metadata.csv /HttpSample/metadata.csv
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/opencsv.jar /HttpSample/opencsv.jar
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/gson.jar /HttpSample/gson.jar
#Run mr job
cd /home/cloudera/uploaded_jars
#hadoop jar scoring-job.jar ru.megalabs.mapreduce.scoringcounter.Main -libjars gson.jar -files hdfs://0.0.0.0:8020/HttpSample/metadata.csv -libjars hdfs://0.0.0.0:8020/HttpSample/opencsv.jar, hdfs://0.0.0.0:8020/HttpSample/gson.jar /HttpSample/raw_traffic.json /HttpSample/output/scoring_result
hadoop jar scoring-job.jar ru.megalabs.mapreduce.scoringcounter.Main -files hdfs://0.0.0.0:8020/HttpSample/metadata.csv -libjars hdfs://0.0.0.0:8020/HttpSample/opencsv.jar,hdfs://0.0.0.0:8020/HttpSample/gson.jar /HttpSample/raw_traffic.json /HttpSample/output/scoring_result
And the code inside Mapper:
public class ScoringCounterMapper extends Mapper<LongWritable, Text, GetReq, IntWritable> {

    private static final Log LOG = LogFactory.getLog(ScoringCounterMapper.class);

    private static final String METADATA_CSV = "metadata.csv";

    private List<RegexMetadata> regexMetadatas = null;

    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //bla-bla-bla
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        MetadataCsvReader metadataCsvReader = new MetadataCsvReader(new File(METADATA_CSV));
        regexMetadatas = metadataCsvReader.getMetadata();
        for (RegexMetadata rm : regexMetadatas) {
            LOG.info(rm);
        }
    }
}
Note that:
1. I upload my metadata file to the node.
2. I put it into HDFS.
3. I provide the path to the file using the -files argument.
4. I specify that the file is in HDFS (hdfs://0.0.0.0:8020/HttpSample/metadata.csv).
I am trying to add a text file to a zip archive from a Java program on Linux. The program spawns a process (using java.lang.Process) to execute the command line zip -j <zipfile> <textfile>, reads the output and error streams of the spawned process, and waits for the process to complete using waitFor().
Although the program seems to run fine (the spawned process exits with exit code 0, indicating that the zip command line was executed successfully) and the output read from the output and error streams does not indicate any errors, at the end of the program the zip archive doesn't always contain the file that was supposed to have been added. The problem doesn't happen consistently (even with the same existing archive and file to add); once in a while (perhaps once in 4 attempts) the zip is found to have been updated correctly. Strangely, the problem doesn't occur at all when the program is run through the Eclipse debugger.
Any pointers on why this problem occurs and how it can be addressed would be helpful. Thanks!
Below is the code snippet. The program calls addFileToZip(File, File, String):
public static void addFileToZip(final File zipFile, final File fileToBeAdded,
        final String fileNameToBeAddedAs) throws Exception {
    File tempDir = createTempDir();
    File fileToBeAddedAs = new File(tempDir, fileNameToBeAddedAs);
    try {
        FileUtils.copyFile(fileToBeAdded, fileToBeAddedAs);
        addFileToZip(zipFile, fileToBeAddedAs);
    } finally {
        deleteFile(fileToBeAddedAs);
        deleteFile(tempDir);
    }
}

public static void addFileToZip(final File zipFile, final File fileToBeAdded) throws Exception {
    final String[] command = {"zip", "-j", zipFile.getAbsolutePath(), fileToBeAdded.getAbsolutePath()};
    ProcessBuilder procBuilder = new ProcessBuilder(command);
    Process proc = procBuilder.start();
    int exitCode = proc.waitFor();
    /*
     * Code to read the output/error streams of proc and log/print them
     * (this is also where errMsg below gets populated)
     */
    if (exitCode != 0) {
        throw new Exception("Unable to add file, error: " + errMsg);
    }
}
Make sure no other process has the zip file locked for write, or the file being added locked for read. If you're generating the file to be added, make sure the stream is flushed and closed before spawning the zip utility.
I am trying to add a text file to a zip archive through a Java program on Linux.
Use the java.util.zip API, which:
Provides classes for reading and writing the standard ZIP and GZIP file formats.
If you intend to stick with using a Process to do this, be sure to implement all the suggestions of When Runtime.exec() won't.
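For illustration, here is a minimal sketch of adding a file to an existing archive with java.util.zip. Since the ZIP format can't be appended to in place through this API, the archive is rewritten to a temporary file and then moved over the original; the class, method, and variable names are my own, not from the question:
import java.io.*;
import java.nio.file.*;
import java.util.Enumeration;
import java.util.zip.*;

public class ZipAppend {

    // Rewrites zipFile with all of its existing entries plus fileToAdd stored as entryName.
    public static void addFileToZip(File zipFile, File fileToAdd, String entryName) throws IOException {
        File tmp = File.createTempFile("zip-append", ".zip");
        byte[] buffer = new byte[8192];
        try (ZipFile existing = new ZipFile(zipFile);
             ZipOutputStream out = new ZipOutputStream(new FileOutputStream(tmp))) {
            // Copy every existing entry into the new archive.
            Enumeration<? extends ZipEntry> entries = existing.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.getName().equals(entryName)) {
                    continue; // an old copy of this entry is replaced below
                }
                out.putNextEntry(new ZipEntry(entry.getName()));
                try (InputStream in = existing.getInputStream(entry)) {
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                }
                out.closeEntry();
            }
            // Append the new file as its own entry.
            out.putNextEntry(new ZipEntry(entryName));
            try (InputStream in = new FileInputStream(fileToAdd)) {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
            }
            out.closeEntry();
        }
        // Replace the original archive with the rewritten one.
        Files.move(tmp.toPath(), zipFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }
}
Doing it in-process avoids depending on an external zip binary and sidesteps the stream-handling pitfalls of Runtime.exec entirely.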
I am trying to compress and archive all the files in a folder using the Java Runtime class. My code snippet looks like this:
public static void compressFileRuntime() throws IOException, InterruptedException {
    String date = Util.getDateAsString("yyyy-MM-dd");
    Runtime rt = Runtime.getRuntime();
    String archivedFile = "myuserData" + date + ".tar.bz2";
    String command = "tar --remove-files -cjvf " + archivedFile + " marketData*";
    File f = new File("/home/amit/Documents/");
    Process pr = rt.exec(command, null, f);
    System.out.println("Exit value: " + pr.exitValue());
}
The above code doesn't archive and compress the file as expected, though it creates a file myuserData2009-11-18.tar.bz2 in the folder "/home/amit/Documents/".
Also the output is
Exit value: 2.
However, if I execute the same command from the command line, it gives the expected result.
Please tell me what I am missing.
Thanks
Amit
The problem lies in this part:
" marketData*"
You expect the names of the files to be compressed to be globbed from the * wildcard. Globbing is done by the shell, not by the tools themselves, so your choices are to either:
enumerate the files to be archived yourself
start the shell to perform the command ("/bin/sh -c")
start tar on the folder containing the files to be archived
Edit:
For the shell option, pass the command to exec as a String[] so that the whole tar command line reaches the shell intact (Runtime.exec does not interpret quotes when given a single String):
String[] command = {"sh", "-c", "tar --remove-files -cjvf " + archivedFile + " marketData*"};
Process pr = rt.exec(command, null, f);
(The shell then expands the glob itself; don't quote the pattern or the shell won't expand it.)
If you really want to create a bzip2 archive, I'd use a Java implementation instead of a native command, which is good for portability; for example, the one available at http://www.kohsuke.org/bzip2/ (it is not really optimized though; compression seems to be slower than with Java LZMA).
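As an illustration of that pure-Java route, here is a minimal sketch using Apache Commons Compress (a different library from the one linked above, chosen here as an assumption) to build the .tar.bz2, doing the globbing in Java; paths and the marketData prefix mirror the question:
import java.io.*;
import java.nio.file.*;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

public class TarBz2Example {

    public static void main(String[] args) throws IOException {
        File dir = new File("/home/amit/Documents");
        File archive = new File(dir, "myuserData.tar.bz2");

        // Do the globbing ourselves: pick every file starting with "marketData".
        File[] files = dir.listFiles((d, name) -> name.startsWith("marketData"));
        if (files == null) {
            return; // directory does not exist or is not readable
        }

        try (TarArchiveOutputStream tar = new TarArchiveOutputStream(
                new BZip2CompressorOutputStream(
                        new BufferedOutputStream(new FileOutputStream(archive))))) {
            for (File f : files) {
                // Store each file under its simple name, like tar -cjvf ... marketData*
                tar.putArchiveEntry(new TarArchiveEntry(f, f.getName()));
                Files.copy(f.toPath(), tar);
                tar.closeArchiveEntry();
            }
        }
    }
}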