I have a MapReduce Mapper that should use a set of read-only parameters.
Let's imagine that I want to count occurrences of some substrings (titles of something) in the input lines.
I have a list of pairs: "some title" => "a regular expression to extract this title from an input line".
These pairs are stored in a plain text file.
What is the best way to pass this file to the Mapper?
The only idea I have is:
1. Upload the file with the pairs to HDFS.
2. Pass the path to the file using -Dpath.to.file.with.properties.
3. In a static {} section of the mapper, read the file and populate the map of pairs "some title" => "regular expression for the title".
Is this good or bad? Please advise.
You're on the right track, but I would recommend using the distributed cache. Its purpose is exactly this: passing read-only files to task nodes.
1. Put the file in HDFS.
2. Add that file to the distributed cache in the main method of your application (see the sketch below).
3. In the Mapper class, override either the configure or setup method, depending on which version of the API you are using. In that method, read the file from the distributed cache and store everything in memory.
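A minimal sketch of steps 2 and 3, assuming the newer org.apache.hadoop.mapreduce API; the class names, the HDFS path and the "titles.txt" file name are placeholders, not taken from your setup:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class TitleCountDriver {

    public static class TitleMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final Map<String, String> titleRegexes = new HashMap<String, String>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The cached file is symlinked into the task's working directory under
            // the name given in the URI fragment, so it can be opened as a local file.
            BufferedReader reader = new BufferedReader(new FileReader("titles.txt"));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] pair = line.split("=>", 2);
                    if (pair.length == 2) {
                        titleRegexes.put(pair[0].trim(), pair[1].trim());
                    }
                }
            } finally {
                reader.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // match value against titleRegexes and emit counts here
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "title-count");
        job.setJarByClass(TitleCountDriver.class);
        job.setMapperClass(TitleMapper.class);
        // Register the HDFS file with the distributed cache; the "#titles.txt"
        // fragment controls the symlink name the tasks see.
        job.addCacheFile(new URI("/HttpSample/titles.txt#titles.txt"));
        // ... set input/output formats, paths and output types, then:
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}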
Here is a part of my code.
See the script below that copies the files to HDFS and launches the MR job. I upload this script to the Hadoop node during the Maven integration-test phase using the Ant scp and ssh targets.
#dummy script for running mr-job
hadoop fs -rm -r /HttpSample/output
hadoop fs -rm -r /HttpSample/metadata.csv
hadoop fs -rm -r /var/log/hadoop-yarn/apps/cloudera/logs
#hadoop hadoop dfs -put /home/cloudera/uploaded_jars/metadata.csv /HttpSample/metadata.csv
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/metadata.csv /HttpSample/metadata.csv
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/opencsv.jar /HttpSample/opencsv.jar
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/gson.jar /HttpSample/gson.jar
#Run mr job
cd /home/cloudera/uploaded_jars
#hadoop jar scoring-job.jar ru.megalabs.mapreduce.scoringcounter.Main -libjars gson.jar -files hdfs://0.0.0.0:8020/HttpSample/metadata.csv -libjars hdfs://0.0.0.0:8020/HttpSample/opencsv.jar, hdfs://0.0.0.0:8020/HttpSample/gson.jar /HttpSample/raw_traffic.json /HttpSample/output/scoring_result
hadoop jar scoring-job.jar ru.megalabs.mapreduce.scoringcounter.Main -files hdfs://0.0.0.0:8020/HttpSample/metadata.csv -libjars hdfs://0.0.0.0:8020/HttpSample/opencsv.jar,hdfs://0.0.0.0:8020/HttpSample/gson.jar /HttpSample/raw_traffic.json /HttpSample/output/scoring_result
And the code inside the Mapper:
public class ScoringCounterMapper extends Mapper<LongWritable, Text, GetReq, IntWritable> {

    private static final Log LOG = LogFactory.getLog(ScoringCounterMapper.class);

    private static final String METADATA_CSV = "metadata.csv";

    private List<RegexMetadata> regexMetadatas = null;

    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //bla-bla-bla
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The file shipped via -files is available in the task's working directory.
        MetadataCsvReader metadataCsvReader = new MetadataCsvReader(new File(METADATA_CSV));
        regexMetadatas = metadataCsvReader.getMetadata();
        for (RegexMetadata rm : regexMetadatas) {
            LOG.info(rm);
        }
    }
}
Note that:
1. I upload my metadata file to the node.
2. I put it into HDFS.
3. I provide the path to the file using the -files argument.
4. I specify that this file is inside HDFS (hdfs://0.0.0.0:8020/HttpSample/metadata.csv).
I want to encrypt a text file using AxCrypt. I did it through cmd, but I want to do it from a Java program. These are the four pieces I used in cmd:
1: location of the AxCrypt software directory
2: the AxCrypt command (which will be executed to encrypt the file)
3: the import file location (the file to encrypt)
4: the export file location (the directory for the encrypted file)
Here is my code:
public class TestCode {

    String axcryptLocation = "C:\\Program Files\\Axantum\\AxCrypt";
    String axcryptCommand = "AxCrypt.exe -e -k \"X2U4qPtdMTMZ K63D ABnS 3gO2 PHFL XKJ/ +UsZ /QuG yp5s X78k 2wH=\" -z";
    String fileImportLocation = "E:\\ImportExport\\firstcheck.txt";
    String fileExportLocation = "E:\\ImportExport\\";

    public static void main(String[] args) {
    }
}
You need the ProcessBuilder class. It's a bit tricky to use; for example, you should replace axcryptCommand with a list of arguments, because splitting on spaces while respecting quotes is something bash/cmd.exe does for you, not Java. The command itself should be an absolute path as well.
Here is a tutorial on ProcessBuilder.
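Roughly like this; a sketch only, reusing the paths and key from your snippet and assuming AxCrypt accepts the file to encrypt as its last argument:

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class AxCryptRunner {
    public static void main(String[] args) throws IOException, InterruptedException {
        // One list element per argument: no shell is involved, so nothing is
        // split on spaces or stripped of quotes for you.
        List<String> command = Arrays.asList(
                "C:\\Program Files\\Axantum\\AxCrypt\\AxCrypt.exe",   // absolute path to the executable
                "-e",
                "-k", "X2U4qPtdMTMZ K63D ABnS 3gO2 PHFL XKJ/ +UsZ /QuG yp5s X78k 2wH=",
                "-z",
                "E:\\ImportExport\\firstcheck.txt");

        ProcessBuilder pb = new ProcessBuilder(command);
        pb.directory(new File("E:\\ImportExport"));  // working directory for the process
        pb.redirectErrorStream(true);                // merge stderr into stdout
        Process process = pb.start();
        int exitCode = process.waitFor();
        System.out.println("AxCrypt exited with " + exitCode);
    }
}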
NB: it is not particularly complicated to encrypt things in Java code without relying on Windows-only executables. A web search will find you plenty of tutorials on how to do this, though, as is usual with crypto, you'll probably mess something up. Such is the nature of a task where failure cannot be tested for.
I'm trying to copy a file from local to HDFS in these three ways:
FileSystem fs = FileSystem.get(context.getConfiguration());
LocalFileSystem lfs = fs.getLocal(context.getConfiguration());
lfs.copyFromLocalFile(new Path("file:///pathToFile/file.properties"), new Path("/destPath/"));
fs.copyFromLocalFile(new Path("file:///pathToFile/file.properties"), new Path("/destPath/"));
fs.copyFromLocalFile(new Path("file:///pathToFile/file.properties"), new Path("/destPath/"));
But none of them are working.
I always get a FileNotFound exception for /pathToFile/file.properties, but the file exists on that path on Unix and has read and write permissions for the user that runs the Map/Reduce.
Any ideas what I'm missing here?
The job is running with Oozie on CDH4.
Thank you very much for your help.
opalo
Where is this code running?
If this code is running in a map or reduce method (as it seems because you have a Context instance), then you are executing on one of your slave nodes. Can all of your slave nodes see this path or can only the cluster's login node see the file?
If this code is in fact supposed to be running in a mapper or reducer, and the file is not local to those machines (and you do not want to put the file(s) into HDFS with a "hadoop fs -put" command), one option is to ship the file(s) with your job using the Hadoop distributed cache. You can do this programmatically using the DistributedCache class's static method addCacheFile, or through the command line, if your main class implements the Tool interface, by using the -files switch.
Programmatically (copied from the DistributedCache API documentation):
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
From the command line, if your main class implements the Tool interface:
hadoop jar Your.jar Package.Path.To.MainClass -files comma,separated,list,of,files program_argument_list_here
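On the task side, the cached files can then be located from the configure method (old mapred API shown; a rough sketch, with placeholder class and type names):

import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CacheAwareMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Path lookupFile;

    @Override
    public void configure(JobConf conf) {
        try {
            // Files registered via addCacheFile or -files are localized onto each
            // task node; getLocalCacheFiles returns their local filesystem paths.
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            if (cached != null && cached.length > 0) {
                lookupFile = cached[0];  // open and parse it here, keep the result in memory
            }
        } catch (IOException e) {
            throw new RuntimeException("Could not read the distributed cache", e);
        }
    }

    @Override
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
        // use the data loaded from lookupFile here
    }
}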
I have a Hadoop job that outputs many part files to a folder in HDFS.
For example:
/output/s3/2014-09-10/part...
What is the best way, using the S3 Java API, to upload those parts to a single file in S3?
For example:
s3://jobBucket/output-file-2014-09-10.csv
As a possible solution there is the option to merge the parts and write the result to a single HDFS file, but this would create double I/O.
Using a single reducer is not an option either.
Thanks,
Try the FileUtil#copyMerge method; it allows you to copy data between two file systems. I also found the S3DistCp tool, which can copy data from HDFS to Amazon S3. You can specify a --groupBy regular expression (e.g. (.*)) to merge the files.
Snippet for a Spark process:
void sparkProcess() {
    SparkConf sparkConf = new SparkConf().setAppName("name");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    Configuration hadoopConf = sc.hadoopConfiguration();
    hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKey);
    hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretKey);
    String folderPath = "s3://bucket/output/folder";
    String mergedFilePath = "s3://bucket/output/result.txt";
    BatchFileUtil.copyMerge(hadoopConf, folderPath, mergedFilePath);
}

public static boolean copyMerge(Configuration hadoopConfig, String srcPath, String dstPath) throws IOException, URISyntaxException {
    FileSystem hdfs = FileSystem.get(new URI(srcPath), hadoopConfig);
    return FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null);
}
Use the Java HDFS API to read the files, then use standard Java stream plumbing to get an InputStream, and then use
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/PutObjectRequest.html
See also
https://stackoverflow.com/a/11116119/1586965
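A rough sketch of that approach, using the v1 AWS SDK and a SequenceInputStream to chain the part files into one upload; the bucket name, key and paths are placeholders:

import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

public class HdfsPartsToS3 {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Collect the part files and chain their streams into one logical stream.
        List<InputStream> parts = new ArrayList<InputStream>();
        long totalLength = 0;
        for (FileStatus status : fs.globStatus(new Path("/output/s3/2014-09-10/part-*"))) {
            totalLength += status.getLen();
            parts.add(fs.open(status.getPath()));
        }
        InputStream merged = new SequenceInputStream(Collections.enumeration(parts));

        // The content length must be known up front when uploading from a stream,
        // otherwise the SDK buffers everything in memory.
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(totalLength);

        AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
        s3.putObject(new PutObjectRequest("jobBucket", "output-file-2014-09-10.csv", merged, metadata));
        merged.close();
    }
}

Note that a single PutObject is limited to 5 GB; beyond that you would have to switch to a multipart upload.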
The issue I'm having is that the hadoop jar command requires an input path, but my MapReduce job gets its input from a database and hence doesn't need or have an input directory. I've set the JobConf input format to DBInputFormat, but how do I signify this when jarring my job?
//Here is the command
hadoop jar <my-jar> <hdfs input> <hdfs output>
I have an output folder, but don't need an input folder. Is there a way to circumvent this? Do I need to write a second program that pulls the DB data into a folder and then use that in the MapReduce job?
The hadoop jar command itself requires no command-line arguments other than, possibly, the main class. The command-line arguments for your map/reduce job are decided by the program itself, so if it no longer requires an HDFS input path, you need to change the code to not require one.
public class MyJob extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // ...
        TextInputFormat.setInputPaths(job, new Path(args[0])); // or some other file input format
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        // ...
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
So you would remove the input path statement. There is no magic in jarring the job up; just change the InputFormat (which you said you did) and you should be set.
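For reference, a rough sketch of a driver that takes only an output path, assuming DBInputFormat from the new API; the connection details, table, columns and record class are placeholders, not your actual setup:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DbDrivenJob extends Configured implements Tool {

    // Placeholder record type mapping one table row.
    public static class MyRecord implements Writable, DBWritable {
        long id;
        String name;
        public void readFields(DataInput in) throws IOException { id = in.readLong(); name = Text.readString(in); }
        public void write(DataOutput out) throws IOException { out.writeLong(id); Text.writeString(out, name); }
        public void readFields(ResultSet rs) throws SQLException { id = rs.getLong("id"); name = rs.getString("name"); }
        public void write(PreparedStatement ps) throws SQLException { ps.setLong(1, id); ps.setString(2, name); }
    }

    @Override
    public int run(String[] args) throws Exception {
        // Connection details are placeholders.
        DBConfiguration.configureDB(getConf(), "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/mydb", "user", "password");

        Job job = Job.getInstance(getConf(), "db-driven-job");
        job.setJarByClass(DbDrivenJob.class);

        // Input comes from the database, so args[0] is the only path needed: the output.
        job.setInputFormatClass(DBInputFormat.class);
        DBInputFormat.setInput(job, MyRecord.class, "my_table", null, "id", "id", "name");
        TextOutputFormat.setOutputPath(job, new Path(args[0]));

        // job.setMapperClass(...): a mapper consuming (LongWritable, MyRecord) goes here.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new DbDrivenJob(), args));
    }
}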
I am trying to compress and archive all the files in a folder using the Java Runtime class. My code snippet looks like this:
public static void compressFileRuntime() throws IOException, InterruptedException {
    String date = Util.getDateAsString("yyyy-MM-dd");
    Runtime rt = Runtime.getRuntime();
    String archivedFile = "myuserData" + date + ".tar.bz2";
    String command = "tar --remove-files -cjvf " + archivedFile + " marketData*";
    File f = new File("/home/amit/Documents/");
    Process pr = rt.exec(command, null, f);
    System.out.println("Exit value: " + pr.exitValue());
}
The above code doesn't archive and compress the file as expected, though it creates a file myuserData2009-11-18.tar.bz2 in the folder "/home/amit/Documents/".
Also the output is
Exit value: 2.
While if I execute the same command from command line, it gives the expected result.
Please tell me what I am missing.
Thanks
Amit
The problem lies in this part:
" marketData*"
You expect the filenames to be archived to be expanded from the * wildcard. Globbing is done by the shell, not by the tools themselves. Your choices are to either:
1. enumerate the files to be archived yourself
2. start a shell to perform the command ("/bin/sh -c")
3. start tar on the folder containing the files to be archived
Edit:
For the shell option, your command would look like:
String command = "sh -c \"tar --remove-files -cjvf "+archivedFile+" marketData*\"";
(mind the \"s that delimit the command to be executed by the shell; don't use single quotes or the shell won't interpret the glob.)
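Also note that Runtime.exec(String) tokenizes the whole string on whitespace and does not honour those quotes, so it is safer to hand the pieces over as an array; a sketch of how your method could call it, reusing the same variables and path:

// The glob is expanded by sh, not by tar; exec(String[]) avoids the naive
// whitespace tokenization that exec(String) would apply to the quoted command.
String[] shellCommand = { "sh", "-c", "tar --remove-files -cjvf " + archivedFile + " marketData*" };
Process pr = Runtime.getRuntime().exec(shellCommand, null, new File("/home/amit/Documents/"));
pr.waitFor();  // wait for tar to finish before asking for the exit value
System.out.println("Exit value: " + pr.exitValue());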
If you really want to create a bzip2 archive, I'd use a Java implementation instead of a native command, which is better for portability, for example the one available at http://www.kohsuke.org/bzip2/ (it is not really optimized though; compression seems to be slower than with Java LZMA).