How to pass a file as a parameter in MapReduce - Java

I want to search for particular words in a file and display their counts. When the word to be searched is a single word, I can do it by setting it in the configuration in the driver, like below:
Driver class:
Configuration conf = new Configuration();
conf.set("wordtosearch", "fun");
Mapper class:
public static class SearchMapper extends
        Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // retrieve the wordToSearch variable
        String wordToSearch = conf.get("wordtosearch");
        String txt = value.toString();
        if (txt.compareTo(wordToSearch) == 0) {
            word.set(txt);
            context.write(word, one);
        }
    }
}
But when there is a list of words in a file, I don't know how to pass it. Some posts refer to using the distributed cache, but when I do that I get a "DistributedCache is deprecated" warning. Is there a similar way in the new API to pass the file?

Yes, there is also a way in the new API.
First, store the file in HDFS. Then, in the Driver class (in the main method), do the following:
Configuration conf = getConf();
...
Job job = Job.getInstance(conf); ...
job.addCacheFile(new Path(filename).toUri());
Finally, in the mapper class (for instance in the setup() method), do the following:
URI[] localPaths = context.getCacheFiles();
If you have a single file, it should be stored in localPaths[0].
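For example, a minimal sketch of the mapper side, assuming the cached file is a plain-text word list with one word per line (the wordsToSearch field and the file layout are assumptions, not from the original answer):
// needs: java.io.BufferedReader, java.io.FileReader, java.net.URI,
// java.util.HashSet, java.util.Set, org.apache.hadoop.fs.Path

private final Set<String> wordsToSearch = new HashSet<>();

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    URI[] cacheFiles = context.getCacheFiles();
    if (cacheFiles != null && cacheFiles.length > 0) {
        // cache files are localized into the task's working directory under their
        // own file name, so the word list can be opened like a local file
        String localName = new Path(cacheFiles[0].getPath()).getName();
        try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                wordsToSearch.add(line.trim());
            }
        }
    }
}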

You can also try this: check whether the parameter is a file, and then execute the appropriate operation according to the parameter's type.

If the list of words has a reasonable size, you can still pass it to the configuration:
Driver class: read the file
Driver class: add the list of words to the configuration, for instance conf.set("wordListToSearch", "fun:foo:bar")
Mapper class: read the configuration and retrieve your list of words, as in the sketch after this list
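A minimal sketch of that approach, assuming the word list lives in a local file wordlist.txt and that ':' never occurs inside a word (both are assumptions, not from the original answer):
// needs: java.nio.file.Files, java.nio.file.Paths, java.nio.charset.StandardCharsets,
// java.util.Arrays, java.util.HashSet, java.util.List, java.util.Set

// Driver: read the file and pack the words into the job configuration.
List<String> words = Files.readAllLines(Paths.get("wordlist.txt"), StandardCharsets.UTF_8);
conf.set("wordListToSearch", String.join(":", words));

// Mapper (for instance in setup()): unpack the list back into a Set.
Set<String> wordsToSearch = new HashSet<>(
        Arrays.asList(context.getConfiguration().get("wordListToSearch").split(":")));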


Common configuration 2 - commented properties

I'm developing a Java desktop application to use as a tool for manipulating properties files.
My application needs to load a TXT file containing several keys (or properties) and be able to change their values.
I use the Apache Commons Configuration 2 project to do this work.
Everything was going well until I faced this situation:
there is a TXT file containing several properties, but one property (or key) has a # before its name (look at the config.txt file below), so Apache Commons does not recognize this property.
When I set a value for this property (look at the last two lines of the main method), the Apache Commons configurator creates a new line in config.txt.
This is not good for me, because I need to preserve the original config.txt format, so if I need to set a value for the fileDb property, my expected behavior is to remove the # from the fileDb name and not create a new fileDb property in config.txt.
The program that uses this configuration file (config.txt) understands that if the property is commented (if it has "#" in its name), it should not be used, but if it is uncommented (without "#" in the name), it should be used and its value should be read.
So what can I do to add or remove "#" from the property name using Commons Configuration? If it is not possible to do it with the Commons configurator, how can it be done?
In the ConfigTest class I show how the problem occurs.
Before running the code, the file config.txt has a commented property named fileDb:
#fileDb = C:/test/test.db
The main method tries to change the fileDb value to "C://aa//bb//cc",
but after the execution ends, a new property with the same name "fileDb" appears, but uncommented (without "#"). In the end, config.txt has two fileDb properties, one with "#" and the other without "#":
#fileDb = C:/test/test.db
fileDb = C://aa//bb//cc
My expected behavior after the code execution is to have only one fileDb property:
fileDb = C://aa//bb//cc
After being able to uncomment this property, I expect to be able to comment it again if necessary (if the user wishes to do so in my app).
ConfigTest class
import org.apache.commons.configuration2.Configuration;
import org.apache.commons.configuration2.FileBasedConfiguration;
import org.apache.commons.configuration2.PropertiesConfiguration;
import org.apache.commons.configuration2.builder.FileBasedConfigurationBuilder;
import org.apache.commons.configuration2.builder.fluent.FileBasedBuilderParameters;
import org.apache.commons.configuration2.builder.fluent.Parameters;
import org.apache.commons.configuration2.ex.ConfigurationException;

import java.io.File;

public class ConfigTest {

    private Configuration localConfiguration;
    private FileBasedConfigurationBuilder<FileBasedConfiguration> fileBuilder;

    private ConfigTest() throws ConfigurationException {
        this.loadPropertiesFile();
    }

    private void loadPropertiesFile() throws ConfigurationException {
        File file = new File("config.txt");
        FileBasedBuilderParameters fb = new Parameters().fileBased();
        fb.setFile(file);
        fileBuilder = new FileBasedConfigurationBuilder<FileBasedConfiguration>(PropertiesConfiguration.class);
        fileBuilder.configure(fb);
        localConfiguration = fileBuilder.getConfiguration();
    }

    public boolean saveModifications() {
        try {
            fileBuilder.save();
            return true;
        } catch (ConfigurationException e) {
            e.printStackTrace();
            return false;
        }
    }

    public void setProperty(String key, String value) {
        localConfiguration.setProperty(key, value);
    }

    public static void main(String[] args) throws ConfigurationException {
        ConfigTest test = new ConfigTest();
        // change the outputOnSSD property value to false
        test.setProperty("outputOnSSD", "false");
        // when a new value is set for the fileDb property, a new line with a new
        // fileDb property is created in config.txt
        // the question is: how to remove # from the fileDb property name in config.txt
        // so that Commons Configuration recognizes it as an existing property?
        test.setProperty("fileDb", "C://aa//bb//cc");
        test.saveModifications();
    }
}
File "config.txt" used by ConfigTest class to change his value
########################################################################
# Local environment configuration
########################################################################
# Defines program localization/language
locale = pt-BR
# Temporary directory for processing: "default" uses the system temporary folder.
indexTemp = /home/downloads
# Enable if is on a SSD disk.
indexTempOnSSD = true
# Enable if output/case folder is on SSD. If enabled, index is created directly in case folder,
outputOnSSD = false
# Number of processing threads/workers: "default" uses the number of CPU logical cores.
numThreads = 14
# Full path for hash index database.
#fileDb = C:/test/test.db
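One possible workaround, sketched under the assumption that it is acceptable to rewrite config.txt as plain text before loading it with Commons Configuration (the class name PropertyUncommenter is hypothetical):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class PropertyUncommenter {

    // Removes a leading "#" from the line that defines the given key, if present,
    // so that PropertiesConfiguration treats it as a normal property afterwards.
    public static void uncomment(Path file, String key) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        List<String> rewritten = lines.stream()
                .map(line -> line.trim().startsWith("#" + key)
                        ? line.replaceFirst("#\\s*", "")
                        : line)
                .collect(Collectors.toList());
        Files.write(file, rewritten, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // after this call, "fileDb" is an ordinary property, so
        // setProperty("fileDb", ...) updates the existing line instead of appending a new one
        uncomment(Paths.get("config.txt"), "fileDb");
    }
}
Commenting the property out again would be the mirror operation: prefix the matching line with "#" and reload the configuration.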

How to get single GridFS file using Java driver 3.7+?

I need to get a single GridFS file using the Java driver 3.7+.
I have two collections for the file in a database: photo.files and photo.chunks.
The photo.chunks collection contains the binary chunks of the file.
The photo.files collection contains the metadata of the document.
To find a document in a regular collection, I wrote:
Document doc = collection_messages.find(eq("flag", true)).first();
String messageText = (String) Objects.requireNonNull(doc).get("message");
I tried to find the file in the same way as in the example above, according to my collections shown in the screenshots:
MongoDatabase database_photos = mongoClient.getDatabase("database_photos");
GridFSBucket photos_fs = GridFSBuckets.create(database_photos, "photos");
...
...
GridFSFindIterable gridFSFile = photos_fs.find(eq("_id", new ObjectId()));
String file = Objects.requireNonNull(gridFSFile.first()).getMD5();
And like:
GridFSFindIterable gridFSFile = photos_fs.find(eq("_id", new ObjectId()));
String file = Objects.requireNonNull(gridFSFile.first()).getFilename();
But I get an error:
java.lang.NullPointerException
at java.util.Objects.requireNonNull(Objects.java:203)
at project.Bot.onUpdateReceived(Bot.java:832)
at java.util.ArrayList.forEach(ArrayList.java:1249)
I also checked the docs for the 3.7 driver, but that example shows how to find several files, whereas I need a single one:
gridFSBucket.find().forEach(
    new Block<GridFSFile>() {
        public void apply(final GridFSFile gridFSFile) {
            System.out.println(gridFSFile.getFilename());
        }
    });
Can someone show me an example of how to do this properly?
I mean getting data from the chunks collection by ObjectId, and the md5 field, also by ObjectId, from the metadata collection.
Thanks in advance.
To find and use specific files:
photos_fs.find(eq("_id", objectId)).forEach(
(Block<GridFSFile>) gridFSFile -> {
// to do something
});
Or, as an alternative, I can find a specific field of the file.
This can be done by first obtaining the ObjectId of the first file, then passing it to a GridFSFindIterable object to get the particular field and value from the database, and finally converting the result into a String.
MongoDatabase database_photos = mongoClient.getDatabase("database_photos");
GridFSBucket photos_fs = GridFSBuckets.create(database_photos, "photos");
...
...
ObjectId objectId = Objects.requireNonNull(photos_fs.find().first()).getObjectId();
GridFSFindIterable gridFSFindIterable = photos_fs.find(eq("_id", objectId));
GridFSFile gridFSFile = Objects.requireNonNull(gridFSFindIterable.first());
String file = Objects.requireNonNull(gridFSFile).getMD5();
But this reads from the photo.files collection, not from the photo.chunks collection.
And I'm not sure this approach is safe, because of a debug warning, but it works despite it:
Inconvertible types; cannot cast 'com.mongodb.client.gridfs.model.GridFSFile' to 'com.mongodb.client.gridfs.GridFSFindIterableImpl'
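For the actual binary content (the photo.chunks documents), a hedged sketch assuming driver 3.7+; writing the bytes to a local file photo.jpg is only an example, the output path is hypothetical:
// needs: java.io.FileOutputStream, java.util.Objects, org.bson.types.ObjectId
ObjectId objectId = Objects.requireNonNull(photos_fs.find().first()).getObjectId();
try (FileOutputStream out = new FileOutputStream("photo.jpg")) {
    // downloadToStream reads the chunk documents and streams the raw file bytes
    photos_fs.downloadToStream(objectId, out);
}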

About inserting dynamic data into the static content at run time

I have a template like this in a properties file: "Dear xxxxxx, your payment is successful."
After loading this template from the properties file, I want to replace that "xxxxxx" with dynamic data in a Java class.
Please help me with this.
I used MessageFormat.format(template, objectArray), where the object array holds the dynamic data.
Try it this way.
placeholderReplacementMap is a map that contains your placeholder-key/dynamic-value pairs. Note that StrSubstitutor resolves placeholders written as ${...} in the template:
Map<String, Object> placeholderReplacementMap = new HashMap<>();
placeholderReplacementMap.put("xxxxxx", dynamicValue);
StrSubstitutor substitutor = new StrSubstitutor(placeholderReplacementMap);
String newString = substitutor.replace("Dear ${xxxxxx}, your payment is successful.");
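Alternatively, a minimal sketch using java.text.MessageFormat, which the question already mentions; it assumes the template stored in the properties file uses an indexed placeholder such as {0} instead of xxxxxx:
import java.text.MessageFormat;

// template as it would appear in the properties file:
// payment.template = Dear {0}, your payment is successful.
String template = "Dear {0}, your payment is successful.";
String message = MessageFormat.format(template, "John");
// message -> "Dear John, your payment is successful."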

unexpected multiple execution of mapper intended to run once

I tried to write a very simple job with only one mapper and no reducer to write some data to HBase. In the mapper I simply open a connection to HBase, write a few rows of data to a table, and then close the connection. In the job driver I am using JobConf.setNumMapTasks(1) and JobConf.setNumReduceTasks(0) to specify that only one mapper and no reducers are to be executed. I am also setting the reducer class to IdentityReducer in the JobConf.

The strange behavior I am observing is that the job successfully writes the data to the HBase table, but after that I see in the logs that it continuously opens a connection to HBase and then closes it, which goes on for 20-30 minutes, after which the job is declared to have completed with 100% success. At the end, when I check the _success file created by the dummy data I put in OutputCollector.collect(...), I see hundreds of rows of dummy data when there should only be one.
Following is the code for job driver
public int run(String[] arg0) throws Exception {
    Configuration config = HBaseConfiguration.create(getConf());
    ensureRequiredParametersExist(config);
    ensureOptionalParametersExist(config);

    JobConf jobConf = new JobConf(config, getClass());
    jobConf.setJobName(config.get(ETLJobConstants.ETL_JOB_NAME));

    // set map specific configuration
    jobConf.setNumMapTasks(1);
    jobConf.setMaxMapAttempts(1);
    jobConf.setInputFormat(TextInputFormat.class);
    jobConf.setMapperClass(SingletonMapper.class);
    jobConf.setMapOutputKeyClass(LongWritable.class);
    jobConf.setMapOutputValueClass(Text.class);

    // set reducer specific configuration
    jobConf.setReducerClass(IdentityReducer.class);
    jobConf.setOutputKeyClass(LongWritable.class);
    jobConf.setOutputValueClass(Text.class);
    jobConf.setOutputFormat(TextOutputFormat.class);
    jobConf.setNumReduceTasks(0);

    // set job specific configuration details like input file name etc
    FileInputFormat.setInputPaths(jobConf, jobConf.get(ETLJobConstants.ETL_JOB_FILE_INPUT_PATH));
    System.out.println("setting output path to : " + jobConf.get(ETLJobConstants.ETL_JOB_FILE_OUTPUT_PATH));
    FileOutputFormat.setOutputPath(jobConf,
            new Path(jobConf.get(ETLJobConstants.ETL_JOB_FILE_OUTPUT_PATH)));

    JobClient.runJob(jobConf);
    return 0;
}
The driver class extends Configured and implements Tool (I used the sample from the Definitive Guide).
Following is the code in my Mapper's map method, where I simply open the connection to HBase, do a preliminary check to make sure the table exists, and then write the rows and close the table.
public void map(LongWritable arg0, Text arg1,
        OutputCollector<LongWritable, Text> arg2, Reporter arg3)
        throws IOException {
    HTable aTable = null;
    HBaseAdmin admin = null;
    try {
        arg3.setStatus("started");
        /*
         * set up hbase config
         */
        admin = new HBaseAdmin(conf);
        /*
         * open connection to table
         */
        String tableName = conf.get(ETLJobConstants.ETL_JOB_TABLE_NAME);
        HTableDescriptor htd = new HTableDescriptor(toBytes(tableName));
        String colFamilyName = conf.get(ETLJobConstants.ETL_JOB_TABLE_COLUMN_FAMILY_NAME);
        byte[] tablename = htd.getName();
        /* call function to ensure table with 'tablename' exists */
        /*
         * loop and put the file data into the table
         */
        aTable = new HTable(conf, tableName);
        DataRow row = /* logic to generate data */
        while (row != null) {
            byte[] rowKey = toBytes(row.getRowKey());
            Put put = new Put(rowKey);
            for (DataNode node : row.getRowData()) {
                put.add(toBytes(colFamilyName), toBytes(node.getNodeName()),
                        toBytes(node.getNodeValue()));
            }
            aTable.put(put);
            arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxo added another data row to hbase");
            row = fileParser.getNextRow();
        }
        aTable.flushCommits();
        arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxo Finished adding data to hbase");
    } finally {
        if (aTable != null) {
            aTable.close();
        }
        if (admin != null) {
            admin.close();
        }
    }
    arg2.collect(new LongWritable(10), new Text("something"));
    arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxoadded some dummy data to the collector");
}
As you can see near the end, I am writing some dummy data to the collector (10, 'something'), and I see hundreds of rows of this data in the _success file after the job has terminated.
I can't figure out why the mapper code is restarted over and over instead of running just once. Any help would be greatly appreciated.
Using JobConf.setNumMapTasks(1) is just a hint to Hadoop that you wish to use 1 mapper, if possible, unlike setNumReduceTasks, which actually defines the exact number you specified.
That's why more map tasks are run and you observe all that extra output.
For more details, please read this post.
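Concretely, with the old API the number of map tasks is driven by the number of input splits the InputFormat produces. If a single map task is genuinely required, one common trick (a sketch, not taken from the original answer) is a non-splittable input format combined with a single input file:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// With the old (mapred) API used in the question: one input file plus a
// non-splittable input format yields exactly one input split, and therefore
// exactly one map task.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
In the driver, jobConf.setInputFormat(NonSplittableTextInputFormat.class) would then replace the TextInputFormat line.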

Bulk Insert Data into HBase using MapReduce

I need to insert 400 million rows into an HBase table.
The schema looks something like this: I generate the key by simply concatenating two ints, and the value is System.nanoTime().
My mapper looks something like this:
public class DatasetMapper extends TableMapper<Text, LongWritable> {

    private static Configuration conf = HBaseConfiguration.create();

    public void map(Text key, LongWritable values, Context context)
            throws IOException, InterruptedException {
        // instantiate HTable object that connects to the table name
        HTable htable = new HTable(conf, "temp"); // already created temp table
        htable.setAutoFlush(false);
        htable.setWriteBufferSize(1024 * 1024 * 12);
        // construct key
        int i = 0, j = 0;
        for (i = 0; i < 400000000; i++) {
            String rowkey = Integer.toString(i).concat(Integer.toString(j));
            Long value = Math.abs(System.nanoTime());
            Put put = new Put(Bytes.toBytes(rowkey));
            put.add(Bytes.toBytes("location"), Bytes.toBytes("longlat"),
                    Bytes.toBytes(value));
            htable.put(put);
            j++;
            htable.flushCommits();
        }
    }
}
And my job looks like this:
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "initdb");
job.setJarByClass(DatasetMapper.class); // class that contains mapper
TableMapReduceUtil.initTableMapperJob(
        null,                 // input table
        null,
        DatabaseMapper.class, // mapper class
        null,                 // mapper output key
        null,                 // mapper output value
        job);
TableMapReduceUtil.initTableReducerJob(
        temp,                 // output table
        null,                 // reducer class
        job);
job.setNumReduceTasks(0);
boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
The job runs but inserts 0 records. I know I am making some mistake, but I am not able to catch it as I am new to HBase. Please help me.
Thanks.
First things first: the name of your mapper is DatasetMapper, but in your job config you have specified DatabaseMapper. I am wondering how it is working without any error.
Next, it looks like you have mixed TableMapper and Mapper usage together. HBase's TableMapper is an abstract class which extends Hadoop's Mapper and helps us read from HBase conveniently, and TableReducer helps in writing back to HBase. You are trying to put data from your Mapper, and you are using TableReducer at the same time. Your mapper will actually never get called.
Either use TableReducer to put the data, or just use a plain Mapper. If you really wish to do it in your Mapper, you can use the TableOutputFormat class, as sketched below. See the example given on page 301 of the HBase Definitive Guide; this is the Google Books link.
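A hedged sketch of that TableOutputFormat route with the new (mapreduce) API, keeping the table name "temp" and the column family "location" from the question but otherwise assuming class names and the input path:
// needs the usual org.apache.hadoop.hbase.client, org.apache.hadoop.hbase.io,
// org.apache.hadoop.hbase.mapreduce and org.apache.hadoop.mapreduce imports

// Driver side: tell TableOutputFormat which table to write to and skip the reducer.
Configuration conf = HBaseConfiguration.create();
conf.set(TableOutputFormat.OUTPUT_TABLE, "temp");
Job job = new Job(conf, "initdb");
job.setJarByClass(BulkWriteMapper.class);
job.setMapperClass(BulkWriteMapper.class);
job.setOutputFormatClass(TableOutputFormat.class);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Put.class);
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, new Path("/input")); // hypothetical input path

// Mapper side: emit Put objects; TableOutputFormat writes them to HBase.
public class BulkWriteMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Put put = new Put(Bytes.toBytes(value.toString()));
        put.add(Bytes.toBytes("location"), Bytes.toBytes("longlat"),
                Bytes.toBytes(System.nanoTime()));
        context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
}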
HTH
P.S.: You might find these links helpful for learning HBase + MR integration properly:
Link 1.
Link 2.
