Hadoop - Writing to HBase directly from the Mapper - java

I have a Hadoop job whose output should be written to HBase. I do not really need a reducer; the kind of row I want to insert is determined in the Mapper.
How can I use TableOutputFormat to achieve this? In all the examples I have seen, the assumption is that the reducer is the one creating the Put, and that TableMapper is just for reading from an HBase table.
In my case the input is HDFS and the output is a Put to a specific table; I cannot find anything in TableMapReduceUtil that helps with that either.
Is there any example out there that can help me with that?
BTW, I am using the new Hadoop API

This is an example that reads from a file and puts all the lines into HBase. It is taken from "HBase: The Definitive Guide" and you can find it in the book's repository. To get it, just clone the repo onto your computer:
git clone git://github.com/larsgeorge/hbase-book.git
The book also explains the code in detail, but if anything is unclear, feel free to ask.
public class ImportFromFile {
  public static final String NAME = "ImportFromFile";
  public enum Counters { LINES }

  static class ImportMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Writable> {
    private byte[] family = null;
    private byte[] qualifier = null;

    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      String column = context.getConfiguration().get("conf.column");
      byte[][] colkey = KeyValue.parseColumn(Bytes.toBytes(column));
      family = colkey[0];
      if (colkey.length > 1) {
        qualifier = colkey[1];
      }
    }

    @Override
    public void map(LongWritable offset, Text line, Context context)
        throws IOException {
      try {
        String lineString = line.toString();
        byte[] rowkey = DigestUtils.md5(lineString);
        Put put = new Put(rowkey);
        put.add(family, qualifier, Bytes.toBytes(lineString));
        context.write(new ImmutableBytesWritable(rowkey), put);
        context.getCounter(Counters.LINES).increment(1);
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
  }

  private static CommandLine parseArgs(String[] args) throws ParseException {
    Options options = new Options();
    Option o = new Option("t", "table", true,
        "table to import into (must exist)");
    o.setArgName("table-name");
    o.setRequired(true);
    options.addOption(o);
    o = new Option("c", "column", true,
        "column to store row data into (must exist)");
    o.setArgName("family:qualifier");
    o.setRequired(true);
    options.addOption(o);
    o = new Option("i", "input", true,
        "the directory or file to read from");
    o.setArgName("path-in-HDFS");
    o.setRequired(true);
    options.addOption(o);
    options.addOption("d", "debug", false, "switch on DEBUG log level");
    CommandLineParser parser = new PosixParser();
    CommandLine cmd = null;
    try {
      cmd = parser.parse(options, args);
    } catch (Exception e) {
      System.err.println("ERROR: " + e.getMessage() + "\n");
      HelpFormatter formatter = new HelpFormatter();
      formatter.printHelp(NAME + " ", options, true);
      System.exit(-1);
    }
    return cmd;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    String[] otherArgs =
        new GenericOptionsParser(conf, args).getRemainingArgs();
    CommandLine cmd = parseArgs(otherArgs);
    String table = cmd.getOptionValue("t");
    String input = cmd.getOptionValue("i");
    String column = cmd.getOptionValue("c");
    conf.set("conf.column", column);
    Job job = new Job(conf, "Import from file " + input + " into table " + table);
    job.setJarByClass(ImportFromFile.class);
    job.setMapperClass(ImportMapper.class);
    job.setOutputFormatClass(TableOutputFormat.class);
    job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, table);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Writable.class);
    job.setNumReduceTasks(0);
    FileInputFormat.addInputPath(job, new Path(input));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
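For reference, an invocation would look something like the following; the jar name, table, column, and input path are placeholders, and depending on how the jar is packaged you may need the fully qualified class name instead of the short one:
hadoop jar hbase-book-examples.jar ImportFromFile -t testtable -c data:line -i /user/you/input.txt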

You just need to make the mapper output the (ImmutableBytesWritable, Put) pair. The OutputFormat only specifies how to persist the output key-values; it does not require that those key-values come from a reducer.
Since your input is HDFS, you extend a plain Mapper rather than TableMapper (TableMapper only fixes the input types for reading from an HBase table) and do something like this:
... extends Mapper<LongWritable, Text, ImmutableBytesWritable, Writable> {
    ...
    context.write(<some key>, <some Put or Delete object>);
}
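For a fuller picture, here is a minimal, self-contained sketch of that map-only setup. The table name "mytable", the column family "cf", and the choice of row key are placeholders, not anything prescribed by the HBase API:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class HdfsToHBase {

  // Plain Mapper: reads text lines from HDFS and emits Puts for TableOutputFormat.
  static class PutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      byte[] rowkey = Bytes.toBytes(line.toString());   // choose your own row key
      Put put = new Put(rowkey);
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("raw"),
          Bytes.toBytes(line.toString()));              // addColumn() in newer HBase versions
      context.write(new ImmutableBytesWritable(rowkey), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "HDFS to HBase, map-only");
    job.setJarByClass(HdfsToHBase.class);
    job.setMapperClass(PutMapper.class);
    // Convenience wiring: with a null reducer class this configures
    // TableOutputFormat and the output key/value classes for the table.
    TableMapReduceUtil.initTableReducerJob("mytable", null, job);
    job.setNumReduceTasks(0);                           // map-only job
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

TableMapReduceUtil.initTableReducerJob with a null reducer class is only a convenience here; setting TableOutputFormat and the output classes by hand, as the ImportFromFile example above does, works just as well.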

Related

MapReduce: Reduce function is writing strange values that are not expected

My reduce function in Java is writing values to the output file that I do not expect. I inspected my code with breakpoints and saw that, for every context.write call I make, the key and the value being written are correct. Where am I going wrong?
What I am trying to do is take input rows of the form date, customer, vendor, amount, each representing a transaction, and generate a dataset with rows of the form date, user, balance, where the balance is the sum of all transactions in which the user appears as either customer or vendor.
Here is my code:
public class Transactions {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, Text>{
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
var splittedValues = value.toString().split(",");
var date = splittedValues[0];
var customer = splittedValues[1];
var vendor = splittedValues[2];
var amount = splittedValues[3];
var reduceValue = new Text(customer + "," + vendor + "," + amount);
context.write(new Text(date), reduceValue);
}
}
public static class IntSumReducer
extends Reducer<Text,Text,Text,Text> {
public void reduce(Text key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException {
Map<String, Integer> balanceByUserId = new ConcurrentHashMap<>();
values.forEach(transaction -> {
var splittedTransaction = transaction.toString().split(",");
var customer = splittedTransaction[0];
var vendor = splittedTransaction[1];
var amount = 0;
if (splittedTransaction.length > 2) {
amount = Integer.parseInt(splittedTransaction[2]);
}
if (!balanceByUserId.containsKey(customer)) {
balanceByUserId.put(customer, 0);
}
if (!balanceByUserId.containsKey(vendor)) {
balanceByUserId.put(vendor, 0);
}
balanceByUserId.put(customer, balanceByUserId.get(customer) - amount);
balanceByUserId.put(vendor, balanceByUserId.get(vendor) + amount);
});
balanceByUserId.entrySet().forEach(entry -> {
var reducerValue = new Text(entry.getKey() + "," + entry.getValue().toString());
try {
context.write(key, reducerValue);
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}
});
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "transactions");
job.setJarByClass(Transactions.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
where the balance is the sum of all transactions in which the user appears as either customer or vendor
balanceByUserId exists only per unique date, because your map key is the date.
If you want to aggregate by customer info (name / ID?), then the customer should be the key of the mapper output.
Once all data for each customer is grouped by the reducer, you can sort by date if needed, but aggregate by the other details.
Another source of unexpected values: you register IntSumReducer as the combiner, but it emits "user,balance" pairs while the mapper emits "customer,vendor,amount", so the reducer ends up re-parsing already-combined records in the wrong format.
Also worth pointing out that this would be easier in Hive or SparkSQL than in MapReduce.
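To make that concrete, here is a rough sketch of the restructuring being suggested. The field positions follow the input format described in the question; the class names and everything else are illustrative rather than a drop-in fix:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper keyed by user: each transaction produces two records,
// money out for the customer and money in for the vendor.
class UserBalanceMapper extends Mapper<Object, Text, Text, IntWritable> {
  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");   // date, customer, vendor, amount
    String customer = fields[1];
    String vendor = fields[2];
    int amount = Integer.parseInt(fields[3]);
    context.write(new Text(customer), new IntWritable(-amount));
    context.write(new Text(vendor), new IntWritable(amount));
  }
}

// Reducer: sums the signed amounts per user.
class UserBalanceReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text user, Iterable<IntWritable> amounts, Context context)
      throws IOException, InterruptedException {
    int balance = 0;
    for (IntWritable amount : amounts) {
      balance += amount.get();
    }
    context.write(user, new IntWritable(balance));   // user \t balance
  }
}

Because the reducer's input and output types are now identical and the operation is a plain sum, this reducer can also safely double as the combiner.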

Creating multiple files after receiving a queue message

I have a project that reads from MongoDB and sends that information to a queue. My listener picks up the queue message from the cloud, and I am able to create a .txt file that contains all the information from the queue. The problem I have been searching for an answer to is: how can I sort on a specific field inside the POJO (IBusiness1, IBusiness2, IBusiness3) and create a file for each value? The following code only creates one .txt file and does not sort on that field:
public static void main(String[] args) {
SpringApplication.run(PaymentPortalBatchListenerApplication.class, args);
}
private class MessageHandler implements IMessageHandler {
private final Logger logger = LoggerFactory.getLogger(MessageHandler.class);
public CompletableFuture<Void> onMessageAsync(IMessage message) {
System.out.println("received "+message.getBody());
ObjectMapper om = new ObjectMapper();
PortalList auditList = null;
try {
auditList = om.readValue( message.getBody(), PortalList.class );
System.out.println( "**Audit Message " + auditList );
logger.info( "Creating file");
String exportFilePath = "C:\\filewriter\\IBusiness1 " +
LocalDateTime.now().format(formatter) + ".txt";
File file = new File(exportFilePath);
FileWriter writeToFile = new FileWriter(file);
String exportFileHeader = "CREATE_DTTM|FNAME|LNAME|IBusiness";
StringHeaderWriter headerWriter = new
StringHeaderWriter(exportFileHeader);
writeToFile.write(exportFileHeader);
writeToFile.write( String.valueOf( headerWriter));
writeToFile.write( String.valueOf(auditList));
writeToFile.flush();
} catch (IOException e) {
e.printStackTrace();
}
// System.out.println(auditList);
return CompletableFuture.completedFuture(null);
}
Here is what I did:
PaymentPortalBean = the POJO
auditList = the on-prem copy of the PaymentPortalBean data
PortalList =
import lombok.Data;
import java.util.List;

@Data
public class PortalList {
    private List<PaymentPortalBean> portalList;
}
The answer for creating the files:
for (PaymentPortalBean bean : auditList.getPortalList()) {
    if (bean.getRxBusiness().contains("IBusiness")) {
        File file = new File(exportFilePath);
        FileWriter writeToFile = new FileWriter(file);
        String exportFileHeader = "CREATE_DTTM|FNAME|LNAME|IBusiness";
        writeToFile.write(exportFileHeader);
        writeToFile.write(String.valueOf(bean));
        writeToFile.flush();
    }
}
That worked to find IBusiness; I created two more conditional statements for the other types I needed, and it runs fine.
MongoDB was able to separate the fields I needed.
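For anyone with the same problem, the same idea can be written more compactly by grouping the beans by the business field first and then writing one file per group. The getter name getRxBusiness(), the output directory, and the formatter field are assumptions taken from the snippets above, not the project's real API:

// Sketch: one output file per distinct business value.
// Assumes imports for java.io.FileWriter, java.io.IOException, java.time.LocalDateTime,
// java.time.format.DateTimeFormatter, java.util.List, java.util.Map and java.util.stream.Collectors.
private void writeFilesPerBusiness(PortalList auditList, DateTimeFormatter formatter)
        throws IOException {
    Map<String, List<PaymentPortalBean>> byBusiness = auditList.getPortalList().stream()
            .collect(Collectors.groupingBy(PaymentPortalBean::getRxBusiness));

    String header = "CREATE_DTTM|FNAME|LNAME|IBusiness";
    for (Map.Entry<String, List<PaymentPortalBean>> entry : byBusiness.entrySet()) {
        String path = "C:\\filewriter\\" + entry.getKey() + " "
                + LocalDateTime.now().format(formatter) + ".txt";
        try (FileWriter writer = new FileWriter(path)) {   // closed automatically
            writer.write(header + System.lineSeparator());
            for (PaymentPortalBean bean : entry.getValue()) {
                writer.write(bean + System.lineSeparator());
            }
        }
    }
}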

CSVReader does not check the whole file

I am trying to open a CSV file using openCSV, iterate over every row, and, if the patient ID is not already present, write a new JavaBean pair at the end of the file.
The problem is that the reader only checks the first row of my file and not the whole file. When created, the file contains only a header and nothing else. The program should check every row and, if the pseudoID is different, write it to the file. If the pseudoID in the first line is equal to the one passed in from my main class, it is recognised and not written again. But if this same pseudoID is in the second row, it is not recognised and is written again.
For instance, if my CSV looks like this it will work:
"Patient_id Pseudo_ID",
"32415","PAT106663926"
If it looks like this, it will re-write the pseudoID:
"Patient_id Pseudo_ID",
"32416","PAT104958880"
"32415","PAT106663926"
Thanks!
My Code:
public class CSVConnection {
@SuppressWarnings({ "deprecation", "resource", "rawtypes", "unchecked" })
public String getID(String sID,String pseudoID) throws IOException, CsvDataTypeMismatchException, CsvRequiredFieldEmptyException{
try {
CsvToBean csv = new CsvToBean();
String csvFilename = "CsvFile.csv";
Writer writer= new FileWriter(csvFilename,true);
CSVReader csvReader = new CSVReader(new FileReader(csvFilename),',','"',1);
ColumnPositionMappingStrategy strategy = new ColumnPositionMappingStrategy();
strategy.setType(PatientCSV.class);
String[] columns = new String[] {"patID","pseudoID"};
strategy.setColumnMapping(columns);
//Set column mapping strategy
StatefulBeanToCsv<PatientCSV> bc = new StatefulBeanToCsvBuilder<PatientCSV>(writer).withMappingStrategy(strategy).build();
List patList = csv.parse(strategy, csvReader);
for (Object patObj : patList) {
PatientCSV pat = (PatientCSV) patObj;
if(((PatientCSV) patObj).getPatID().equals(sID)){
return pat.getPseudoID();
}
else
{
PatientCSV pat1 = new PatientCSV();
pat1.setPatID(sID);
pat1.setPseudoID(pseudoID);
patList.add(pat1);
/*Find a way to import it to the CSV*/
bc.write(pat1);
writer.close();
return pseudoID;
}
}
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return null;
}
public static void main(String [] args) throws IOException, CsvDataTypeMismatchException, CsvRequiredFieldEmptyException{
CSVConnection obj = new CSVConnection();
String sID="32415";
String pseudoID="PAT101830150";
obj.getID(sID,pseudoID);
}
}
and the Java Bean :
public class PatientCSV {
private String patID;
private String pseudoID;
public String getPatID() {
return patID;
}
public void setPatID(String patID) {
this.patID = patID;
}
public String getPseudoID() {
return pseudoID;
}
public void setPseudoID(String pseudoID) {
this.pseudoID = pseudoID;
}
public PatientCSV(String patID, String pseudoID) {
super();
this.patID = patID;
this.pseudoID = pseudoID;
}
public PatientCSV() {
super();
// TODO Auto-generated constructor stub
}
public String toString()
{
return "Patient [id=" + patID + ", pseudoID=" + pseudoID + "]";
}
}
Let's inspect your for loop:
for (Object patObj : patList) {
PatientCSV pat = (PatientCSV) patObj;
if(((PatientCSV) patObj).getPatID().equals(sID)){
return pat.getPseudoID();
}
else
{
PatientCSV pat1 = new PatientCSV();
pat1.setPatID(sID);
pat1.setPseudoID(pseudoID);
patList.add(pat1);
/*Find a way to import it to the CSV*/
bc.write(pat1);
writer.close();
return pseudoID;
}
}
So, in the case you mention as not working as expected, the line that matches your input is the second one:
"Patient_id Pseudo_ID",
"32416","PAT104958880"
"32415","PAT106663926"
So you call: getID("32415", "PAT106663926")
What happens in your loop is:
You take the first element of your CSV patients, the one with id 32416,
and check whether it matches the id given as input to your method, 32415.
It does not match, so it falls into the else branch. There it creates the new patient (with the same patID and pseudoID as the 2nd row of your CSV), writes it to the file, and returns immediately, before the second row is ever examined.
So by now you have two entries in your CSV with the same data "32415","PAT106663926".
I think this is the error: in your loop you should first check all entries for a match, and only create the patient and store it in the CSV if none of them matches.
An example:
PatientCSV foundPatient = null;
for (Object patObj : patList) {
    PatientCSV pat = (PatientCSV) patObj;
    if (pat.getPatID().equals(sID)) {
        foundPatient = pat;
    }
}
if (foundPatient == null) {
    foundPatient = new PatientCSV();
    foundPatient.setPatID(sID);
    foundPatient.setPseudoID(pseudoID);
    patList.add(foundPatient);
    /* Find a way to import it to the CSV */
    bc.write(foundPatient);
    writer.close();
}
return foundPatient.getPseudoID();
P.S. The above example is written very quickly, just to give you the idea what needs to be done.
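One more thing worth tightening up: the reader and writer are only closed on the path that writes a new row. Below is a sketch of the same lookup-or-append logic with try-with-resources, reusing your existing bean, mapping strategy, and the opencsv classes already imported in CSVConnection, so the file handles are closed on every path:

// Sketch: same lookup-or-append logic with resources closed on every path.
public String getID(String sID, String pseudoID) throws Exception {
    String csvFilename = "CsvFile.csv";

    ColumnPositionMappingStrategy<PatientCSV> strategy = new ColumnPositionMappingStrategy<>();
    strategy.setType(PatientCSV.class);
    strategy.setColumnMapping("patID", "pseudoID");

    try (CSVReader csvReader = new CSVReader(new FileReader(csvFilename), ',', '"', 1);
         Writer writer = new FileWriter(csvFilename, true)) {

        List<PatientCSV> patients = new CsvToBean<PatientCSV>().parse(strategy, csvReader);
        for (PatientCSV pat : patients) {
            if (pat.getPatID().equals(sID)) {
                return pat.getPseudoID();          // found: nothing to write
            }
        }

        // not found in any row: append a new one
        PatientCSV newPatient = new PatientCSV(sID, pseudoID);
        StatefulBeanToCsv<PatientCSV> beanToCsv =
                new StatefulBeanToCsvBuilder<PatientCSV>(writer)
                        .withMappingStrategy(strategy).build();
        beanToCsv.write(newPatient);
        return pseudoID;
    }
}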

Load data via HFile into HBase not working

I wrote a mapper to load data from disk via HFiles into HBase. The program runs successfully, but there is no data loaded into my HBase table. Any ideas?
Here's my java program:
protected void writeToHBaseViaHFile() throws Exception {
try {
System.out.println("In try...");
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "XXXX");
Connection connection = ConnectionFactory.createConnection(conf);
System.out.println("got connection");
String inputPath = "/tmp/nuggets_from_Hive/part-00000";
String outputPath = "/tmp/mytemp" + new Random().nextInt(1000);
final TableName tableName = TableName.valueOf("steve1");
System.out.println("got table steve1, outputPath = " + outputPath);
// tag::SETUP[]
Table table = connection.getTable(tableName);
Job job = Job.getInstance(conf, "ConvertToHFiles");
System.out.println("job is setup...");
HFileOutputFormat2.configureIncrementalLoad(job, table,
connection.getRegionLocator(tableName)); // <1>
System.out.println("done configuring incremental load...");
job.setInputFormatClass(TextInputFormat.class); // <2>
job.setJarByClass(Importer.class); // <3>
job.setMapperClass(LoadDataMapper.class); // <4>
job.setMapOutputKeyClass(ImmutableBytesWritable.class); // <5>
job.setMapOutputValueClass(KeyValue.class); // <6>
FileInputFormat.setInputPaths(job, inputPath);
HFileOutputFormat2.setOutputPath(job, new org.apache.hadoop.fs.Path(outputPath));
System.out.println("Setup complete...");
// end::SETUP[]
if (!job.waitForCompletion(true)) {
System.out.println("Failure");
} else {
System.out.println("Success");
}
} catch (Exception e) {
e.printStackTrace();
}
}
Here's my mapper class:
public class LoadDataMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Cell> {
public static final byte[] FAMILY = Bytes.toBytes("pd");
public static final byte[] COL = Bytes.toBytes("bf");
public static final ImmutableBytesWritable rowKey = new ImmutableBytesWritable();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] line = value.toString().split("\t"); // <1>
byte[] rowKeyBytes = Bytes.toBytes(line[0]);
rowKey.set(rowKeyBytes);
KeyValue kv = new KeyValue(rowKeyBytes, FAMILY, COL, Bytes.toBytes(line[1])); // <6>
context.write (rowKey, kv); // <7>
System.out.println("line[0] = " + line[0] + "\tline[1] = " + line[1]);
}
}
I've created the table steve1 in my cluster, but got 0 rows after the program runs successfully:
hbase(main):007:0> count 'steve1'
0 row(s) in 0.0100 seconds
=> 0
What I've tried:
I tried adding print statements in the mapper class to see whether it actually reads the data, but the printouts never showed up in my console.
I'm at a loss as to how to debug this.
Any ideas are greatly appreciated!
This only creates the HFiles; you still need to load them into your table. For example, you need to do something like:
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(new Path(outputPath), admin, hTable, regionLocator);
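For completeness, wired into the variables you already have in writeToHBaseViaHFile it could look roughly like this; the exact doBulkLoad signature varies a bit between HBase versions, so treat it as a sketch:

// After the job finishes, hand the generated HFiles to the region servers.
// connection, conf, table, tableName and outputPath are the ones from
// writeToHBaseViaHFile(); adjust the call to your HBase version.
if (job.waitForCompletion(true)) {
    try (Admin admin = connection.getAdmin();
         RegionLocator regionLocator = connection.getRegionLocator(tableName)) {
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(new org.apache.hadoop.fs.Path(outputPath), admin, table, regionLocator);
    }
    System.out.println("Success, HFiles bulk-loaded");
} else {
    System.out.println("Failure");
}

As for the println calls in the mapper never showing up: when the job runs on the cluster, the mapper's stdout goes to the task container logs (visible through the ResourceManager or JobHistory UI), not to the console of the client that submitted the job, so the absence of output there does not mean the mapper never ran.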

Synchronization Issue while using Apache Storm

I am trying Apache Storm for processing streams of geohash codes, using a geohash library together with Apache Storm 0.9.3.
Currently, I am facing a synchronization issue in the execute method of one bolt class. I have tried using a single bolt, which gives me the correct output, but the moment I go from one bolt thread to two or more, the output gets messed up.
The code snippet for the bolt that is having issues is:
public static int PRECISION=6;
private OutputCollector collector;
BufferedReader br;
String lastGeoHash="NONE";
HashMap<String,Integer> map;
HashMap<String,String[]> zcd;
TreeMap<Integer,String> counts=new TreeMap<Integer,String>();
public void prepare( Map conf, TopologyContext context, OutputCollector collector )
{
String line="";
this.collector = collector;
map=new HashMap<String,Integer>();
zcd=new HashMap<String,String[]>();
try {
br = new BufferedReader(new FileReader("/tmp/zip_code_database.csv"));
int i=0;
while ((line = br.readLine()) != null) {
if(i==0){
String columns[]=line.split(",");
for(int j=0;j<columns.length;j++){
map.put(columns[j],j);
}
}else{
String []split=line.split(",");
zcd.put(split[map.get("\"zip\"")],new String[]{split[map.get("\"state\"")],split[map.get("\"primary_city\"")]});
}
i++;
}
br.close();
// System.out.println(zcd);
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("Initialize");
initializeTreeMapAsPerOurRequirement(counts);
}
public void execute( Tuple tuple )
{
String completeFile = tuple.getStringByField("string");//So, this data is generated by Spout and it contains the complete shape file where each line is separated by a new line character i.e. "\n"
String lines[]=completeFile.split("\t");
String geohash=lines[0];
int count=Integer.parseInt(lines[1]);
String zip=lines[2];
String best="";
String city="";
String state="";
if(!(geohash.equals(lastGeoHash)) && !(lastGeoHash.equals("NONE"))){
//if(counts.size()!=0){
//System.out.println(counts.firstKey());
best=counts.get(counts.lastKey());
//System.out.println(geohash);
if(zcd.containsKey("\""+best+"\"")){
city = zcd.get("\""+best+"\"")[0];
state = zcd.get("\""+best+"\"")[1];
System.out.println(lastGeoHash+","+best+","+state+","+city+","+"US");
}else if(!best.equals("NONE")){
System.out.println(lastGeoHash);
city="MISSING";
state="MISSING";
}
// initializeTreeMapAsPerOurRequirement(counts);
//}else{
//System.out.println("else"+geohash);
//}
//}
}
lastGeoHash=geohash;
counts.put(count, zip);
collector.ack( tuple );
}
private void initializeTreeMapAsPerOurRequirement(TreeMap<Integer,String> counts){
counts.clear();
counts.put(-1,"NONE");
}
public void declareOutputFields( OutputFieldsDeclarer declarer )
{
System.out.println("here");
declarer.declare( new Fields( "number" ) );
}
Topology code is:
public static void main(String[] args)
{
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout( "spout", new SendWholeFileDataSpout(),2);
builder.setBolt( "map", new GeoHashBolt(),2).shuffleGrouping("spout");
builder.setBolt("reduce",new GeoHashReduceBolt(),2).fieldsGrouping("map", new Fields("value"));
Config conf = new Config();
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, builder.createTopology());
Utils.sleep(10000);
cluster.killTopology("test");
cluster.shutdown();
}
Can someone look into the code and guide me a bit?
You have set the parallelism_hint to 2 for your spout and both of your bolts. That means two executors run per component, and because each bolt instance keeps its own lastGeoHash and counts state while only seeing part of the stream, the output gets messed up.
By setting parallelism_hint to 1 you should get your desired output.
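Applied to the topology code from the question, the suggestion is simply:

// One executor per component, so the stateful bolt sees the whole stream.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new SendWholeFileDataSpout(), 1);
builder.setBolt("map", new GeoHashBolt(), 1).shuffleGrouping("spout");
builder.setBolt("reduce", new GeoHashReduceBolt(), 1)
       .fieldsGrouping("map", new Fields("value"));

If you later need more than one executor, tuples that share state (here, tuples for the same geohash) have to be routed to the same bolt instance, for example with a fieldsGrouping on the geohash field rather than shuffleGrouping.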
