Why doesn't updating the broadcast variable work in this sample code? - java

I want to update a broadcast variable every minute, so I used the sample code given by Aastha in this question:
how can I update a broadcast variable in Spark streaming?
But it didn't work. The function updateAndGet() only runs when the streaming application starts. When I debugged my code, it never entered updateAndGet() a second time, so the broadcast variable was not updated every minute.
Why?
Here is my sample code.
public class BroadcastWrapper {
    private Broadcast<List<String>> broadcastVar;
    private Date lastUpdatedAt = Calendar.getInstance().getTime();
    private static BroadcastWrapper obj = new BroadcastWrapper();

    private BroadcastWrapper() {}

    public static BroadcastWrapper getInstance() {
        return obj;
    }

    public JavaSparkContext getSparkContext(SparkContext sc) {
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sc);
        return jsc;
    }

    public Broadcast<List<String>> updateAndGet(JavaStreamingContext jsc) {
        Date currentDate = Calendar.getInstance().getTime();
        long diff = currentDate.getTime() - lastUpdatedAt.getTime();
        if (broadcastVar == null || diff > 60000) { // Let's say we want to refresh every 1 min = 60000 ms
            if (broadcastVar != null)
                broadcastVar.unpersist();
            lastUpdatedAt = new Date(System.currentTimeMillis());
            // Your logic to refresh the reference data
            // List<String> data = getRefData();
            List<String> data = new ArrayList<String>();
            data.add("tang");
            data.add("xiao");
            data.add(String.valueOf(System.currentTimeMillis()));
            broadcastVar = jsc.sparkContext().broadcast(data);
        }
        return broadcastVar;
    }
}
// Here is the computation code submitted to Spark Streaming.
lines.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
    Broadcast<List<String>> blacklist =
            BroadcastWrapper.getInstance().updateAndGet(jsc);

    @Override
    public JavaRDD<String> call(JavaRDD<String> rdd) {
        JavaRDD<String> dd = rdd.filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String word) {
                if (blacklist.getValue().contains(word)) {
                    return false;
                } else {
                    return true;
                }
            }
        });
        return dd;
    }
});
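A likely reason, based on how transform works in Spark Streaming: the field initializer Broadcast<List<String>> blacklist = ...updateAndGet(jsc) runs only once, when the anonymous Function object is constructed, while the call() body of a transform function is evaluated on the driver for every micro-batch. Moving the updateAndGet() call inside call() should therefore give the wrapper a chance to refresh the broadcast once the one-minute interval has elapsed. A minimal sketch of that change, adapted from the code above (not a verified fix):
// Sketch: fetch the (possibly refreshed) broadcast inside call(), which transform()
// evaluates on the driver once per micro-batch, instead of in a field initializer
// that only runs once when the anonymous Function is constructed.
lines.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
    @Override
    public JavaRDD<String> call(JavaRDD<String> rdd) {
        // Re-evaluated every batch; updateAndGet() decides whether a refresh is due.
        final Broadcast<List<String>> blacklist =
                BroadcastWrapper.getInstance().updateAndGet(jsc);
        return rdd.filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String word) {
                return !blacklist.getValue().contains(word);
            }
        });
    }
});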

Related

Why does Spark write Null in a DeltaLake Table?

My Java application uses Spark Structured Streaming connected to a socket server, continuously receiving sensor measurement records (IoT) wrapped in an RDMessage object that carries the message type used for protocol control.
When messages arrive they are checked and converted into a Dataset using Encoder<RDMeasurement> measurementEncoder = Encoders.bean(RDMeasurement.class).
Although the stream is read correctly and the RDMeasurement objects are created correctly, the output is written as null or zero depending on the data type. I see this in the Delta table as well as in the console when I change the format (.format("console")).
What did I miss here? What is going wrong?
See below the most significant segments of the Java code:
public final class SocketRDMeasurement {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession
                .builder()
                .appName("SSSocketRDMeasurement")
                .master("local[*]")
                .getOrCreate();

        Encoder<StringArray> stringArrayEncoder = Encoders.bean(StringArray.class);
        Encoder<RDMessage> messageEncoder = Encoders.bean(RDMessage.class);
        Encoder<RDMeasurement> measurementEncoder = Encoders.bean(RDMeasurement.class);

        Dataset<Row> records = spark
                .readStream()
                .format("socket")
                .option("host", host)
                .option("port", port)
                .load();

        Dataset<String> inputReceived = records.as(Encoders.STRING());
        Dataset<StringArray> input = inputReceived.as(Encoders.STRING())
                .map((MapFunction<String, StringArray>) x ->
                        new StringArray(x),
                        stringArrayEncoder);
        Dataset<RDMessage> messages = input.map(
                (MapFunction<StringArray, RDMessage>)
                        r -> new RDMessage(r), messageEncoder);
        Dataset<RDMeasurement> measurements = messages
                .map((MapFunction<RDMessage, RDMeasurement>) r ->
                        new RDMeasurement(), measurementEncoder);

        // The code executes without warning or error, but despite the
        // objects being created correctly, the output dataset is
        // saved with nulls/NaN.
        StreamingQuery query = measurements.writeStream()
                .outputMode("append")
                .format("delta")
                .option("checkpointLocation",
                        "/opt/data/delta/_checkpoints/ss-socket-rd-measurement")
                .start("/opt/data/delta/ss-socket-rd-measurement");
        query.awaitTermination();
    }
}
public class StringArray implements Serializable {
    private String[] tokens;

    public StringArray(String tokens) {
        this.tokens = tokens.split(",");
    }
    // getters, setters and toString go here
}

public class RDMeasurement implements Serializable {
    private String dataSourceName = null;
    private double dt = 0.0f;
    private double t0 = 0f;
    private double endTimestamp = 0L;
    private double[] valuesArray;

    public RDMeasurement() { }

    public RDMeasurement(String dataSourceName, double t0,
                         double dt, double endTimestamp, double[] valuesArray) {
        this.dataSourceName = dataSourceName;
        this.t0 = t0;
        this.dt = dt;
        this.endTimestamp = endTimestamp;
        this.valuesArray = valuesArray;
    }
    // getters, setters and toString go here
}

public class RDMessage implements Serializable {
    String type;
    RDMeasurement rdMeasurement;

    public RDMessage(String type, RDMeasurement rdMeasurement) {
        this.type = type;
        this.rdMeasurement = rdMeasurement;
    }

    public RDMessage(StringArray stringArray) {
        this(stringArray.getTokens()[0],
                new RDMeasurement(stringArray.getTokens()[1],
                        Double.parseDouble(stringArray.getTokens()[2]),
                        Double.parseDouble(stringArray.getTokens()[3]),
                        Double.parseDouble(stringArray.getTokens()[4]),
                        toDoubleArray(5, stringArray))
        );
    }

    private static double[] toDoubleArray(int skip, StringArray stringArray) {
        double[] ret = new double[stringArray.getTokens().length - 5];
        for (int i = 0; i < stringArray.getTokens().length - 5; i++) {
            ret[i] = Double.parseDouble(stringArray.getTokens()[i + skip]);
        }
        return ret;
    }
    // getters, setters and toString go here
}
Each line of input follows the format below:
V1_start_rd_0,ds_1,1642442598.266,1.0,1642442618.266,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9.00,10.00,11.00,12.00,13.00,14.00,15.00,16.00,17.00,18.00,19.00,20.00
V1_rd_1,ds_2,1642442619.266,1.0,1642442639.266,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9.00,10.00,11.00,12.00,13.00,14.00,15.00,16.00,17.00,18.00,19.00,20.00
V1_rd_2,ds_3,1642442640.266,1.0,1642442660.266,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9.00,10.00,11.00,12.00,13.00,14.00,15.00,16.00,17.00,18.00,19.00,20.00
After refactoring my Java code and adding a debug code segment, I was able to identify the error.
See the refactoring:
StreamingQuery query = dataStreamReader.load()
        .as(Encoders.STRING())
        .map((MapFunction<String, StringArray>) x -> new StringArray(x),
                stringArrayEncoder)
        .map((MapFunction<StringArray, RDMessage>)
                r -> new RDMessage(r), messageEncoder)
        .map((MapFunction<RDMessage, RDMeasurement>) e ->
                e.getRdMeasurement(), measurementEncoder)
        /*
        .map((MapFunction<RDMeasurement, String>) e -> {
            if (e.getDataSourceName() != null) {
                System.out.println("•••> " + e);
            }
            return e.toString();
        }, Encoders.STRING())
        .map((MapFunction<String, RDMeasurement>) s -> new RDMeasurement(s),
                measurementEncoder)
        */
        .writeStream()
        .outputMode("append")
        .format("console")
        .start();
query.awaitTermination();
The commented-out code above allowed me to identify the problem.
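For readers comparing the two versions: the original pipeline mapped each RDMessage to new RDMeasurement() (the no-argument constructor), which would leave every bean field at its default value and would explain the null/zero output, while the refactored pipeline forwards e.getRdMeasurement() instead. A minimal before/after sketch of just that mapping step:
// Original (buggy) step: builds an empty RDMeasurement, so the encoder writes
// only default values (nulls / zeros) to the Delta table.
// Dataset<RDMeasurement> measurements = messages
//         .map((MapFunction<RDMessage, RDMeasurement>) r ->
//                 new RDMeasurement(), measurementEncoder);

// Corrected step, as in the refactored pipeline: forward the measurement that
// the RDMessage constructor already parsed from the input line.
Dataset<RDMeasurement> measurements = messages
        .map((MapFunction<RDMessage, RDMeasurement>) r ->
                r.getRdMeasurement(), measurementEncoder);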

How to cache search results retrieved from auto-complete API

I've been working on a search auto-complete feature with Bootstrap-3-Typeahead, GWT, and data that comes from http://dawa.aws.dk/dok/api via JSONP requests.
The Typeahead creation method is below:
private Typeahead<Location> createTypeAhead() {
typeAhead = new Typeahead<>(new Dataset<Location>() {
@Override
public void findMatches(final String query, final SuggestionCallback<Location> callback) {
requestCounter--;
startSendingRequest = true;
clear.setIcon(IconType.SPINNER);
clear.setIconSpin(true);
final Set<Suggestion<Location>> suggestions = new HashSet<>();
queryLower = query.toLowerCase();
JsonpRequestBuilder jsonpRequestBuilder;
if (!streetSelected) {
jsonpRequestBuilder = new JsonpRequestBuilder();
jsonpRequestBuilder.requestObject("https://dawa.aws.dk/vejnavne/autocomplete?side=1&per_side=500&noformat=1&q=" + queryLower + "*", new AsyncCallback<MyJsArray<VejAutocomplete>>() {
@Override
public void onFailure(Throwable caught) {
Notify.notify("suggestion matches failed");
}
@Override
public void onSuccess(MyJsArray<VejAutocomplete> result) {
Set<Location> locationSet = new LinkedHashSet<>();
for (VejAutocomplete item : result.getAsList()) {
String lowerCase = item.getTekst().toLowerCase();
if (lowerCase.startsWith(queryLower)) {
locationSet.add(new Location(Location.LocationType.STREET, item.getTekst(), item));
locationArrayList.clear();
locationArrayList.addAll(locationSet);
}
}
}
});
}
for (Location address : locationArrayList) {
String value = address.getValue();
Suggestion<Location> s = Suggestion.create(value, address, this);
if (address.getValue().toLowerCase().startsWith(queryLower)) {
suggestions.add(s);
}
}
callback.execute(suggestions);
if (typeAhead.getValue().length() != 0 && queryLower.length() <= 5 && requestCounter < 5 && requestCounter > 0) {
new Timer() {
@Override
public void run() {
findMatches(queryLower, callback);
}
}.schedule(500);
} else {
clear.setIconSpin(false);
clear.setIcon(IconType.CLOSE);
requestCounter = 5;
}
}
});
return typeAhead;
}
The result is like below:
I used recursion to send the request 4-5 times because the suggestion list was not showing for single-letter keywords. It still won't work with some single letters like "s" or "e": the data is successfully retrieved from the API but doesn't appear in the suggestion list, as shown below:
I assume that I should cache all search results and then recreate the auto-complete from scratch, but that becomes complicated.
Any good ideas to solve this problem?
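One way to approach the caching idea mentioned above, sketched under the assumption that results can be keyed by the lower-cased query; the QueryCache class below is hypothetical and not part of the original code (Location is the class from the question):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: cache JSONP results per lower-cased query so repeated
// findMatches() calls (including the retry timer) can reuse data that has
// already been fetched instead of re-requesting it.
public class QueryCache {
    private final Map<String, List<Location>> cache = new HashMap<>();

    public boolean contains(String queryLower) {
        return cache.containsKey(queryLower);
    }

    public List<Location> get(String queryLower) {
        return cache.get(queryLower);
    }

    public void put(String queryLower, List<Location> locations) {
        cache.put(queryLower, new ArrayList<>(locations));
    }
}
Inside findMatches(), one would check the cache before issuing the JsonpRequestBuilder request and populate it in onSuccess(); the existing retry timer would then serve suggestions from the cache once the asynchronous response has arrived.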

Date field change during Retrofit uploads

I am using Retrofit2 to upload SOAP requests. The date automatically changes during the upload process in about 10 out of 1000 requests. I have investigated a lot and pinpointed the issue once I started logging the call with an interceptor.
Log printed before the interceptor:
fromTime:2018-10-13T13:22:00
isDeleted:false
name:Unknown visitor
reason:Unknown reason
toTime:2018-10-13T23:59:59
vrm:ABC
Log printed in the interceptor:
FromTime:2018-10-20T11:59:00
Name:Unknown visitor
Reason:Unknown reason
ToTime:2018-10-13T23:59:59
Vrm:ABC
isDeleted:false
It can be seen that fromTime:2018-10-13T13:22:00 changes to FromTime:2018-10-20T11:59:00.
The build request is as follows:
public EnvelopeSender build() {
OkHttpClient client = new OkHttpClient.Builder().addInterceptor(new EnvelopInterceptor()).build();
Retrofit retrofit = new Retrofit.Builder().client(client)
.baseUrl(url)
.addConverterFactory(SimpleXmlConverterFactory.create(serializer))
.addCallAdapterFactory(SynchronousCallAdapterFactory.create())
.build();
return retrofit.create(EnvelopeSender.class);
}
The serialisation factory is as follows:
public class XMLSerializerFactory {
private Strategy strategy;
private DateFormat format;
private RegistryMatcher registryMatcher;
public Serializer getSerializer() {
if (strategy == null && registryMatcher == null) {
return new Persister();
} else if (strategy == null) {
return new Persister(registryMatcher);
} else if (registryMatcher == null) {
return new Persister(strategy);
} else {
return new Persister(strategy, registryMatcher);
}
}
public XMLSerializerFactory withStrategy() {
strategy = new AnnotationStrategy();
return this;
}
public XMLSerializerFactory withDateFormat() {
format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.UK);
return this;
}
public XMLSerializerFactory withRegistryMatcher() {
if (format == null)
format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.UK);
registryMatcher = new RegistryMatcher();
registryMatcher.bind(Date.class, new DateFormatTransformer(format));
return this;
}
}
public class DateFormatTransformer implements Transform<Date> {
private DateFormat dateFormat;
public DateFormatTransformer(DateFormat dateFormat) {
this.dateFormat = dateFormat;
}
@Override
public Date read(String value) throws Exception {
return dateFormat.parse(value);
}
@Override
public String write(Date value) throws Exception {
return dateFormat.format(value);
}
}
Is it a bug or am I doing something wrong?
Thanks in advance.
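One possibility worth ruling out (an assumption about the setup, not something visible in the code shown): java.text.SimpleDateFormat is not thread-safe, so if the same Persister/DateFormatTransformer instance serializes requests on several threads concurrently, dates from different requests can bleed into each other, which would match an intermittent failure in roughly 10 out of 1000 uploads. A thread-confined variant of the transformer could look like this sketch (Transform is assumed to be the Simple XML interface already used by DateFormatTransformer):
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.simpleframework.xml.transform.Transform;

// Sketch only: give each thread its own SimpleDateFormat so concurrent calls
// sharing one serializer cannot corrupt each other's dates.
public class ThreadSafeDateFormatTransformer implements Transform<Date> {
    private final ThreadLocal<DateFormat> dateFormat = ThreadLocal.withInitial(
            () -> new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.UK));

    @Override
    public Date read(String value) throws Exception {
        return dateFormat.get().parse(value);
    }

    @Override
    public String write(Date value) throws Exception {
        return dateFormat.get().format(value);
    }
}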

Spark streaming transform function

I am getting compilation errors in the transform function for Spark Streaming.
Specifically, I seem to be missing something around finalizing the DStream variable or similar. I copied this from the AMPLab tutorials, so I'm slightly confused...
Here is the code; the problem is in the transform function towards the end.
Here is the error:
[ERROR] /home/nipun/ngla-stable/online/src/main/java/org/necla/ngla/spark_streaming/Type4ViolationChecker.java:[120,63] error:
no suitable method found for transform(<anonymous Function<JavaPairRDD<Long,Integer>,JavaPairRDD<Long,Integer>>>)
[INFO] 1 error
Code:
public class Type4ViolationChecker {
private static final Pattern NEWSPACE = Pattern.compile("\n");
public static Long generateTSKey(String line) throws ParseException{
JSONObject obj = new JSONObject(line);
String time = obj.getString("mts");
DateFormat formatter = new SimpleDateFormat("yyyy / MM / dd HH : mm : ss");
Date date = (Date)formatter.parse(time);
long since = date.getTime();
long key = (long)(since/10000) * 10000;
return key;
}
public static void main(String[] args) {
Type4ViolationChecker obj = new Type4ViolationChecker();
SparkConf sparkConf = new SparkConf().setAppName("Type4ViolationChecker");
final JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(10000));
JavaReceiverInputDStream<String> lines = ssc.socketTextStream(args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterable<String> call(String x) {
return Lists.newArrayList(NEWSPACE.split(x));
}
});
words.persist();
JavaDStream<String> matched = words.filter(new Function<String, Boolean>() {
public Boolean call(String line) {
return line.contains("pattern");
}});
JavaPairDStream<Long, Integer> keyValStream = matched.mapToPair(
new PairFunction<String, Long, Integer>(){
/**
* Here we are converting the string to a key value tuple
* Key -> time bucket calculated using the 1970 GMT date as anchor, and dividing by the polling interval
* Value -> is the original message
*/
@Override
public Tuple2<Long, Integer> call(String arg0)
throws Exception {
// TODO Auto-generated method stub
return new Tuple2<Long,Integer>(generateTSKey(arg0),1);
}
});
JavaPairDStream<Long, Integer> tsStream = keyValStream.reduceByKey(
new Function2<Integer,Integer,Integer>(){
public Integer call(Integer i1, Integer i2){
return i1+ i2;
}});
JavaPairDStream<Long,Integer> sortedtsStream = tsStream.transform(
new Function<JavaPairRDD<Long, Integer>, JavaPairRDD<Long,Integer>>() {
@Override
public JavaPairRDD<Long, Integer> call(JavaPairRDD<Long, Integer> longIntegerJavaPairRDD) throws Exception {
return longIntegerJavaPairRDD.sortByKey(false);
}
});
//sortedtsStream.print();
ssc.start();
ssc.awaitTermination();
}
}
Thanks to @GaborBakos for providing the answer...
The following seems to work! I had to use transformToPair instead of transform:
JavaPairDStream<Long,Integer> sortedtsStream = tsStream.transformToPair(
new Function<JavaPairRDD<Long, Integer>, JavaPairRDD<Long,Integer>>() {
@Override
public JavaPairRDD<Long, Integer> call(JavaPairRDD<Long, Integer> longIntegerJavaPairRDD) throws Exception {
return longIntegerJavaPairRDD.sortByKey(true);
}
});
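A brief note on why the change helps (general Spark Java API behavior, not from the original answer): on JavaPairDStream, transform takes a function that returns a plain JavaRDD and yields a JavaDStream, whereas transformToPair takes a function that returns a JavaPairRDD and yields a JavaPairDStream, which is what the sort-by-key result needs. In simplified sketch form:
// Simplified sketch of the relevant JavaPairDStream<K, V> signatures:
// JavaDStream<U>          transform(Function<JavaPairRDD<K, V>, JavaRDD<U>> f)
// JavaPairDStream<K2, V2> transformToPair(Function<JavaPairRDD<K, V>, JavaPairRDD<K2, V2>> f)
//
// sortByKey(...) returns a JavaPairRDD<Long, Integer>, so only transformToPair matches.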

Join with Hadoop in Java [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 10 years ago.
I've been working with Hadoop for a short time and am trying to implement a join in Java. It doesn't matter whether it is map-side or reduce-side. I chose a reduce-side join since it was supposed to be easier to implement. I know that Java is not the best choice for joins, aggregations, etc., and that Hive or Pig would be a better pick, which I have already done. However, I'm working on a research project and I have to use all three of those languages in order to deliver a comparison.
Anyway, I have two input files with different structures. One is key|value and the other is key|value1;value2;value3;value4. One record from each input file looks like the following:
Input1: 1;2010-01-10T00:00:01
Input2: 1;23;Blue;2010-01-11T00:00:01;9999-12-31T23:59:59
I followed the example in the Hadoop: The Definitive Guide book, but it didn't work for me. I'm posting my Java files here so you can see what's wrong.
public class LookupReducer extends Reducer<TextPair,Text,Text,Text> {
private String result = "";
private String msisdn;
private String attribute, product;
private long trans_dt_long, start_dt_long, end_dt_long;
private String trans_dt, start_dt, end_dt;
@Override
public void reduce(TextPair key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
context.progress();
//value without key to remember
Iterator<Text> iter = values.iterator();
for (Text val : values) {
Text recordNoKey = val; //new Text(iter.next());
String valSplitted[] = recordNoKey.toString().split(";");
//if the input is coming from CDR set corresponding values
if(key.getSecond().toString().equals(CDR.CDR_TAG))
{
trans_dt = recordNoKey.toString();
trans_dt_long = dateToLong(recordNoKey.toString());
}
//if the input is coming from Attributes set corresponding values
else if(key.getSecond().toString().equals(Attribute.ATT_TAG))
{
attribute = valSplitted[0];
product = valSplitted[1];
start_dt = valSplitted[2];
start_dt_long = dateToLong(valSplitted[2]);
end_dt = valSplitted[3];
end_dt_long = dateToLong(valSplitted[3]);
}
Text record = val; //iter.next();
//System.out.println("RECORD: " + record);
Text outValue = new Text(recordNoKey.toString() + ";" + record.toString());
if(start_dt_long < trans_dt_long && trans_dt_long < end_dt_long)
{
//concat output columns
result = attribute + ";" + product + ";" + trans_dt;
context.write(key.getFirst(), new Text(result));
System.out.println("KEY: " + key);
}
}
}
private static long dateToLong(String date){
DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date parsedDate = null;
try {
parsedDate = formatter.parse(date);
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
long dateInLong = parsedDate.getTime();
return dateInLong;
}
public static class TextPair implements WritableComparable<TextPair> {
private Text first;
private Text second;
public TextPair(){
set(new Text(), new Text());
}
public TextPair(String first, String second){
set(new Text(first), new Text(second));
}
public TextPair(Text first, Text second){
set(first, second);
}
public void set(Text first, Text second){
this.first = first;
this.second = second;
}
public Text getFirst() {
return first;
}
public void setFirst(Text first) {
this.first = first;
}
public Text getSecond() {
return second;
}
public void setSecond(Text second) {
this.second = second;
}
@Override
public void readFields(DataInput in) throws IOException {
// TODO Auto-generated method stub
first.readFields(in);
second.readFields(in);
}
@Override
public void write(DataOutput out) throws IOException {
// TODO Auto-generated method stub
first.write(out);
second.write(out);
}
@Override
public int hashCode(){
return first.hashCode() * 163 + second.hashCode();
}
@Override
public boolean equals(Object o){
if(o instanceof TextPair)
{
TextPair tp = (TextPair) o;
return first.equals(tp.first) && second.equals(tp.second);
}
return false;
}
@Override
public String toString(){
return first + ";" + second;
}
@Override
public int compareTo(TextPair tp) {
// TODO Auto-generated method stub
int cmp = first.compareTo(tp.first);
if(cmp != 0)
return cmp;
return second.compareTo(tp.second);
}
public static class FirstComparator extends WritableComparator {
protected FirstComparator(){
super(TextPair.class, true);
}
@Override
public int compare(WritableComparable comp1, WritableComparable comp2){
TextPair pair1 = (TextPair) comp1;
TextPair pair2 = (TextPair) comp2;
int cmp = pair1.getFirst().compareTo(pair2.getFirst());
if(cmp != 0)
return cmp;
return -pair1.getSecond().compareTo(pair2.getSecond());
}
}
public static class GroupComparator extends WritableComparator {
protected GroupComparator()
{
super(TextPair.class, true);
}
@Override
public int compare(WritableComparable comp1, WritableComparable comp2)
{
TextPair pair1 = (TextPair) comp1;
TextPair pair2 = (TextPair) comp2;
return pair1.compareTo(pair2);
}
}
}
}
public class Joiner extends Configured implements Tool {
public static final String DATA_SEPERATOR =";"; //Define the symbol for seperating the output data
public static final String NUMBER_OF_REDUCER = "1"; //Define the number of the used reducer jobs
public static final String COMPRESS_MAP_OUTPUT = "true"; //if the output from the mapping process should be compressed, set COMPRESS_MAP_OUTPUT = "true" (if not set it to "false")
public static final String
USED_COMPRESSION_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"; //set the used codec for the data compression
public static final boolean JOB_RUNNING_LOCAL = true; //if you run the Hadoop job on your local machine, you have to set JOB_RUNNING_LOCAL = true
//if you run the Hadoop job on the Telefonica Cloud, you have to set JOB_RUNNING_LOCAL = false
public static final String OUTPUT_PATH = "/home/hduser"; //set the folder, where the output is saved. Only needed, if JOB_RUNNING_LOCAL = false
public static class KeyPartitioner extends Partitioner<TextPair, Text> {
@Override
public int getPartition(/*[*/TextPair key/*]*/, Text value, int numPartitions) {
System.out.println("numPartitions: " + numPartitions);
return (/*[*/key.getFirst().hashCode()/*]*/ & Integer.MAX_VALUE) % numPartitions;
}
}
private static Configuration hadoopconfig() {
Configuration conf = new Configuration();
conf.set("mapred.textoutputformat.separator", DATA_SEPERATOR);
conf.set("mapred.compress.map.output", COMPRESS_MAP_OUTPUT);
//conf.set("mapred.map.output.compression.codec", USED_COMPRESSION_CODEC);
conf.set("mapred.reduce.tasks", NUMBER_OF_REDUCER);
return conf;
}
@Override
public int run(String[] args) throws Exception {
// TODO Auto-generated method stub
if ((args.length != 3) && (JOB_RUNNING_LOCAL)) {
System.err.println("Usage: Lookup <CDR-inputPath> <Attribute-inputPath> <outputPath>");
System.exit(2);
}
//starting the Hadoop job
Configuration conf = hadoopconfig();
Job job = new Job(conf, "Join cdrs and attributes");
job.setJarByClass(Joiner.class);
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CDRMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, AttributeMapper.class);
//FileInputFormat.addInputPath(job, new Path(otherArgs[0])); //expecting a folder instead of a file
if(JOB_RUNNING_LOCAL)
FileOutputFormat.setOutputPath(job, new Path(args[2]));
else
FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
job.setPartitionerClass(KeyPartitioner.class);
job.setGroupingComparatorClass(TextPair.FirstComparator.class);
job.setReducerClass(LookupReducer.class);
job.setMapOutputKeyClass(TextPair.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Joiner(), args);
System.exit(exitCode);
}
}
public class Attribute {
public static final String ATT_TAG = "1";
public static class AttributeMapper
extends Mapper<LongWritable, Text, TextPair, Text>{
private static Text values = new Text();
//private Object output = new Text();
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//partition the input line by the separator semicolon
String[] attributes = value.toString().split(";");
String valuesInString = "";
if(attributes.length != 5)
System.err.println("Input column number not correct. Expected 5, provided " + attributes.length
+ "\n" + "Check the input file");
if(attributes.length == 5)
{
//setting the values with the input values read above
valuesInString = attributes[1] + ";" + attributes[2] + ";" + attributes[3] + ";" + attributes[4];
values.set(valuesInString);
//writing out the key and value pair
context.write( new TextPair(new Text(String.valueOf(attributes[0])), new Text(ATT_TAG)), values);
}
}
}
}
public class CDR {
public static final String CDR_TAG = "0";
public static class CDRMapper
extends Mapper<LongWritable, Text, TextPair, Text>{
private static Text values = new Text();
private Object output = new Text();
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//partition the input line by the separator semicolon
String[] cdr = value.toString().split(";");
//setting the values with the input values read above
values.set(cdr[1]);
//output = CDR_TAG + cdr[1];
//writing out the key and value pair
context.write( new TextPair(new Text(String.valueOf(cdr[0])), new Text(CDR_TAG)), values);
}
}
}
I would be glad if anyone could at least post a link to a tutorial or a simple example where such join functionality is implemented. I searched a lot, but either the code was not complete or there was not enough explanation.
To be honest, I have no idea what your code is trying to do, but that's probably because I'd do it in a different way and I'm not familiar with the APIs you're using.
I would start from scratch as follows:
Create a mapper to read one of the files. For each line read, write a key value pair to the context. The key is a Text created from the key and the value is another Text created by concatenating a "1" with the entire input record.
Create another mapper for the other file. This mapper acts just like the first mapper, but the value is a Text created by concatenating a "2" with the entire input record.
Write a reducer to do the join. The reduce() method will get all records written for a specific key. You can tell which input file (and therefore the data format for the record) by looking to see whether the value starts with a "1" or a "2". Once you know whether or not you have one, the other or both record types, you can write whatever logic you need to merge the data from the one or two records.
By the way, you use the MultipleInputs class to configure more than one mapper in your job/driver class.
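A minimal sketch of the approach described above, using the question's "key;value" and "key;value1;value2;value3;value4" layouts; the class and field names below are illustrative, not from the original answer, and the merge logic is a placeholder:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the tag-based reduce-side join described above.
public class TaggedJoin {

    // Mapper for the first file: tag each record with "1".
    public static class FirstFileMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(";", 2);
            context.write(new Text(parts[0]), new Text("1" + value.toString()));
        }
    }

    // Mapper for the second file: tag each record with "2".
    public static class SecondFileMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(";", 2);
            context.write(new Text(parts[0]), new Text("2" + value.toString()));
        }
    }

    // Reducer: all records for one key arrive together; the leading tag tells
    // us which file (and therefore which layout) each value came from.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String fromFirst = null;
            String fromSecond = null;
            for (Text value : values) {
                String tagged = value.toString();
                if (tagged.startsWith("1")) {
                    fromFirst = tagged.substring(1);
                } else if (tagged.startsWith("2")) {
                    fromSecond = tagged.substring(1);
                }
            }
            if (fromFirst != null && fromSecond != null) {
                // Merge logic goes here; this sketch simply concatenates the records.
                context.write(key, new Text(fromFirst + ";" + fromSecond));
            }
        }
    }
}
As the answer notes, MultipleInputs.addInputPath(...) in the driver is what wires each input path to its own mapper class.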
