multiple inputs and grouping comparator - java

I have inputs from two sources:
map output in the form,
output.collect(new StockKey(new Text(x+" "+id), new Text(id2)), new Text(data));
map output in the form,
output.collect(new StockKey(new Text(x+" "+id), new Text("1")), new Text(data));
Job conf:
conf.setPartitionerClass(CustomPartitioner.class);
conf.setOutputValueGroupingComparator(StockKeyGroupingComparator.class);
where StockKey is a custom key class holding two Text fields (symbol and timestamp).
Constructor:
public StockKey() {
    this.symbol = new Text();
    this.timestamp = new Text();
}
Grouping comparator:
public class StockKeyGroupingComparator extends WritableComparator {
    protected StockKeyGroupingComparator() {
        super(StockKey.class, true);
    }

    public int compare(WritableComparable w1, WritableComparable w2) {
        StockKey k1 = (StockKey) w1;
        StockKey k2 = (StockKey) w2;

        Text x1 = new Text(k1.getSymbol());
        Text x2 = new Text(k2.getSymbol());

        return x1.compareTo(x2);
    }
}
But I'm not receiving the map output values from both inputs; only the map output value from one input reaches the reducer. I want the records whose symbol, i.e. new Text(x+" "+id), is common to both map outputs to be grouped to the same reducer. I am stuck here.
Please help!

To do this you need a Partitioner which fits in as follows:
Your mappers output a bunch of records as key/value pairs
For each record, the partitioner is passed the key, the value and the number of reducers. The partitioner decides which reducer will handle the record
The records are shipped off to their respective partitions (reducers)
The GroupingComparator is run to decide which key value pairs get grouped into an iterable for a single call to the reducer() method
and so on...
I think the default partitioner is choosing the reducer partition for each record based on the entire value of your key (that's the default behavior). But you want records grouped by only part of the key (just the symbol and not the symbol and timestamp). So you need to write a partitioner that does this and specify/configure it in the driver class.
Once you do that, your grouping comparator should help group the records as you've intended.
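For illustration, here is a minimal sketch of what such a partitioner could look like, assuming the old mapred API (to match your conf.setPartitionerClass call) and that StockKey exposes getSymbol() as in your grouping comparator; the class name is just a placeholder:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch only: partitions on the symbol part of the key so that records
// sharing a symbol (from either mapper) land in the same reduce partition.
public class SymbolPartitioner implements Partitioner<StockKey, Text> {

    @Override
    public void configure(JobConf job) {
        // nothing to configure
    }

    @Override
    public int getPartition(StockKey key, Text value, int numPartitions) {
        // hash only the symbol, not the whole (symbol, timestamp) key
        return (key.getSymbol().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}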
EDIT: random thoughts
You might make things easier on yourself if you moved the timestamp to the value, making the key simple (just the symbol) and the value complex (timestamp and value). Then you wouldn't need a partitioner or a grouping comparator.
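For instance, a rough sketch of that simpler layout (variable names are illustrative, not taken from your code):
// Emit the symbol alone as the key and fold the timestamp into the value;
// the default partitioning and grouping then already keep symbols together.
Text symbolKey = new Text(x + " " + id);
Text compositeValue = new Text(timestamp + "\t" + data);
output.collect(symbolKey, compositeValue);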
You didn't say either way, but you did use the MultipleInputs class, right? That's the only way to invoke two or more mappers for the same job.

Related

Partition Strategy for applying multiple JOINs on a Flink DataSet

I am using Flink 1.4.0.
Suppose I have a POJO as follows:
public class Rating {
    public String name;
    public String labelA;
    public String labelB;
    public String labelC;
    ...
}
and a JOIN function:
public class SetLabelA implements JoinFunction<Tuple2<String, Rating>, Tuple2<String, String>, Tuple2<String, Rating>> {
    @Override
    public Tuple2<String, Rating> join(Tuple2<String, Rating> rating, Tuple2<String, String> labelA) {
        rating.f1.setLabelA(labelA.f1);
        return rating;
    }
}
and suppose I want to apply a JOIN operation to set the values of each field in a DataSet<Tuple2<String, Rating>>, which I can do as follows:
DataSet<Tuple2<String, Rating>> ratings = // [...]
DataSet<Tuple2<String, Double>> aLabels = // [...]
DataSet<Tuple2<String, Double>> bLabels = // [...]
DataSet<Tuple2<String, Double>> cLabels = // [...]
...
DataSet<Tuple2<String, Rating>> newRatings =
    ratings.leftOuterJoin(aLabels, JoinOperatorBase.JoinHint.REPARTITION_SORT_MERGE)
        // key of the first input
        .where("f0")
        // key of the second input
        .equalTo("f0")
        // applying the JoinFunction on joining pairs
        .with(new SetLabelA());
Unfortunately, this is necessary as both ratings and all xLabels are very big DataSets and I am forced to look into each of the xlabels to find the field values I require, while at the same time it is not the case that all rating keys exist in each xlabels.
This practically means that I have to perform a leftOuterJoin per xlabel, for which I need to also create the respective JoinFunction implementation that utilises the correct setter from the Rating POJO.
Is there a more efficient way to solve this that anyone can think of?
As far as the partitioning strategy goes, I have made sure to sort the DataSet<Tuple2<String, Rating>> ratings with:
DataSet<Tuple2<String, Rating>> sorted_ratings = ratings.sortPartition(0, Order.ASCENDING).setParallelism(1);
By setting parallelism to 1 I can be sure that the whole dataset will be ordered. I then use .partitionByRange:
DataSet<Tuple2<String, Rating>> partitioned_ratings = sorted_ratings.partitionByRange(0).setParallelism(N);
where N is the number of cores I have on my VM. Another side question I have here is whether the first .setParallelism which is set to 1 is restrictive in terms of how the rest of the pipeline is executed, i.e. can the follow up .setParallelism(N) change how the DataSet is processed?
Finally, I did all these so that when partitioned_ratings is joined with a xlabels DataSet, the JOIN operation will be done with JoinOperatorBase.JoinHint.REPARTITION_SORT_MERGE. According to Flink docs for v.1.4.0:
REPARTITION_SORT_MERGE: The system partitions (shuffles) each input (unless the input is already partitioned) and sorts each input (unless it is already sorted). The inputs are joined by a streamed merge of the sorted inputs. This strategy is good if one or both of the inputs are already sorted.
So in my case, ratings is sorted (I think) and each of the xlabels DataSets are not, hence it makes sense that this is the most efficient strategy. Anything wrong with this? Any alternative approaches?
So far I haven't been able to make this strategy work. It seems like relying on JOINs is too troublesome, as they are expensive operations that one should avoid unless they are really necessary.
For instance, JOINs should be used if both Datasets are very big in size. If they are not, a convenient alternative is the use of broadcast variables, by which one of the two Datasets (the smaller one) is broadcast across workers for whatever purpose it is needed. An example appears below (copied from this link for convenience):
DataSet<Point> points = env.readCsvFile(...);
DataSet<Centroid> centroids = ...; // some computation
points.map(new RichMapFunction<Point, Integer>() {
    private List<Centroid> centroids;

    @Override
    public void open(Configuration parameters) {
        this.centroids = getRuntimeContext().getBroadcastVariable("centroids");
    }

    @Override
    public Integer map(Point p) {
        return selectCentroid(centroids, p);
    }
}).withBroadcastSet(centroids, "centroids");
Also, since populating the fields of a POJO implies that quite similar code will be leveraged repeatedly, one should definitely use jlens to avoid code repetition and write a more concise and easy to follow solution.
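Applied to the Rating case above, a rough sketch of the broadcast approach could look like the following. It assumes aLabels is small enough to broadcast, uses the Tuple2<String, String> label type from the SetLabelA function, and the broadcast name, map variable, and null check are my own additions (imports omitted as in the example above):
// Sketch: broadcast aLabels and set labelA in a RichMapFunction instead of a join.
DataSet<Tuple2<String, Rating>> withLabelA = ratings
    .map(new RichMapFunction<Tuple2<String, Rating>, Tuple2<String, Rating>>() {
        private Map<String, String> labelAByKey;

        @Override
        public void open(Configuration parameters) {
            List<Tuple2<String, String>> broadcast =
                getRuntimeContext().getBroadcastVariable("aLabels");
            labelAByKey = new HashMap<>();
            for (Tuple2<String, String> t : broadcast) {
                labelAByKey.put(t.f0, t.f1);
            }
        }

        @Override
        public Tuple2<String, Rating> map(Tuple2<String, Rating> rating) {
            String label = labelAByKey.get(rating.f0);
            if (label != null) {
                rating.f1.setLabelA(label); // mirrors the leftOuterJoin: unmatched keys keep their old value
            }
            return rating;
        }
    })
    .withBroadcastSet(aLabels, "aLabels");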

Spark flatMap/reduce: How to scale and avoid OutOfMemory?

I am migrating some map-reduce code into Spark, and having problems when constructing an Iterable to return in the function.
In the MR code, I had a reduce function that grouped by key and then (using MultipleOutputs) iterated the values and wrote them out (to multiple outputs, but that's unimportant), with code like this (simplified):
reduce(Key key, Iterable<Text> values) {
    // ... some code
    for (Text xml : values) {
        multipleOutputs.write(key, xml, directory);
    }
}
However, in Spark I have translated a map and this reduce into a sequence of:
mapToPair -> groupByKey -> flatMap
as recommended... in some book.
mapToPair basically adds a Key via functionMap, which based on some values on the record creates a Key for that record. Sometimes a key may have very high cardinality.
JavaPairRDD<Key, String> rddPaired = inputRDD.mapToPair(new PairFunction<String, Key, String>() {
    public Tuple2<Key, String> call(String value) {
        //...
        return functionMap.call(value);
    }
});
An RDD.groupByKey() is then applied to rddPaired to get the RDD that feeds the flatMap function:
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.groupByKey();
Once grouped, a flatMap call does the reduce. Here, operation is a transformation:
public Iterable<String> call(Tuple2<Key, Iterable<String>> keyValue) {
    // some code...
    List<String> out = new ArrayList<String>();
    if (someConditionOnKey) {
        // do a logic
        Grouper grouper = new Grouper();
        for (String xml : keyValue._2()) {
            // group in a separate class
            grouper.add(xml);
        }
        // operation is now performed on the whole group
        out.add(operation(grouper));
    } else {
        for (String xml : keyValue._2()) {
            out.add(operation(xml));
        }
    }
    return out;
}
It works fine... with keys that don't have too many records. Actually, it breaks with an OutOfMemoryError when a key with a lot of values enters the "else" branch of the reduce.
Note: I have included the "if" part to explain the logic I want to produce, but the failure happens when entering the "else"... because when data enters the "else", it normally means there will be many more values for that key, due to the nature of the data.
It is clear that, having to keep all of the grouped values in the "out" list, it won't scale if a key has millions of records, because it keeps them all in memory. I have reached the point where the OOM happens (yes, it's when performing the "operation" above, which asks for memory and none is given; it's not a very memory-expensive operation though).
Is there any way to avoid this in order to scale? Either by replicating the behaviour with some other primitives to reach the same output in a more scalable way, or by being able to hand Spark the values for merging (just as I used to do with MR)...
It's inefficient to do the conditional inside the flatMap operation. You should check the condition outside, create 2 distinct RDDs, and deal with them separately.
rddPaired.cache();
// groupFilterFunc will filter which items need grouping
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.filter(groupFilterFunc).groupByKey();
// processGroupedValuesFunction should call `operation` on the group of all values with the same key and return the result
JavaPairRDD<Key, String> groupedResults = rddGrouped.mapValues(processGroupedValuesFunction);
// nogroupFilterFunc will filter which items don't need grouping
JavaPairRDD<Key, String> rddNoGrouped = rddPaired.filter(nogroupFilterFunc);
// processNoGroupedValuesFunction2 should call `operation` on a single value and return the result
JavaPairRDD<Key, String> noGroupResults = rddNoGrouped.mapValues(processNoGroupedValuesFunction2);
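For completeness, a rough sketch of how the two branches could be wired up with Java 8 lambdas. Here needsGrouping is a hypothetical stand-in for whatever someConditionOnKey actually checks, and operation(...) and Grouper are assumed to be accessible and serializable:
// Split by the key condition first, then process each branch with the right shape.
JavaPairRDD<Key, String> toGroup = rddPaired.filter(t -> needsGrouping(t._1()));
JavaPairRDD<Key, String> noGroup = rddPaired.filter(t -> !needsGrouping(t._1()));

// grouped branch: one call to operation(...) per key, over the whole group
JavaRDD<String> groupedResults = toGroup
    .groupByKey()
    .map(t -> {
        Grouper grouper = new Grouper();
        for (String xml : t._2()) {
            grouper.add(xml);
        }
        return operation(grouper);
    });

// non-grouped branch: one call to operation(...) per record, no grouping needed
JavaRDD<String> singleResults = noGroup.map(t -> operation(t._2()));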

Hadoop Mapreduce : values to reducer are in reverse order

I will be doing the following on a much bigger file. For now, I have an example input file with the following values.
1000,SMITH,JERRY
1001,JOHN,TIA
1002,TWAIN,MARK
1003,HARDY,DENNIS
1004,CHILD,JACK
1005,CHILD,NORTON
1006,DAVIS,JENNY
1007,DAVIS,KAREN
1008,MIKE,JOHN
1009,DENNIS,SHERIN
Now what I am doing is running a MapReduce job to encrypt the last name of each record and write back an output, and I am using the mapper partition number as the key and the modified text as the value.
So the output from the mapper will be:
0 1000,Mj4oJyk=,JERRY
0 1001,KzwpPQ,TIA
0 1002,NSQgOi8,MARK
0 1003,KTIzNzg,DENNIS
0 1004,IjsoPyU,JACK
0 1005,IjsoPyU,NORTON
0 1006,JTI3OjI,JENNY
0 1007,JTI3OjI,KAREN
0 1008,LDoqNg,JOHN
0 1009,JTYvPSgg,SHERIN
I don't want any sorting to be done. I also use a reducer because, in the case of a larger file, there will be multiple mappers and, with no reducer, multiple output files will be written; so I use a single reducer to merge values from all mappers and write to a single file.
Now the input values to the reducer come in reversed order, not in the order from the mapper. It is like the following:
1009,JTYvPSgg,SHERIN
1008,LDoqNg==,JOHN
1007,JTI3OjI=,KAREN
1006,JTI3OjI=,JENNY
1005,IjsoPyU=,NORTON
1004,IjsoPyU=,JACK
1003,KTIzNzg=,DENNIS
1002,NSQgOi8=,MARK
1001,KzwpPQ==,TIA
1000,Mj4oJyk=,JERRY
Why is it reversing the order? And how can I maintain the same order from the mapper? Any suggestions will be helpful.
EDIT 1:
The driver code is:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJobName("encrypt");
job.setJarByClass(TestDriver.class);
job.setMapperClass(TestMap.class);
job.setNumReduceTasks(1);
job.setReducerClass(TestReduce.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(hdfsInputPath));
FileOutputFormat.setOutputPath(job, new Path(hdfsOutputPath));
System.exit(job.waitForCompletion(true) ? 0 : 1);
The mapper code is:
inputValues = value.toString().split(",");
stringBuilder = new StringBuilder();
TaskID taskId = context.getTaskAttemptID().getTaskID();
int partition = taskId.getId();
// the mask(inputValues) method is called to encrypt input values and write to stringBuilder in appropriate format
mask(inputValues);
context.write(new IntWritable(partition), new Text(stringBuilder.toString()));
The reducer code is,
for (Text value : values) {
    context.write(new Text(value), null);
}
The base idea of MapReduce is that the order in which things are done is irrelevant.
So you cannot (and do not need to) control the order in which
the input records go through the mappers.
the key and related values go through the reducers.
The only thing you can control is the order in which the values are placed in the iterator that is made available in the reducer.
For that you can use the Object key to maintain the order of values.
The LongWritable part (or the key) is the position of the line in the file (not the line number, but the byte offset from the start of the file).
You can use that part to keep track of which line was first.
Then your mapper code will be changed to
protected void map(Object key, Text value, Mapper<Object, Text, LongWritable, Text>.Context context)
        throws IOException, InterruptedException {
    inputValues = value.toString().split(",");
    stringBuilder = new StringBuilder();
    // the mask(inputValues) method is called to encrypt input values and write to stringBuilder in appropriate format
    mask(inputValues);
    // keep the original byte offset as the key so the reducer sees the lines in file order
    context.write(new LongWritable(((LongWritable) key).get()), new Text(stringBuilder.toString()));
}
Note: you can change all IntWritable to LongWritable in your code but be careful.
inputValues = value.toString().split(",");
stringBuilder = new StringBuilder();
TaskID taskId = context.getTaskAttemptID().getTaskID();
// preserve the number value for sorting
IntWritable idNumber = new IntWritable(Integer.parseInt(inputValues[0]));
// the mask(inputValues) method is called to encrypt input values and write to stringBuilder in appropriate format
mask(inputValues);
context.write(idNumber, new Text(stringBuilder.toString()));
I made some assumptions because you did not post the full code of the mapper. I assumed that inputValues was a String array based on the toString() output. The first value of the array should be the number value from your input, however it is now a String. You must convert the number back to an IntWritable to match the <IntWritable, Text> pair your mapper is emitting. The Hadoop framework will sort by key, and with the key being of type IntWritable it will sort in ascending order. The code you provided uses the task ID, and from reading the API (https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/TaskAttemptID.html#getTaskID()) it was unclear whether this would give your values the order you desire. To control the order of the output I would recommend using the first value of your string array and converting it to an IntWritable. I don't know if this violates your intent to mask the inputValues.
EDIT
To follow up on your comment: you can simply multiply the partition by -1; this will cause the Hadoop framework to reverse the order.
int partition = -1 * taskId.getId();

Hadoop - Explicit grouping of Mapper result

I have written mapper code in which the key is emitted as an IntTextPair. I want to group the mapper result by just the Int from the IntTextPair, like:
[1 Shanghai]
[1 Test]
[2 Set]
and the mapper result should be grouped as:
[1 Shanghai, Test]
[2 Set]
I have implemented the Comparator class:
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
public class GroupByInput extends WritableComparator {
    public GroupByInput() {
        super(IntTextPair.class, true);
    }

    @Override
    public int compare(WritableComparable it1, WritableComparable it2) {
        IntTextPair Pair1 = (IntTextPair) it1;
        IntTextPair Pair2 = (IntTextPair) it2;
        return Pair1.getFirst().compareTo(Pair2.getFirst());
    }
}
and in the job configuration I have set the comparator class like this:
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setGroupingComparatorClass(GroupByInput.class);
Am I going in the right direction? I need some assistance.
You can't merge/consolidate the keys as you currently outlined. What's the current Mapper output value type/class? Is there a reason why you can't output a K/V pair from the mapper?
If you do have another class / type currently being output from the mapper as the Value component, then you can still somewhat achieve this by:
Your GroupComparator looks good; paired with the ordering of IntTextPair, it means that all keys with the same Int component will be sent to the same reducer.
In your reducer, as you iterate the values, you can examine the key to determine the unique list of Text components of the key (see the sketch after this list).
It's not very well known that as you iterate the values in the reducer, the contents of the key are updated too; with your grouper the Int component will always be the same for a particular reduce call, but the Text component can change.
As the keys are ordered, you can keep track of the previous Text component value (be sure to COPY the contents before you iterate to the next value in the values iterable).
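A rough sketch of that reducer pattern follows. It assumes IntTextPair exposes getFirst()/getSecond(), that the map output value type is Text, and that the job's output types allow a Text/NullWritable pair; all of those are placeholders, not your actual classes:
// Sketch: collect the distinct Text components seen for one Int group.
// The key object is reused by Hadoop, so its Text part must be read (copied)
// on every iteration of the values loop.
@Override
protected void reduce(IntTextPair key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    List<String> texts = new ArrayList<>();
    String previous = null;
    for (Text ignored : values) {
        String current = key.getSecond().toString(); // toString() copies the bytes
        if (!current.equals(previous)) {
            texts.add(current);
            previous = current;
        }
    }
    // e.g. "1   Shanghai, Test"
    context.write(new Text(key.getFirst() + "\t" + String.join(", ", texts)), NullWritable.get());
}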

Best way to save some data and then retrieve it

I have a project where I save some data coming from different channels of a Soap Service, for example:
String_Value Long_timestamp Double_value String_value String_value Int_value
I can have many lines (i.e. 200), with different values, like the one above.
I thought that I could use an ArrayList; however, the data can have a different structure than the one above, so an ArrayList maybe isn't a good solution for retrieving data from it.
In the example above I have, after the first two values that are always fixed, 4 values, but in another channel I may have 3 or 5 values. When I want to retrieve data, I must know how many values a particular line has, and I think that an ArrayList doesn't help me.
What solution could I use?
When you have a need to uniquely identify varying length input, a HashMap usually works quite well. For example, you can have a class:
public class Record
{
    private HashMap<String, String> values;

    public Record()
    {
        // create your hashmap.
        values = new HashMap<String, String>();
    }

    public String getData(String key)
    {
        return values.get(key);
    }

    public void addData(String key, String value)
    {
        values.put(key, value);
    }
}
With this type of structure, you can save as many different values as you want. What I would do is loop through each value passed from SOAP and simply add it to the Record, then keep a list of Record objects.
Record rec = new Record();
rec.addData("timestamp", timestamp);
rec.addData("Value", value);
rec.addData("Plans for world domination", dominationPlans);
You could build classes representing the entities and then build a parser... If it isn't in a standard format (e.g. JSON, YAML, etc.) you have no choice but to develop your own parser.
Create a class with fields.
class ClassName {
    int numberOfValues;
    String dataString;
    ...
}
Now create an ArrayList of that class, like ArrayList<ClassName>, and for each record fill a ClassName object with numberOfValues and dataString and add it to the ArrayList.
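A minimal sketch of that idea (class, field, and sample values are placeholders):
class ChannelRecord {
    int numberOfValues;
    String dataString;

    ChannelRecord(int numberOfValues, String dataString) {
        this.numberOfValues = numberOfValues;
        this.dataString = dataString;
    }
}

// one entry per line received from the service
ArrayList<ChannelRecord> records = new ArrayList<>();
records.add(new ChannelRecord(4, "chan1 1495545600000 3.14 a b 7"));
records.add(new ChannelRecord(3, "chan2 1495545601000 2.71 x 9"));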
