Hadoop - Explicit grouping of Mapper result

I have written mapper code that emits an IntTextPair as the key. I want to group the mapper output by just the Int component of the IntTextPair. For example, given:
[1 Shanghai]
[1 Test]
[2 Set]
and the mapper result should be grouped as:
[1 Shanghai, Test]
[2 Set]
I have implemented the Comparator class:
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
public class GroupByInput extends WritableComparator {
public GroupByInput() {
super(IntTextPair.class, true);
}
@Override
public int compare(WritableComparable it1, WritableComparable it2) {
IntTextPair Pair1 = (IntTextPair) it1;
IntTextPair Pair2 = (IntTextPair) it2;
return Pair1.getFirst().compareTo(Pair2.getFirst());
}
}
and in the job configuration I have set the comparator class like this:
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setGroupingComparatorClass(GroupByInput.class);
Am I heading in the right direction? I need some assistance.

You can't merge or consolidate the keys in the way you have outlined. What is the current Mapper output value type/class? Is there a reason why you can't output a simple key/value (KV) pair from the mapper, with the Int as the key and the Text as the value?
If you do have another class/type currently being output from the mapper as the value component, then you can still more or less achieve this as follows:
Your grouping comparator looks good. Paired with the ordering of IntTextPair, it means that all keys with the same Int component are handed to the same reduce call (provided the partitioner also sends them to the same reducer).
In your reducer, as you iterate the values, you can examine the key to determine the unique list of Text components of the key.
It's not very well known that as you iterate the values in the reducer, the contents of the key are updated too. With your grouping comparator the Int component will always be the same for a particular reduce call, but the Text component can change.
As the keys are ordered, you can keep track of the previous Text component value (be sure to COPY the contents before you move on to the next value in the values iterable, because Hadoop reuses the key object).
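The steps above can be sketched without any Hadoop dependency. The following is only a plain-Java simulation under stated assumptions: `Pair` is a hypothetical stand-in for IntTextPair, the input list plays the role of the already sorted keys that a reducer would see, and `group` is an illustrative name, not part of any Hadoop API. It records a Text component only when it differs from the previous one within the same Int group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupedTextSketch {
    /** Hypothetical stand-in for IntTextPair: an int plus a text component. */
    static final class Pair {
        final int first;
        final String second;

        Pair(int first, String second) {
            this.first = first;
            this.second = second;
        }
    }

    /**
     * Collects the distinct text components per int component.
     * Assumes the keys arrive sorted by (first, second), as they would after the shuffle.
     */
    static Map<Integer, List<String>> group(List<Pair> sortedKeys) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        Pair previous = null;
        for (Pair key : sortedKeys) {
            boolean newGroup = previous == null || previous.first != key.first;
            boolean newText = newGroup || !previous.second.equals(key.second);
            if (newText) {
                result.computeIfAbsent(key.first, k -> new ArrayList<>()).add(key.second);
            }
            // Strings are immutable, so keeping a reference is safe in this simulation.
            // In a real reducer the key object is reused, so its contents must be copied.
            previous = key;
        }
        return result;
    }
}
```

In a real reducer the same comparison would happen inside the loop over the values, reading the Text component from the key on each iteration.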

Related

Java Streams with combining multiple rows to one

My code consists of a class with 10 variables. The class is populated from a database table, and the query result is a List of these objects. Here is a sample class:
@Data
class pMap {
long id;
String rating;
String movieName;
}
The data will be as follows:
id=1, rating=PG-13, movieName=
id=1, rating=, movieName=Avatar
id=2, rating=, movieName=Avatar 2
id=2, rating=PG, movieName=
I want to combine the two rows for each id into a single row, grouping by id, using Java streams. The end result should look like this Map<Long, pMap>:
1={id=1, rating=PG-13, movieName=Avatar},
2={id=2, rating=PG, movieName=Avatar 2}
I am not sure how I can get the rows combined to one by pivoting them.

You can use toMap to achieve this:
Map<Long, pMap> myMap = myList.stream().collect(Collectors.toMap(x -> x.id, Function.identity(),
(x1, x2) -> new pMap(x1.id, x1.rating != null ? x1.rating : x2.rating, x1.movieName != null ? x1.movieName : x2.movieName)));
I am passing three functions to the toMap method:
First one is a key mapper. It maps an element to a key of the map. In this case, I want the key to be the id.
The second one is a value mapper. I want the value to be the actual pMap so this is why I am passing the identity function.
The third argument is a merger function that tells how to merge two values with the same id.
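A self-contained sketch of the three-argument toMap call described above. It makes a few assumptions that are not in the question: `Row` is a hypothetical plain class standing in for pMap (no Lombok), and the helper `firstNonBlank` treats both null and the empty string as missing, since the sample data shows empty fields rather than explicit nulls.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class MovieRowMergeSketch {
    /** Hypothetical stand-in for pMap with only the three sample fields. */
    static final class Row {
        final long id;
        final String rating;
        final String movieName;

        Row(long id, String rating, String movieName) {
            this.id = id;
            this.rating = rating;
            this.movieName = movieName;
        }
    }

    /** Returns a unless it is null or empty, in which case b is returned. */
    static String firstNonBlank(String a, String b) {
        return (a != null && !a.isEmpty()) ? a : b;
    }

    /** Merges rows that share an id, field by field. */
    static Map<Long, Row> mergeById(List<Row> rows) {
        return rows.stream().collect(Collectors.toMap(
                r -> r.id,                      // key mapper
                Function.identity(),            // value mapper
                (a, b) -> new Row(a.id,         // merge function for duplicate ids
                        firstNonBlank(a.rating, b.rating),
                        firstNonBlank(a.movieName, b.movieName))));
    }
}
```

If the missing fields are guaranteed to be null rather than empty, the plain null checks from the answer are sufficient.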

Best way to save some data and then retrieve it

I have a project where I save some data coming from different channels of a Soap Service, for example:
String_Value Long_timestamp Double_value String_value String_value Int_value
I can have many lines (i.e. 200), with different values, like the one above.
I thought I could use an ArrayList, but the data can have a different structure from the one above, so an ArrayList may not be a good way to retrieve it.
In the example above there are 4 values after the first two, which are always fixed, but another channel may have 3 or 5 values. When I want to retrieve the data, I must know how many values a particular line has, and I don't think an ArrayList helps with that.
What solution could I use?

When you have a need to uniquely identify varying length input, a HashMap usually works quite well. For example, you can have a class:
public class Record
{
private HashMap<String, String> values;
public Record()
{
// create your hashmap.
values = new HashMap<String, String>();
}
public String getData(String key)
{
return values.get(key);
}
public void addData(String key, String value)
{
values.put(key, value);
}
}
With this type of structure, you can save as many different values as you want. What I would do is loop through each value passed from Soap and simply add to the Record, then keep a list of Record objects.
Record rec = new Record();
rec.addData("timestamp", timestamp);
rec.addData("Value", value);
rec.addData("Plans for world domination", dominationPlans);
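As a rough sketch of the loop described above, the following turns one line with a variable number of trailing values into a keyed map. Everything specific here is an assumption for illustration: the line is taken to be whitespace-separated, and the key names `name`, `timestamp` and `value1`, `value2`, ... are hypothetical. The real SOAP payload may need a different parsing step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ChannelLineSketch {
    /**
     * Parses a line whose first two fields are fixed and whose remaining
     * fields vary in number. Insertion order is preserved.
     */
    static Map<String, String> parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 2) {
            throw new IllegalArgumentException("expected at least two fields: " + line);
        }
        Map<String, String> values = new LinkedHashMap<>();
        values.put("name", parts[0]);          // hypothetical key name
        values.put("timestamp", parts[1]);     // hypothetical key name
        for (int i = 2; i < parts.length; i++) {
            values.put("value" + (i - 1), parts[i]);
        }
        return values;
    }
}
```

The number of variable values for a line is then simply the map size minus two, so the caller does not need to know it in advance.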

You could build classes representing the entities and then build a parser. If the data isn't in a standard format (e.g. JSON, YAML, etc.), you have no choice but to develop your own parser.

Create a class with fields.
class ClassName{
int numberOfValues;
String dataString;
...
}
Now create an ArrayList of that class, i.e. ArrayList<ClassName>. For each record, fill an object of that class with numberOfValues and dataString and add it to the ArrayList.

multiple inputs and grouping comparator

I have inputs from two sources, each with its own mapper.
The first mapper's output has the form:
output.collect(new StockKey(new Text(x+" "+id), new Text(id2)), new Text(data));
The second mapper's output has the form:
output.collect(new StockKey(new Text(x+" "+id), new Text("1")), new Text(data));
Job conf:
conf.setPartitionerClass(CustomPartitioner.class);
conf.setValueGroupingComparatorClass(StockKeyGroupingComparator.class);
where StockKey is a custom class of format (new Text(), new Text());
Constructor:
public StockKey(){
this.symbol = new Text();
this.timestamp = new Text();
}
Grouping comparator:
public class StockKeyGroupingComparator extends WritableComparator {
protected StockKeyGroupingComparator() {
super(StockKey.class, true);
}
public int compare(WritableComparable w1, WritableComparable w2){
StockKey k1 = (StockKey)w1;
StockKey k2 = (StockKey)w2;
Text x1 = new Text(k1.getSymbol());
Text x2 = new Text(k2.getSymbol());
return x1.compareTo(x2);
}
}
But I'm not receiving the map output values from both inputs; only one mapper's output values reach the reducer. I want the records whose symbol, i.e. new Text(x+" "+id), is common to both map outputs to be grouped to the same reducer. I am stuck here.
Please help!

To do this you need a Partitioner which fits in as follows:
Your mappers output a bunch of records as key/value pairs
For each record, the partitioner is passed the key, the value and the number of reducers. The partitioner decides which reducer will handle the record
The records are shipped off to their respective partitions (reducers)
The GroupingComparator is run to decide which key/value pairs get grouped into an iterable for a single call to the reduce() method
and so on...
I think the default partitioner is choosing the reducer partition for each record based on the entire value of your key (that's the default behavior). But you want records grouped by only part of the key (just the symbol and not the symbol and timestamp). So you need to write a partitioner that does this and specify/configure it in the driver class.
Once you do that, your grouping comparator should help group the records as you've intended.
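The idea of partitioning on only part of the key can be illustrated in plain Java. This is a sketch under assumptions, not the asker's CustomPartitioner: `partitionFor` is a hypothetical helper that takes the symbol and timestamp as strings and uses the usual mask-and-modulo hashing on the symbol alone.

```java
class SymbolPartitionSketch {
    /**
     * Picks a partition from the symbol only. The timestamp is deliberately
     * ignored so that keys sharing a symbol always land in the same partition.
     */
    static int partitionFor(String symbol, String timestamp, int numReducers) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
        return (symbol.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

In a real job the same expression would go in the getPartition method of the custom partitioner, applied to the symbol component of StockKey.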
EDIT: random thoughts
You might make things easier on yourself if you moved the timestamp to the value, making the key simple (just the symbol) and the value complex (timestamp and value). Then you wouldn't need a partitioner or a grouping comparator.
You didn't say either way, but you did use the MultipleInputs class, right? That's the only way to invoke two or more mappers for the same job.

In Java does a data type like this exist?

I found this in redis and was trying to see if there was anything similar in Java. Say I have the following data:
3:1
3:2
3:3
4:1
As you can see, none of the data points are unique on their own, but the combination is unique. There is a command in redis:
sadd 3 1 2
sadd 3 3
sadd 4 1
That would give me something like this:
3 -> 1, 2, 3
4 -> 1
I could then run something like smembers 3 (which would return everything for 3) or smembers 3 2 (which would report whether the subvalue exists).
I'm wondering what the closest I can get to this functionality within Java?

The Guava Multimap interface does exactly this. Note that it's up to the particular implementation whether it allows duplicate <K,V> pairs. It sounds like you're after one where the K,V pairs are always unique. If so, have a look at the HashMultimap class.
However, if you want to roll your own, you're probably looking for a combination of Map and Set: Map<Integer,Set<Integer>>
When you add (key, value) elements to the map:
First check to see whether the key is in there. If not, you'll need to add an empty Set<Integer>.
Then, do map.get(key).add(value);
When you want to retrieve all elements with a specific key:
do map.get(key) and iterate over the result
When you want to see if a specific key/value pair is in there:
do if(map.containsKey(key) && map.get(key).contains(value))
For extra credit, you could implement all of this inside a wrapper. The ForwardingMap class from Guava might be a good place to start.
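The roll-your-own steps above could look roughly like the following. The class name `IntSetMapSketch` and the method names `sadd`, `smembers` and `contains` are hypothetical, chosen only to echo the redis commands in the question. It uses only the standard library.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class IntSetMapSketch {
    private final Map<Integer, Set<Integer>> map = new HashMap<>();

    /** Adds value under key, creating the set on first use. Returns false if the pair already existed. */
    boolean sadd(int key, int value) {
        return map.computeIfAbsent(key, k -> new HashSet<>()).add(value);
    }

    /** Returns a read-only view of all values for key, or an empty set if the key is unknown. */
    Set<Integer> smembers(int key) {
        Set<Integer> members = map.get(key);
        return members == null ? Collections.<Integer>emptySet() : Collections.unmodifiableSet(members);
    }

    /** Reports whether the specific key/value pair is present. */
    boolean contains(int key, int value) {
        return smembers(key).contains(value);
    }
}
```

Using computeIfAbsent folds the "check whether the key exists, then add an empty set" step into a single call.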

You can create your own class MultiValueMap like this:
import java.util.Set;
import java.util.Map;
import java.util.HashMap;
import java.util.List;
import java.util.ArrayList;
public class MultiValueMap<T1, T2> {
public Map<T1, List<T2>> map = null;
public MultiValueMap(){
this.map = new HashMap<>();
}
public void putList(T1 key, List<T2> list){
map.put(key, list);
}
public void put(T1 key, T2 value){
List<T2> list = null;
if(map.get(key) == null){
list = new ArrayList<T2>();
map.put(key, list);
}
else {
list = map.get(key);
}
list.add(value);
}
public List<T2> get(T1 key){
return map.get(key);
}
public Set<T1> keySet(){
return map.keySet();
}
public Map<T1, List<T2>> getMap(){
return this.map;
}
public boolean contains(T1 key, T2 listValue){
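List<T2> list = map.get(key);
// Guard against an unknown key, which would otherwise cause a NullPointerException.
return list != null && list.contains(listValue);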
}
}

There's nothing directly built into Java. You could use something like a
Map<Integer, Set<Integer>>
to store the relationships. This kind of construct is discussed in the Java tutorial on the Map interface. You could also use something like Guava's Multimap<Integer, Integer>.

From Wikipedia:
In its outer layer, the Redis data model is a dictionary where keys
are mapped to values.
In other words, just use a Map to store key/value pairs. Note that Map is only an interface. You will need to create a Map object using a class that implements the Map interface, e.g. HashMap, TreeMap, etc. I think you're confusing the data structure itself with the implementation of its methods. The functions you mentioned can easily be implemented in Java.

You can achieve this using the Collections Framework in Java.
Since you are looking to store key/value pairs, you can use Map and Set:
Map<Integer, Set<Integer>>

Hadoop Looping the Reducer

I am trying to find a way to "loop" my reducer, for example:
for(String document: tempFrequencies.keySet())
{
if(list.get(0).equals(document))
{
testMap.put(key.toString(), DF.format(tfIDF));
}
}
// This lets me build a HashMap that I plan to write out to the context with filename = key and all of the term weights = value (a list I can parse out in the next job)
The code currently will run through the entire reduce and give me what I want for list.get(0) but the problem is once it is finished doing that entire reduce I need it to start again for list.get(1) etc. Any ideas on how to loop the reduce phase after it has finished?

Nest the for loop
for(int i = 0; i < number_of_time; i++){
//your code
}
Replace the 0 with i.

You can use key-tag-value technique.
In the mapper, emit (key, 0, value) for list values and (key, 1, value) for documents (?). In the reducer, values will be grouped by key and sorted by tag within each key. You should write your own grouping comparator (and custom partitioner).
PS: I am using the same technique for graph processing. I can provide sample code after the weekend.
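Until that sample code is available, here is a plain-Java simulation of the key-tag-value idea, with no Hadoop dependency. It is only a sketch: `Tagged` and `shuffle` are hypothetical names, and the sort plus grouping loop stands in for what the sort comparator and grouping comparator would do during the shuffle.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class KeyTagValueSketch {
    /** Hypothetical map output record: a key, a tag (0 or 1) and a value. */
    static final class Tagged {
        final String key;
        final int tag;
        final String value;

        Tagged(String key, int tag, String value) {
            this.key = key;
            this.tag = tag;
            this.value = value;
        }
    }

    /**
     * Sorts by (key, tag) and groups by key only, so within each group
     * all tag-0 values come before all tag-1 values.
     */
    static Map<String, List<String>> shuffle(List<Tagged> mapOutput) {
        List<Tagged> sorted = new ArrayList<>(mapOutput);
        sorted.sort(Comparator.comparing((Tagged t) -> t.key).thenComparingInt(t -> t.tag));
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Tagged t : sorted) {
            grouped.computeIfAbsent(t.key, k -> new ArrayList<>()).add(t.tag + ":" + t.value);
        }
        return grouped;
    }
}
```

Because the tag-0 values arrive first, a reducer can read them into memory and then stream through the tag-1 values in the same call.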
