Creating all pairs from a list of values in Hadoop - Java

I have a small map-reduce program I'm writing for hadoop, one element of the program is to create all pairs of a list. For example if the input for the program is:
item1 tag1
item2 tag1
item3 tag2
item4 tag1
item5 tag2
My map function creates a <tag, item> pair, so the reducer receives <tag, List<item>> as its input. My goal is for the output from the reducer to be:
item1-item2 tag1
item1-item4 tag1
item2-item4 tag1
item3-item5 tag2
So essentially, for every list of values, to create all the possible pairs, and make each pair a key.
I have found a solution that works, but it relies on copying the list into memory and iterating over it. This might be a problem since my dataset can be very large:
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // copy the values into memory first...
    List<String> list = new ArrayList<String>();
    for (Text t : values) {
        list.add(t.toString());
    }
    // ...then emit every unordered pair of items as a key
    for (int i = 0; i < list.size() - 1; i++) {
        for (int j = i + 1; j < list.size(); j++) {
            out.set(list.get(i) + "-" + list.get(j)); // out and one are fields of the reducer
            context.write(out, one);
        }
    }
}
Is there an alternative, or more efficient way of doing it in hadoop?
I do not want to be copying each list into memory.
I've been trying to come up with something creative like using another map-reduce step, but cannot seem to find something that works.
Thank you!

The reducer does get all of that data, but that data is actually written to disk and is only brought into memory as you iterate through the Iterable of values. In fact, the object returned by that iteration is reused for each value: its fields and other state are simply replaced before the object is handed to you.
That means you have to explicitly copy the value object in order to have all value objects in memory at the same time.
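For example, here is a minimal sketch of that reuse pitfall, not your exact code, assuming the usual org.apache.hadoop.io.Text copy constructor:
List<Text> copies = new ArrayList<Text>();
for (Text t : values) {
    // copies.add(t);        // wrong: every element would end up pointing at the same reused object
    copies.add(new Text(t)); // right: an explicit copy, which is also what your list.add(t.toString()) does
}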
Looking at your code, it seems you are not keeping the item pairs in memory: you write each pair out directly, so you should be good.

Related

Get total for unique items from Map in java

I currently have a map which stores the following information:
Map<String,String> animals= new HashMap<String,String>();
animals.put("cat","50");
animals.put("bat","38");
animals.put("dog","19");
animals.put("cat","31");
animals.put("cat","34");
animals.put("bat","1");
animals.put("dog","34");
animals.put("cat","55");
I want to create a new map with total for unique items in the above map. So in the above sample, count for cat would be 170, count for bat would be 39 and so on.
I have tried using a Set to find the unique animal entries in the map; however, I am unable to get the total count for each unique entry.
First, don't use String for arithmetic, use int or double (or BigInteger/BigDecimal, but that's probably overkill here). I'd suggest making your map a Map<String, Integer>.
Second, Map.put() will overwrite the previous value if the given key is already present in the map, so as #Guy points out your map actually only contains {cat:55, dog:34, bat:1}. You need to get the previous value somehow in order to preserve it.
The classic way (pre-Java-8) is like so:
public static void putOrUpdate(Map<String, Integer> map, String key, int value) {
    Integer previous = map.get(key);
    if (previous != null) {
        map.put(key, previous + value);
    } else {
        map.put(key, value);
    }
}
Java 8 adds a number of useful methods to Map to make this pattern easier, like Map.merge() which does the put-or-update for you:
map.merge(key, value, (p, v) -> p + v);
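For example, a quick sketch of how the totals for your sample data could be accumulated with merge(), assuming the counts are already parsed to int and the map is a Map<String, Integer>:
Map<String, Integer> totals = new HashMap<>();
totals.merge("cat", 50, Integer::sum); // key absent: stores 50
totals.merge("bat", 38, Integer::sum);
totals.merge("cat", 31, Integer::sum); // key present: stores 50 + 31 = 81
// ... and so on; "cat" ends up at 170 and "bat" at 39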
You may also find that a multiset is a better data structure to use as it handles incrementing/decrementing for you; Guava provides a nice implementation.
As Guy said, right now you have only one bat, one dog and one cat. Further 'put's will override your previous values; by definition, a Map stores key-value pairs where each key is unique. If you have to do it with a map, you can sum the values as you insert them. For example, if you want to add another value for cat and update the total, you can do it this way:
animals.put("cat", animals.get("cat") + yourNewValue);
Your value for cat will be updated. This works when the values are float/int/long, not String as you have. If you have to keep them as Strings, you can do it like this:
animals.put("cat", Integer.toString(Integer.parseInt(animals.get("cat")) + yourNewValue));
However, it's ugly. I'd recommend creating
Map<String, Integer> animals = new HashMap<String, Integer>();

Best way to implement a Reverse MultiMap

My Java8 program has several stages:
A CSV file is parsed. The CSV file looks like this:
123,[Foo:true; Bar:true; Foobar:false; Barfoo:false]
456,[Foobar:true; Barfoo:false; Foo:false; Bar:false]
789,[Foobar:true; Barfoo:false; Foo:false]
where 123, 456 and 789 are unique identifiers of each datastructure; one datastructure is represented by the column identifiers Foo, Bar, Foobar and Barfoo, and each boolean indicates whether the column is a key column.
While the CSV file is parsed, for each line a datastructure must be created.
Later, at runtime, which needs to be fast, the following will happen:
An ArrayList<String> containing column data is given. The data needs to be added to a specific datastructure (I do know the unique identifier, 123).
Say: the ArrayList<String> needs to be added to 123: 1->foo, 2->bar, 3->foobar, 4->barfoo.
Say: another ArrayList<String> needs to be added to 456: 1->foobar, 2->barfoo
Say: another ArrayList<String> needs to be added to 789: 1->foobar, 2->null, 3->foo
The tasks that the datastructure needs to provide are the following:
add(ArrayList<String>) : void
remove(ArrayList<String>) : boolean (if successful)
contains(Arraylist<String>) : boolean
get(ArrayList<String>) : ArrayList<String>
Notes:
The combination of all key values inside one datastructure (e.g. 123) is unique. Meaning: if 123 contains one entry with foo, bar, foobar, barfoo, another entry with foo, bar, <doesn't matter>, <doesn't matter> will not be allowed. Another entry with foo1, bar, foobar, barfoo is allowed, as is an entry with foo1, bar1, foobar, barfoo.
It won't happen that, while parsing, a column name that is not a key (true) appears in front of a key. This will not happen:
[Foobar:true;Barfoo:false;Foo:true;Bar:false]
It won't happen, at runtime, that a column marked as a key gets no data. This will not happen: an ArrayList<String> added to 123 whose data looks like this: 1->foo, 2->null, 3->foobar.
I tried: storing two arrays in each datastructure class, one with the column names and one with the numbers of the columns that are keys. At runtime, the key-indicating array is processed to collect all key values (in the first add example above that would be foo and bar), which are concatenated into a String ("foo,bar"). This string is the key for a second HashMap<String, ArrayList<String>> whose value (an ArrayList<String>) contains the data of all columns (foo, bar, foobar, barfoo).
I have a getKeyString method:
String getKeyString(ArrayList<String> keys, ArrayList<Integer> keyPos) throws Exception {
    // if the last entry of the ordered keyPos list points beyond the end of keys
    if (keyPos.get(keyPos.size() - 1) >= keys.size())
        throw new Exception();
    String collect = keyPos.stream()
            .map(i -> keys.get(i))
            .map(string -> {
                try {
                    if (string.equals("null")) // happens rarely, ~1 time in 1,000
                        return "";
                } catch (NullPointerException e) { // happens even less, ~1 in 100,000
                    return "";
                }
                return string;
            })
            .collect(Collectors.joining(","));
    if (collect.length() < keyPos.size())
        throw new Exception("results in an empty key: ");
    return collect;
}
and addDataListEntry looks quite similar to this:
HashMap<String, ArrayList<String>> dataLists = new HashMap<>();
ArrayList<Integer> keyPos = new ArrayList<>();
...
public void addDataListEntry(ArrayList<String> values) {
    // will overwrite the entry if it already exists
    try {
        this.dataLists.put(getKeyString(values, keyPos), values);
    } catch (Exception e) {
        logger.info(e.getMessage());
    }
}
This does actually work, but it is really slow, since the key ("foo,bar") needs to be created on every access to the datastructure.
Which combination of HashMaps, Lists, Sets, (even Google Guava) is the best to make it as fast as possible?

What Java data structure is best for two-way multi-value mapping

I'm relatively new to Java and I have a question about what type of data structure would be best for my case. I have a set of data which is essentially key-value pairs; however, each value may correspond to multiple keys and each key may correspond to multiple values. A simplified example would be:
Red-Apple
Green-Apple
Red-Strawberry
Green-Grapes
Purple-Grapes
Considering the above example, I need to be able to return what color apples I have and/or what red fruits I have. The actual data will be generated dynamically from an input file, where each set will have anywhere from 100 to 100,000 values and each value may correspond to hundreds of values in the other set.
What would be the most efficient way to store and parse this data? I would prefer a solution as native to java as possible rather than something such as an external database.
This question is related, but I'm not sure how to apply the solution in my case given that I would need to assign multiple values to each key in both directions.
As you can't have duplicate keys in a Map, you can rather create a Map<Key, List<Value>>, or if you can, use Guava's Multimap.
Multimap<String, String> multimap = ArrayListMultimap.create();
multimap.put("Red", "Apple");
multimap.put("Red", "Strawberry");
System.out.println(multimap.get("Red")); // Prints - [Apple, Strawberry]
But the problem is you can't ask for the keys of a given object. I'll keep looking and make an edit if I find something else; hope it helps.
Still, you can make the reverse yourself by iterating the map and finding the keys for the object.
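For example, a small sketch of that manual reverse lookup over the multimap built above (Guava's Multimaps.invertFrom could also build the whole inverse multimap in one call):
List<String> colorsOfApple = new ArrayList<>();
for (Map.Entry<String, String> entry : multimap.entries()) {
    if (entry.getValue().equals("Apple")) {
        colorsOfApple.add(entry.getKey());
    }
}
System.out.println(colorsOfApple); // whichever colors map to Apple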
I suggest you use Guava's Table structure. Use color as your row keys and fruit as your column key or the other way round. Specifically, HashBasedTable is well suited for your case.
As per your use case, you wouldn't need to store anything meaningful for the values. However, these Tables don't allow null values, so you could use a dummy Boolean or any other statistically useful value, e.g. date and time of insertion, user, number of color/fruit pairs, etc.
Table has the methods you need, such as column() and row(). Bear in mind that the docs say that these structures are optimized for row access. This might be OK for you if you plan to access by one key more than by the other.
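For example, a minimal sketch with the sample data, using Boolean.TRUE as the dummy value:
Table<String, String, Boolean> fruits = HashBasedTable.create();
fruits.put("Red", "Apple", Boolean.TRUE);
fruits.put("Green", "Apple", Boolean.TRUE);
fruits.put("Red", "Strawberry", Boolean.TRUE);
System.out.println(fruits.row("Red").keySet());      // the red fruits: [Apple, Strawberry]
System.out.println(fruits.column("Apple").keySet()); // the colors of apples: [Red, Green]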
You can create your own custom data structure
public class MultiValueHashMap<K, V> {
    private HashMap<K, ArrayList<V>> multivalueHashMap = new HashMap<K, ArrayList<V>>();

    public static void main(String[] args) {
        MultiValueHashMap<String, String> multivaluemap = new MultiValueHashMap<String, String>();
        multivaluemap.put("Red", "Apple");
        multivaluemap.put("Green", "Apple");
        multivaluemap.put("Red", "Strawberry");
        multivaluemap.put("Green", "Grapes");
        multivaluemap.put("Purple", "Grapes");
        for (String k : multivaluemap.keySet()) {
            System.out.println(k + " : " + multivaluemap.get(k).toString());
        }
    }

    public void put(K key, V value) {
        if (multivalueHashMap.containsKey(key)) {
            ArrayList<V> values = multivalueHashMap.get(key);
            values.add(value);
        } else {
            ArrayList<V> values = new ArrayList<V>();
            values.add(value);
            multivalueHashMap.put(key, values);
        }
    }

    public Set<K> keySet() {
        return multivalueHashMap.keySet();
    }

    public ArrayList<V> get(K key) {
        return multivalueHashMap.get(key);
    }
}
The output should be
Red : [Apple, Strawberry]
Purple : [Grapes]
Green : [Apple, Grapes]

Fast aggregation of multiple ArrayLists into a single one

I have the following list:
List<ArrayList> list;
list.get(i) contains the ArrayList object with the following values {p_name=set1, number=777002}.
I have to create a
Map<key,value>
where the key contains the p_name, and values are the numbers.
How can I do this easily and fast, given that there can be hundreds of entries in the initial list and each number can be present in multiple p_name entries?
Update: Here is my current solution
List<Row> list; // here is my data
Map<String, String> map = new TreeMap<String, String>();
for (Row l : list) {
    if (l.hasValues()) {
        Map<String, String> values = l.getResult(); // internal method of the Row interface that returns a map
        String key = values.get("number");
        map.put(key, values.get("p_name"));
    }
}
The method works, but maybe it could be done better?
PS : There is an obvious error in my design. I wonder if you find it :)
Since the key can have more than one value, what you are looking for is a MultiMap, e.g. Guava's Multimap.
Or a simple map in the form
Map<Key,ArrayList<Values>>
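For example, a minimal Java 8 sketch of that shape, assuming each Row exposes its values through getResult() as in your snippet:
Map<String, List<String>> numbersByName = new TreeMap<>();
for (Row l : list) {
    if (l.hasValues()) {
        Map<String, String> values = l.getResult();
        numbersByName
            .computeIfAbsent(values.get("p_name"), k -> new ArrayList<>())
            .add(values.get("number"));
    }
}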
There is no "fast" way here, as far as I can see. You still need to iterate through all the elements and check all the values.
And hundreds of entries is actually not much at all for Java.

Hadoop Looping the Reducer

I am trying to find a way to "loop" my reducer, for example:
for (String document : tempFrequencies.keySet()) {
    if (list.get(0).equals(document)) {
        testMap.put(key.toString(), DF.format(tfIDF));
    }
}
// This allows me to create a HashMap which I plan to write out to the context as
// filename = key, then all of the term weights = value (a list I can parse out in the next job)
The code currently runs through the entire reduce and gives me what I want for list.get(0), but the problem is that once it has finished that entire reduce, I need it to start again for list.get(1), and so on. Any ideas on how to loop the reduce phase after it has finished?
Nest the for loop
for (int i = 0; i < number_of_time; i++) {
    // your code
}
Replace the 0 with i.
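Applied to your snippet it would look roughly like this (a sketch only, reusing the variable names from your question):
for (int i = 0; i < list.size(); i++) {
    for (String document : tempFrequencies.keySet()) {
        if (list.get(i).equals(document)) {
            testMap.put(key.toString(), DF.format(tfIDF));
        }
    }
}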
You can use the key-tag-value technique.
In the mapper, emit (key, 0, value) for list values and (key, 1, value) for documents (?). In the reducer, values will be grouped by key and tag, and sorted by tag for each key. You should write your own grouping comparator (and a custom partitioner).
PS: I am using the same technique for graph processing. I can provide sample code after the weekend.
