Hadoop Looping the Reducer - java

I am trying to find a way to "loop" my reducer, for example:
for (String document : tempFrequencies.keySet())
{
    if (list.get(0).equals(document))
    {
        testMap.put(key.toString(), DF.format(tfIDF));
    }
}
// This allows me to create a HashMap which I plan to write out to the context as filename = key, then all of the term weights = value (a list I can parse out in the next job)
The code currently runs through the entire reduce and gives me what I want for list.get(0), but the problem is that once it has finished that entire reduce, I need it to start again for list.get(1), and so on. Any ideas on how to loop the reduce phase after it has finished?

Nest the for loop:
for (int i = 0; i < number_of_times; i++) {
    // your code
}
Replace the 0 with i.
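Applied to the snippet from the question (assuming list, tempFrequencies, testMap, key, DF and tfIDF are the question's own variables), that would look roughly like:
// hedged sketch: run the same matching once per element of `list`
for (int i = 0; i < list.size(); i++) {
    for (String document : tempFrequencies.keySet()) {
        if (list.get(i).equals(document)) {
            testMap.put(key.toString(), DF.format(tfIDF));
        }
    }
}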

You can use the key-tag-value technique.
In the mapper, emit (key, 0, value) for list values and (key, 1, value) for documents (?). In the reducer, values will be grouped by key and tag, and sorted by tag for each key. You will need to write your own grouping comparator (and a custom partitioner).
PS: I am using the same technique for graph processing. I can provide sample code after the weekend.
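In the meantime, here is a minimal, hedged sketch of the key-tag-value (secondary sort) setup described above; TaggedKey, TagPartitioner and NaturalKeyGroupingComparator are hypothetical names, not code from the poster:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key: the natural key plus a tag (e.g. 0 = list value, 1 = document).
class TaggedKey implements WritableComparable<TaggedKey> {
    Text naturalKey = new Text();
    IntWritable tag = new IntWritable();

    public void write(DataOutput out) throws IOException {
        naturalKey.write(out);
        tag.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        naturalKey.readFields(in);
        tag.readFields(in);
    }

    // Sort by natural key first, then by tag, so tag-0 values reach the reducer first.
    public int compareTo(TaggedKey other) {
        int cmp = naturalKey.compareTo(other.naturalKey);
        return cmp != 0 ? cmp : tag.compareTo(other.tag);
    }
}

// Partition on the natural key only, so every tag of a key goes to the same reducer.
class TagPartitioner extends Partitioner<TaggedKey, Text> {
    public int getPartition(TaggedKey key, Text value, int numPartitions) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group reducer input on the natural key only, ignoring the tag.
class NaturalKeyGroupingComparator extends WritableComparator {
    public NaturalKeyGroupingComparator() {
        super(TaggedKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((TaggedKey) a).naturalKey.compareTo(((TaggedKey) b).naturalKey);
    }
}
Wire it up with job.setPartitionerClass(TagPartitioner.class) and job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class).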

Related

Java Stream - Retrieving repeated records from CSV

I searched the site and didn't find anything similar. I'm a newbie to using Java streams, but I understand that they're a replacement for a loop. However, I would like to know if there is a way to filter a CSV file using a stream, as shown below, where only the repeated records are included in the result, grouped by the Center field.
Initial CSV file
Final result
In addition, the same pair cannot appear in the final result inversely, as shown in the table below:
This shouldn't happen
Is there a way to do it using stream and grouping at the same time, since theoretically, two loops would be needed to perform the task?
Thanks in advance.
You can do it in one pass as a stream with O(n) efficiency:
class PersonKey {
    // have a field for every column that is used to detect duplicates
    String center, name, mother, birthdate;

    public PersonKey(String line) {
        // implement String constructor
    }

    // implement equals and hashCode using all fields
}

List<String> lines; // the input
Set<PersonKey> seen = new HashSet<>();
List<String> repeated = lines.stream()
        .filter(p -> !seen.add(new PersonKey(p)))
        .distinct()
        .collect(Collectors.toList());
The trick here is that a HashSet has constant time operations and its add() method returns false if the value being added is already in the set, true otherwise.
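A tiny standalone illustration of that add() behaviour:
Set<String> seen = new HashSet<>();
System.out.println(seen.add("a")); // true  - "a" was not in the set yet
System.out.println(seen.add("a")); // false - "a" is already present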
What I understood from your examples is that you consider an entry a duplicate if all the attributes have the same value except the ID. You can use anyMatch for this:
list.stream()
    .filter(x -> list.stream().anyMatch(y -> isDuplicate(x, y)))
    .collect(Collectors.toList());
So what does the isDuplicate(x,y) do?
This returns a boolean. You can check whether all the entries have same value except the id in this method:
private boolean isDuplicate(CsvEntry x, CsvEntry y) {
    return !x.getId().equals(y.getId())
            && x.getName().equals(y.getName())
            && x.getMother().equals(y.getMother())
            && x.getBirth().equals(y.getBirth());
}
I've assumed you've taken all the entries as String; change the checks according to the type. This will give you the duplicate entries with their corresponding IDs.

Java 8 Streams for List iteration

I have a HashMap that maps List<Dto> keys to List<List<String>> values:
Map<List<Dto>, List<List<String>>> mapData = new HashMap();
and an ArrayList<Dto>.
I want to iterate over this map, get the keys (key1, key2, etc.), get the value out of each, set it on the Dto object, and then add that to a List. I am able to iterate using foreach and add to the lists successfully, but I am not able to get it done correctly using Java 8 streams, so I need some help with that. Here is the sample code:
List<DTO> dtoList = new ArrayList();
DTO dto = new DTO();
mapData.entrySet().stream().filter(e -> {
    if (e.getKey().equals("key1")) {
        dto.setKey1(e.getValue())
    }
    if (e.getKey().equals("key2")) {
        dto.setKey2(e.getValue())
    }
});
Here e.getValue() is a List<List<String>>, so the first thing is that I need to iterate over it to set the value.
The second thing is that I need to add the dto to an ArrayList dtoList. How can I achieve this?
Here is a basic snippet that I tried without the HashMap, where the List holds the keys, multiList holds the values, and the Dto list is what I finally add into:
for (List<Dto> dtoList : column) {
    if ("Key1".equalsIgnoreCase(column.getName())) {
        index = dtoList.indexOf(column);
    }
}
for (List<String> listoflists : multiList) {
    if (listoflists.contains(index)) {
        for (String s : listoflists) {
            dto.setKey1(s);
        }
        dtoList.add(dto);
    }
}
See https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html
Stream operations are divided into intermediate and terminal operations, and are combined to form stream pipelines. A stream pipeline consists of a source (such as a Collection, an array, a generator function, or an I/O channel); followed by zero or more intermediate operations such as Stream.filter or Stream.map; and a terminal operation such as Stream.forEach or Stream.reduce.
So in your snippet above, filter isn't really doing anything. To trigger it, you'd add a collect operation at the end. Notice that the filter lambda function needs to return a boolean for your code to compile in the first place.
mapData.entrySet().stream().filter(entry -> {
    // do something here
    return true;
}).collect(Collectors.toList());
Of course you don't need to abuse intermediate operations - or generate a bunch of new objects - for straightforward tasks; something like this should suffice:
mapData.entrySet().stream().forEach(entry -> {
    // do something
});
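Regarding the List<List<String>> values: a hedged sketch of flattening one entry's value with flatMap before putting it on the DTO (the setter call is only illustrative):
mapData.entrySet().stream().forEach(entry -> {
    List<String> flattened = entry.getValue().stream()
            .flatMap(List::stream)                 // Stream<List<String>> -> Stream<String>
            .collect(Collectors.toList());
    // e.g. dto.setKey1(flattened);  // hypothetical setter accepting a List<String>
});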

Creating all pairs of list of values in hadoop

I have a small map-reduce program I'm writing for Hadoop; one element of the program is to create all pairs of a list. For example, if the input for the program is:
item1 tag1
item2 tag1
item3 tag2
item4 tag1
item5 tag2
My map function creates a <tag, item> pair, so the reducer receives <tag, List<item>> as its input. My goal is for the output from the reducer to be:
item1-item2 tag1
item1-item4 tag1
item2-item4 tag1
item3-item5 tag2
So essentially, for every list of values, I want to create all the possible pairs and make each pair a key.
I have found a solution that works, but it relies on copying the list into memory and iterating over it. This might be a problem since my dataset can be very large:
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    List<String> list = new ArrayList<String>();
    for (Text t : values) {
        list.add(t.toString());
    }
    for (int i = 0; i < list.size() - 1; i++) {
        for (int j = i + 1; j < list.size(); j++) {
            out.set(list.get(i) + "-" + list.get(j));
            context.write(out, one);
        }
    }
}
Is there an alternative or more efficient way of doing it in Hadoop?
I do not want to be copying each list into memory.
I've been trying to come up with something creative like using another map-reduce step, but cannot seem to find something that works.
Thank you!
The reducer does get all of that data, but that data is actually written to disk and is only brought into memory as you iterate through the Iterable of values. In fact, the object that is returned by that iteration is reused for each value: the fields and other state are simply replaced before the object is handed to you.
That means you have to explicitly copy the value object in order to have all value objects in memory at the same time.
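To illustrate (a hedged sketch, using the Text values from the question's reducer):
List<String> asStrings = new ArrayList<String>();
List<Text> asWritables = new ArrayList<Text>();
for (Text t : values) {
    asStrings.add(t.toString());    // safe: toString() returns an independent String
    asWritables.add(new Text(t));   // safe: the copy constructor duplicates the bytes
    // asWritables.add(t);          // NOT safe: the same Text instance is reused each iteration
}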
When I look at your code, it seems you are not saving the item pairs in memory. You are writing out the item pairs directly, so you should be good.

Spark flatMap/reduce: How to scale and avoid OutOfMemory?

I am migrating some map-reduce code to Spark, and I am having problems when constructing an Iterable to return in the function.
In the MR code, I had a reduce function that grouped by key and then (using multipleOutputs) would iterate the values and write them out (to multiple outputs, but that's unimportant), with code like this (simplified):
reduce(Key key, Iterable<Text> values) {
    // ... some code
    for (Text xml : values) {
        multipleOutputs.write(key, val, directory);
    }
}
However, in Spark I have translated a map and this reduce into a sequence of:
mapToPair -> groupByKey -> flatMap
as recommended... in some book.
mapToPair basically adds a Key via functionMap, which creates a Key for each record based on some of its values. Sometimes a key may have very high cardinality.
JavaPairRDD<Key, String> rddPaired = inputRDD.mapToPair(new PairFunction<String, Key, String>() {
    public Tuple2<Key, String> call(String value) {
        // ...
        return functionMap.call(value);
    }
});
RDD.groupByKey() is then applied to rddPaired to get the RDD that feeds the flatMap function:
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.groupByKey();
Once grouped, a flatMap call does the reduce. Here, operation is a transformation:
public Iterable<String> call(Tuple2<Key, Iterable<String>> keyValue) {
    // some code...
    List<String> out = new ArrayList<String>();
    if (someConditionOnKey) {
        // do a logic
        Grouper grouper = new Grouper();
        for (String xml : keyValue._2()) {
            // group in a separate class
            grouper.add(xml);
        }
        // operation is now performed on the whole group
        out.add(operation(grouper));
    } else {
        for (String xml : keyValue._2()) {
            out.add(operation(xml));
        }
    }
    return out;
}
It works fine... with keys that don't have too many records. It actually breaks with an OutOfMemory error when a key with a lot of values enters the "else" branch of the reduce.
Note: I have included the "if" part to explain the logic I want to produce, but the failure happens when entering the "else"... because when data enters the "else", it normally means there will be many more values for that key, due to the nature of the data.
It is clear that, having to keep all of the grouped values in the "out" list, this won't scale if a key has millions of records, because they are all kept in memory. I have reached the point where the OOM happens (yes, it's when performing the "operation" above, which asks for memory and none is given; it's not a very memory-expensive operation though).
Is there any way to avoid this in order to scale? Either by replicating the behaviour using some other directives to reach the same output in a more scalable way, or by being able to hand Spark the values for merging (just as I used to do with MR)...
It's inefficient to do the condition check inside the flatMap operation. You should check the condition outside, create 2 distinct RDDs, and deal with them separately.
rddPaired.cache();
// groupFilterFunc will filter which items need grouping
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.filter(groupFilterFunc).groupByKey();
// processGroupedValuesFunction should call `operation` on the group of all values with the same key and return the result
rddGrouped.mapValues(processGroupedValuesFunction);
// nogroupFilterFunc will filter which items don't need grouping
JavaPairRDD<Key, String> rddNoGrouped = rddPaired.filter(nogroupFilterFunc);
// processNoGroupedValuesFunction2 should call `operation` on a single value and return the result
rddNoGrouped.mapValues(processNoGroupedValuesFunction2);
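For completeness, a hedged sketch of the two filter functions referenced in the comments above, assuming someConditionOnKey is the same predicate used in the original flatMap:
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

// keep records whose key needs grouping
Function<Tuple2<Key, String>, Boolean> groupFilterFunc = t -> someConditionOnKey(t._1());
// keep records whose key does not need grouping
Function<Tuple2<Key, String>, Boolean> nogroupFilterFunc = t -> !someConditionOnKey(t._1());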

Sorting of 2 or more massive resultsets?

I need to be able to sort multiple intermediate result sets and write them to a file in sorted order. The sort is based on a single column/key value. Each result set record will be a list of values (like a record in a table).
The intermediate result sets are obtained by querying entirely different databases.
The intermediate result sets are already sorted on some key (or column). They need to be combined and sorted again on the same key (or column) before being written to a file.
Since these result sets can be massive (on the order of MBs), this cannot be done in memory.
My solution, broadly:
Use a hash and a random access file. Since the result sets are already sorted, when retrieving them I will store the sorted column values as keys in a HashMap. The value in the HashMap will be an address in the random access file where every record associated with that column value is stored.
Any ideas?
Have a pointer into every set, initially pointing to the first entry.
Then choose the next result from the set that offers the lowest entry.
Write this entry to the file and increment the corresponding pointer.
This approach has basically no overhead and time is O(n). (It's merge sort, btw.)
Edit
To clarify: It's the merge part of merge sort.
If you've got 2 pre-sorted result sets, you should be able to iterate them concurrently while writing the output file. You just need to compare the current row in each set:
Simple example (not ready for copy-and-paste use!):
ResultSet a, b;
// fetch a and b
a.first();
b.first();
while (!a.isAfterLast() || !b.isAfterLast()) {
    if (a.isAfterLast()) {
        writeToFile(b);
        b.next();
    } else if (b.isAfterLast()) {
        writeToFile(a);
        a.next();
    } else {
        int valueA = a.getInt("SORT_PROPERTY");
        int valueB = b.getInt("SORT_PROPERTY");
        if (valueA < valueB) {
            writeToFile(a);
            a.next();
        } else {
            writeToFile(b);
            b.next();
        }
    }
}
Sounds like you are looking for an implementation of the Balance Line algorithm.
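For more than two pre-sorted sources, the same balance-line/merge idea generalizes to a k-way merge with a priority queue. A minimal sketch (writeToFile here is a hypothetical consumer for the output, and records are compared by the single sort key):
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

public class KWayMerge {

    // One heap entry per source: the current record plus the iterator it came from.
    private static final class Entry {
        final String record;
        final Iterator<String> source;

        Entry(String record, Iterator<String> source) {
            this.record = record;
            this.source = source;
        }
    }

    // Merges any number of individually sorted sources in O(n log k),
    // holding only one record per source in memory at a time.
    static void merge(List<Iterator<String>> sources,
                      Comparator<String> byKey,
                      Consumer<String> writeToFile) {
        PriorityQueue<Entry> heap =
                new PriorityQueue<>(Comparator.comparing((Entry e) -> e.record, byKey));
        for (Iterator<String> it : sources) {
            if (it.hasNext()) {
                heap.add(new Entry(it.next(), it));
            }
        }
        while (!heap.isEmpty()) {
            Entry smallest = heap.poll();
            writeToFile.accept(smallest.record);
            if (smallest.source.hasNext()) {
                heap.add(new Entry(smallest.source.next(), smallest.source));
            }
        }
    }
}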
