Adding data to a hashmap from an apache-spark RDD operation (Java)

I've used a map step to create a JavaRDD object containing some objects I need. Based on those objects, I want to create a global hashmap containing some stats, but I can't figure out which RDD operation to use. At first I thought reduce would be the solution, but then I saw that you have to return the same type of objects. I'm not interested in reducing the items, but in gathering all the stats from all the machines (they can be computed separately and then just added up).
For example:
I have an RDD of objects containing an integer array among other stuff, and I want to compute how many times each of the integers has appeared in the array by putting them into a hashtable. Each machine should compute its own hashtable and then put them all in one place in the driver.

Often when you think you want to end up with a Map, you'd need to transform your records in the RDD into key-value pairs, and use reduceByKey.
Your specific example sounds exactly like the famous wordcount example (see first example here), only you want to count integers from an array within an object, instead of counting words from a sentence (String). In Scala, this would translate to:
import org.apache.spark.rdd.RDD
import scala.collection.Map

class Example {
  case class MyObj(ints: Array[Int], otherStuff: String)

  def countInts(input: RDD[MyObj]): Map[Int, Int] = {
    input
      .flatMap(_.ints)    // flatMap maps each record into several records - in this case, each int becomes a record
      .map(i => (i, 1))   // turn into key-value pairs, with preliminary value 1 for each key
      .reduceByKey(_ + _) // aggregate values by key
      .collectAsMap()     // collects data into a Map
  }
}
Generally, you should let Spark perform as much of the operation as possible in a distributed manner, and delay the collection into memory as much as possible - if you collect the values before reducing, often you'll run out of memory, unless your dataset is small enough to begin with (in which case, you don't really need Spark).
Edit: and here's the same code in Java (much longer, but identical...):
static class MyObj implements Serializable {
    Integer[] ints;
    String otherStuff;
}

Map<Integer, Integer> countInts(JavaRDD<MyObj> input) {
    return input
        .flatMap(new FlatMapFunction<MyObj, Integer>() {
            @Override
            public Iterable<Integer> call(MyObj myObj) throws Exception {
                return Arrays.asList(myObj.ints);
            }
        }) // flatMap maps each record into several records - in this case, each int becomes a record
        .mapToPair(new PairFunction<Integer, Integer, Integer>() {
            @Override
            public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
                return new Tuple2<>(integer, 1);
            }
        }) // turn into key-value pairs, with preliminary value 1
        .reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        }) // aggregate values by key
        .collectAsMap(); // collects data into a Map
}
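With Java 8 lambdas the same pipeline shrinks considerably. A sketch, assuming Java 8 and the same Spark 1.x Java API used above (where flatMap expects an Iterable):
Map<Integer, Integer> countInts(JavaRDD<MyObj> input) {
    return input
        .flatMap(myObj -> Arrays.asList(myObj.ints)) // each int becomes its own record
        .mapToPair(i -> new Tuple2<>(i, 1))          // key-value pairs with preliminary count 1
        .reduceByKey((a, b) -> a + b)                // sum the counts per key
        .collectAsMap();                             // collect the final counts to the driver
}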


Have to display the number of times an element has been added in my map JAVA

I've created a TreeMap with products.
And I want to count the number of times they repeat themselves, but have no clue what to code. Any hints? (I expect no code, just suggestions)
private static Map<Integer, String> shoppingCart() {
Map<Integer, String> result = new TreeMap<>();
result.put(1, "sausage");
result.put(2, "sausage");
result.put(3, "soup");
result.put(4, "egg");
result.put(5, "egg");
result.put(6, "tomato");
result.put(7, "sausage");
return result;
}
I was thinking about adding a counting variable, but still it doesn't fix the repeating problem.
Maybe not the best approach, but without modifying what you already have, you could use another map to store the products as keys and the quantity as the value for those keys:
Map<Integer, String> result = shoppingCart();
Map<String, Integer> productQuantities = new HashMap<>();
result.values().forEach(value ->
productQuantities.put(value,productQuantities.getOrDefault(value, 0) + 1));
To print the resulting map:
productQuantities.forEach((key, value) -> System.out.println(key + ":" + value));
I created a TreeMap with products, and I want to count the number of times they repeat themselves
Probably a different type of Map with keys representing items and values representing the corresponding count would be more handy. Something like:
NavigableMap<String, Integer> countByItem
Note: in order to access methods of the TreeMap like ceilingKey(), floorKey(), higherEntry(), lowerEntry(), etc., which are not defined in the Map interface, you need to use NavigableMap as the type.
And it might make sense to make the item a custom object instead of a String. That would guard you against typos, and it gives you the possibility of adding useful behavior to Item:
public class Item {
private int id;
private String name;
// constructor, getters, equals/hashCode, etc.
}
That's how a map of items can be updated using the Java 8 method merge(), which expects a key, a value, and a function responsible for merging the old value and the new one:
NavigableMap<Item, Integer> countByItem = new TreeMap<>(Comparator.comparingInt(Item::getId));
countByItem.merge(new Item(1, "sausage"),1, Integer::sum);
countByItem.merge(new Item(1, "sausage"),1, Integer::sum);
countByItem.merge(new Item(2, "soup"),1, Integer::sum);
If you don't feel very comfortable with Java 8 functions, instead of merge() you can use a combination of put() and getOrDefault():
Item sausage = new Item(1, "sausage");
countByItem.put(sausage, countByItem.getOrDefault(sausage, 0) + 1);
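The navigation methods mentioned above then become available. A small sketch reusing countByItem from the snippet above:
Map.Entry<Item, Integer> lowest = countByItem.firstEntry();                       // item with the lowest id
Map.Entry<Item, Integer> next = countByItem.higherEntry(new Item(1, "sausage"));  // first item with id > 1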
I can only guess at your goal. In your Map <Integer, String>, what does the Integer represent? Product number? Quantity? Sequence number? Something else?
If the Integer represents quantity, you have it backwards. It should be Map <String, Integer>. In a Map<X,Y>, the X represents the key. A Map allows fast lookup by the key. The Y is the value -- the information you want to find for a particular key, if the key is in the Map.
For example, if you want to add "sausage", you want to check if it is in the Map. If it isn't, put it into the Map with quantity 1. If it is, retrieve it and update the quantity.
If the Integer represents a sequence number (1st item, 2nd item, 3rd item, ...), you don't need a Map. Consider using an array or a data structure that preserves order, such as a List.
However, using an array or List still leaves you with the problem of how to find how many of each item are in the list, when duplicates are allowed, as they are in your problem. To solve that, consider a Map<String, Integer> where the Integer (map value) is the quantity, and the String (map key) is the product name.
If I were doing this, I'd create classes to allow me to glue together related information. Here is part of a hypothetical example, which might be more realistic than you need:
public class Product {
    private int upc; // product code, often represented with bar code
    private Decimal price;
    private String description;
    private String shortDescription;
    private ProductClass prodClass; // department, taxable, etc.
    // etc. -- add needed fields, or remove irrelevant ones
    // constructors, getters, setters, ...
}
Override .equals() and .hashCode() in Product. You can use the UPC for those.
If you use implements Comparable<Product>, you have the possibility of using binary search, or a search tree.
public class Receipt {
    private Decimal total;
    private Decimal taxableTotal;
    private Map<Product, Integer> shoppingCart; // Product, Quantity
    // etc.
}
When each item is scanned, you can look up the Product in the Map, and add it if not found, or update the quantity if found, as in the previous answers.
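For example, the scan step can be a one-liner with merge(). A sketch (scanItem is a hypothetical method on Receipt, not something from the question):
public void scanItem(Product product) {
    // quantity 1 if the product isn't in the cart yet, otherwise previous quantity + 1
    shoppingCart.merge(product, 1, Integer::sum);
}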

How to iterate over a map and return all the ones that match?

I have made various methods for someone to add a key which then includes various values from another created object.
I need to then allow a user to search using a method name which will then return all the people that match their search.
public Set findPerson(String aTrade)
{
Set<String> suitablePeople = new HashSet<>();
for (String test : this.contractors.keySet())
{
System.out.println(contractors.get(test));
if (contractors.containsValue(aTrade))
{
suitablePeople.add(test);
}
}
return suitablePeople;
}
I know this code is wrong but I'm just not sure how I can simply go through and find a value and return all the people that have this value within a range of values. For instance, their age, job, location.
Some assumptions, because your question is rather unclear:
contractors is a Map<String, ContractorData> field. Possibly ContractorData is some collection type (such as MyList<Contractor>), or named differently. The String represents a username.
aTrade is a string, and you want to search for it within the various ContractorData objects stored in your map. Then you want to return a collection of all username strings that are mapped to a ContractorData object that contains a trade that matches aTrade.
Whatever ContractorData might be, it has a method containsValue(String) which returns true if the contractor data is considered a match. (If that was pseudocode and it's actually a List<String>, just .contains() would do the job. If it's something else you're going to have to elaborate in your question.)
Then, there is no fast search available; maps allow you to do quick searches on their key (and not any particular property of their key, and not on their value or any particular property of their value). Thus, any search inherently implies you go through all the key/value mappings and check for each, individually, if it matches or not. If this is not an acceptable performance cost, you'd have to make another map, one that maps this property to something. This may have to be a multimap, and is considerably more complicated.
The performance cost is not important
Okay, then, just.. loop, but note that the .entrySet() gives you both key (which you'll need in case it's a match) and value (which you need to check if it matches), so that's considerably simpler.
var out = new ArrayList<String>();
for (var e : contractors.entrySet()) {
    if (e.getValue().containsValue(aTrade)) out.add(e.getKey());
}
return out;
or if you prefer stream syntax:
return contractors.entrySet().stream()
    .filter(e -> e.getValue().containsValue(aTrade))
    .map(Map.Entry::getKey)
    .toList();
The performance cost is important
Then it gets complicated. You'd need a single object that 'wraps' around at least two maps, and you need this because you want these maps to never go 'out of sync'. You need one map for each thing you want to have a find method for.
Thus, if you want a getTradesForUser(String username) as well as a findAllUsersWithTrade(String aTrade), you need two maps; one that maps users to trades, one that maps trades to users. In addition, you need the concept of a multimap: A map that maps one key to (potentially) more than one value.
You can use Guava's Multimaps (Guava is a third-party library with some useful stuff, such as multimaps), or you can roll your own, which is trivial:
given:
class ContractData {
    private List<String> trades;

    public boolean containsValue(String trade) {
        return trades.contains(trade);
    }

    public List<String> getTrades() {
        return trades;
    }
}
then:
class TradesStore {
    Map<String, ContractData> usersToTrades = new HashMap<>();
    Map<String, List<String>> tradesToUsers = new HashMap<>();

    public void put(String username, ContractData contract) {
        usersToTrades.put(username, contract);
        for (String trade : contract.getTrades()) {
            tradesToUsers.computeIfAbsent(trade, k -> new ArrayList<>()).add(username);
        }
    }

    public Collection<String> getUsersForTrade(String trade) {
        return tradesToUsers.getOrDefault(trade, List.of());
    }
}
The getOrDefault method lets you specify a default value in case the trade isn't in the map. Thus, if you ask for 'get me all users which have trade [SOME_VALUE_NOBODY_IS_TRADING]', this returns an empty list (List.of() gives you an empty list), which is the right answer (null would be wrong - there IS an answer, and it is: Nobody. null means: unknown / irrelevant, and is therefore incorrect here).
The computeIfAbsent method just gets you the value associated with a key, but, if there is no such key/value mapping yet, you also give it the code required to make it. Here, we pass a function (k -> new ArrayList<>()) which says: just.. make a new arraylist first if I ask for a key that isn't yet in there, put it in the map, and then return that (k is the key, which we don't need to make the initial value).
Thus, computeIfAbsent and getOrDefault in combination make the concept of a multimap easy to write.
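For completeness, a usage sketch (the ContractData constructor taking a list of trades is assumed/hypothetical, since only the fields were shown above):
TradesStore store = new TradesStore();
store.put("alice", new ContractData(List.of("plumbing", "electrical")));
store.put("bob", new ContractData(List.of("plumbing")));
System.out.println(store.getUsersForTrade("plumbing")); // [alice, bob]
System.out.println(store.getUsersForTrade("masonry"));  // []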
Assuming that your Map's values are instances of Contractor, and the Contractor class has a Set<String> of trades (implied by the contains method call) and a getTrades() method that returns that set, you could do it like this. (Not certain what role the Map key would play in this.)
Get the values from the map and stream them.
Filter only those Contractors that have the appropriate trade.
Aggregate to a set of suitable contractors.
Set<Contractor> suitablePeople =
contractors.values()
.stream()
.filter(c->c.getTrades().contains(aTrade))
.collect(Collectors.toSet());
Note that a possible improvement would be to have a map like the following.
Map<String, Set<Contractor>> // where the key is the desired trade.
Then you could just get the Contractors with a single lookup for each desired trade.
Set<Contractor> plumbers = mapByTrade.get("plumbing"); // all done.
Here is how you would set it up. The Contractor class is at the end. It takes a name and a variable array of trades.
Set<Contractor> contractors = Set.of(
new Contractor("Acme", "plumbing", "electrical", "masonry", "roofing", "carpet"),
new Contractor("Joe's plumbing", "plumbing"),
new Contractor("Smith", "HVAC", "electrical"),
new Contractor("Ace", "electrical"));
Then iterate over the contractors to create the map. They are grouped by trade, and each contractor is added to the associated set for each of its trades.
Map<String,Set<Contractor>> mapByTrade = new HashMap<>();
for (Contractor c : contractors) {
for (String trade : c.getTrades()) {
mapByTrade.computeIfAbsent(trade, v->new HashSet<>()).add(c);
}
}
And here it is in action.
Set<Contractor> plumbers = mapByTrade.get("plumbing");
System.out.println(plumbers);
System.out.println(mapByTrade.get("electrical"));
System.out.println(mapByTrade.get("HVAC"));
prints
[Acme, Joe's plumbing]
[Ace, Acme, Smith]
[Smith]
And here is the Contractor class.
class Contractor {
    private Set<String> trades;
    private String name;

    @Override
    public int hashCode() {
        return name.hashCode();
    }

    @Override
    public boolean equals(Object ob) {
        if (ob == this) {
            return true;
        }
        if (ob == null) {
            return false;
        }
        if (ob instanceof Contractor) {
            return ((Contractor) ob).name.equals(this.name);
        }
        return false;
    }

    public Contractor(String name, String... trades) {
        this.name = name;
        this.trades = new HashSet<>(Arrays.asList(trades));
    }

    public Set<String> getTrades() {
        return trades;
    }

    @Override
    public String toString() {
        return name;
    }
}

Partition Strategy for applying multiple JOINs on a Flink DataSet

I am using Flink 1.4.0.
Suppose I have a POJO as follows:
public class Rating {
public String name;
public String labelA;
public String labelB;
public String labelC;
...
}
and a JOIN function:
public class SetLabelA implements JoinFunction<Tuple2<String, Rating>, Tuple2<String, String>, Tuple2<String, Rating>> {
    @Override
    public Tuple2<String, Rating> join(Tuple2<String, Rating> rating, Tuple2<String, String> labelA) {
        rating.f1.setLabelA(labelA.f1);
        return rating;
    }
}
and suppose I want to apply a JOIN operation to set the values of each field in a DataSet<Tuple2<String, Rating>>, which I can do as follows:
DataSet<Tuple2<String, Rating>> ratings = // [...]
DataSet<Tuple2<String, Double>> aLabels = // [...]
DataSet<Tuple2<String, Double>> bLabels = // [...]
DataSet<Tuple2<String, Double>> cLabels = // [...]
...
DataSet<Tuple2<String, Rating>>
newRatings =
ratings.leftOuterJoin(aLabels, JoinOperatorBase.JoinHint.REPARTITION_SORT_MERGE)
// key of the first input
.where("f0")
// key of the second input
.equalTo("f0")
// applying the JoinFunction on joining pairs
.with(new SetLabelA());
Unfortunately, this is necessary as both ratings and all xLabels are very big DataSets and I am forced to look into each of the xlabels to find the field values I require, while at the same time it is not the case that all rating keys exist in each xlabels.
This practically means that I have to perform a leftOuterJoin per xlabel, for which I need to also create the respective JoinFunction implementation that utilises the correct setter from the Rating POJO.
Is there a more efficient way to solve this that anyone can think of?
As far as the partitioning strategy goes, I have made sure to sort the DataSet<Tuple2<String, Rating>> ratings with:
DataSet<Tuple2<String, Rating>> sorted_ratings = ratings.sortPartition(0, Order.ASCENDING).setParallelism(1);
By setting parallelism to 1 I can be sure that the whole dataset will be ordered. I then use .partitionByRange:
DataSet<Tuple2<String, Rating>> partitioned_ratings = sorted_ratings.partitionByRange(0).setParallelism(N);
where N is the number of cores I have on my VM. Another side question I have here is whether the first .setParallelism which is set to 1 is restrictive in terms of how the rest of the pipeline is executed, i.e. can the follow up .setParallelism(N) change how the DataSet is processed?
Finally, I did all these so that when partitioned_ratings is joined with a xlabels DataSet, the JOIN operation will be done with JoinOperatorBase.JoinHint.REPARTITION_SORT_MERGE. According to Flink docs for v.1.4.0:
REPARTITION_SORT_MERGE: The system partitions (shuffles) each input (unless the input is already partitioned) and sorts each input (unless it is already sorted). The inputs are joined by a streamed merge of the sorted inputs. This strategy is good if one or both of the inputs are already sorted.
So in my case, ratings is sorted (I think) and each of the xlabels DataSets are not, hence it makes sense that this is the most efficient strategy. Anything wrong with this? Any alternative approaches?
So far I haven't been able to pull through this strategy. It seems like relying on JOINs is too troublesome as they are expensive operations and one should avoid them unless they are really necessary.
For instance, JOINs should be used if both Datasets are very big in size. If they are not, a convenient alternative is the use of Broadcast Variables, by which one of the two Datasets (the smaller one) is broadcast across the workers for whatever purpose it is used. An example appears below (copied from this link for convenience):
DataSet<Point> points = env.readCsv(...);
DataSet<Centroid> centroids = ...; // some computation
points.map(new RichMapFunction<Point, Integer>() {
    private List<Centroid> centroids;

    @Override
    public void open(Configuration parameters) {
        this.centroids = getRuntimeContext().getBroadcastVariable("centroids");
    }

    @Override
    public Integer map(Point p) {
        return selectCentroid(centroids, p);
    }
}).withBroadcastSet(centroids, "centroids");
Also, since populating the fields of a POJO implies that quite similar code will be leveraged repeatedly, one should definitely use jlens to avoid code repetition and write a more concise and easy-to-follow solution.
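Applied to the concrete case in the question, a broadcast-based variant could look roughly like this. This is only a sketch: it assumes aLabels is small enough to broadcast, and that the labels are Tuple2<String, String>, as in the SetLabelA signature above:
DataSet<Tuple2<String, Rating>> withLabelA = ratings
    .map(new RichMapFunction<Tuple2<String, Rating>, Tuple2<String, Rating>>() {
        private Map<String, String> labelAByKey;

        @Override
        public void open(Configuration parameters) {
            // build a lookup table from the broadcast label DataSet
            labelAByKey = new HashMap<>();
            List<Tuple2<String, String>> broadcast =
                    getRuntimeContext().getBroadcastVariable("aLabels");
            for (Tuple2<String, String> t : broadcast) {
                labelAByKey.put(t.f0, t.f1);
            }
        }

        @Override
        public Tuple2<String, Rating> map(Tuple2<String, Rating> rating) {
            String label = labelAByKey.get(rating.f0);
            if (label != null) { // left-outer semantics: keep ratings that have no label
                rating.f1.setLabelA(label);
            }
            return rating;
        }
    })
    .withBroadcastSet(aLabels, "aLabels");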

Get total for unique items from Map in java

I currently have a map which stores the following information:
Map<String,String> animals= new HashMap<String,String>();
animals.put("cat","50");
animals.put("bat","38");
animals.put("dog","19");
animals.put("cat","31");
animals.put("cat","34");
animals.put("bat","1");
animals.put("dog","34");
animals.put("cat","55");
I want to create a new map with total for unique items in the above map. So in the above sample, count for cat would be 170, count for bat would be 39 and so on.
I have tried using Set to find unique animal entries in the map; however, I am unable to get the total count for each unique entry.
First, don't use String for arithmetic, use int or double (or BigInteger/BigDecimal, but that's probably overkill here). I'd suggest making your map a Map<String, Integer>.
Second, Map.put() will overwrite the previous value if the given key is already present in the map, so as @Guy points out your map actually only contains {cat:55, dog:34, bat:1}. You need to get the previous value somehow in order to preserve it.
The classic way (pre-Java-8) is like so:
public static void putOrUpdate(Map<String, Integer> map, String key, int value) {
Integer previous = map.get(key);
if (previous != null) {
map.put(key, previous + value);
} else {
map.put(key, value);
}
}
Java 8 adds a number of useful methods to Map to make this pattern easier, like Map.merge() which does the put-or-update for you:
map.merge(key, value, (p, v) -> p + v);
You may also find that a multiset is a better data structure to use as it handles incrementing/decrementing for you; Guava provides a nice implementation.
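For illustration, a sketch with Guava's Multiset (com.google.common.collect), which does the put-or-update bookkeeping for you:
Multiset<String> counts = HashMultiset.create();
counts.add("cat", 50);
counts.add("bat", 38);
counts.add("cat", 31);
System.out.println(counts.count("cat")); // 81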
As Guy said, right now you have one bat, one dog and one cat. Subsequent 'put's will override your past values; by definition, a Map stores key-value pairs where each key is unique. If you have to do it with a map, you can sum the values as you go. For example, if you want to add another value for cat and update it, you can do it this way:
animals.put("cat", animals.get("cat") + yourNewValue);
Your value for cat will be updated. This example assumes the values are numbers (int/long/float), not Strings as in your case. If you have to keep the values as Strings, you can use:
animals.put("cat", Integer.toString(Integer.parseInt(animals.get("cat")) + yourNewValue));
However, it's ugly. I'd recommend creating a
Map<String, Integer> animals = new HashMap<String, Integer>();
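Built that way, the totals from the question fall out directly. A sketch using merge(), as shown in the other answer:
Map<String, Integer> animals = new HashMap<>();
animals.merge("cat", 50, Integer::sum);
animals.merge("bat", 38, Integer::sum);
animals.merge("dog", 19, Integer::sum);
animals.merge("cat", 31, Integer::sum);
animals.merge("cat", 34, Integer::sum);
animals.merge("bat", 1, Integer::sum);
animals.merge("dog", 34, Integer::sum);
animals.merge("cat", 55, Integer::sum);
// animals now contains cat=170, bat=39 and dog=53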

Spark flatMap/reduce: How to scale and avoid OutOfMemory?

I am migrating some map-reduce code into Spark, and having problems when constructing an Iterable to return in the function.
In the MR code, I had a reduce function that grouped by key and then would iterate the values and write them out (using multipleOutputs, but that's unimportant), with code like this (simplified):
reduce(Key key, Iterable<Text> values) {
    // ... some code
    for (Text xml : values) {
        multipleOutputs.write(key, xml, directory);
    }
}
However, in Spark I have translated a map and this reduce into a sequence of:
mapToPair -> groupByKey -> flatMap
as recommended... in some book.
mapToPair basically adds a Key via functionMap, which, based on some values in the record, creates a Key for that record. Sometimes a key may have very high cardinality.
JavaPairRDD<Key, String> rddPaired = inputRDD.mapToPair(new PairFunction<String, Key, String>() {
public Tuple2<Key, String> call(String value) {
//...
return functionMap.call(value);
}
});
rddPaired then has RDD.groupByKey() applied to it to get the RDD that feeds the flatMap function:
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.groupByKey();
Once grouped, a flatMap call does the reduce. Here, operation is a transformation:
public Iterable<String> call(Tuple2<Key, Iterable<String>> keyValue) {
    // some code...
    List<String> out = new ArrayList<String>();
    if (someConditionOnKey) {
        // do a logic
        Grouper grouper = new Grouper();
        for (String xml : keyValue._2()) {
            // group in a separate class
            grouper.add(xml);
        }
        // operation is now performed on the whole group
        out.add(operation(grouper));
    } else {
        for (String xml : keyValue._2()) {
            out.add(operation(xml));
        }
    }
    return out;
}
It works fine... with keys that don't have too many records. Actually, it breaks with an OutOfMemory error when a key with a lot of values enters the "else" branch of the reduce.
Note: I have included the "if" part to explain the logic I want to produce, but the failure happens when entering the "else"... because when data enters the "else", it normally means there will be many more values for that key, due to the nature of the data.
It is clear that, having to keep all of the grouped values in the "out" list, it won't scale if a key has millions of records, because they are all kept in memory. I have reached the point where the OOM happens (yes, it's when performing the "operation" above, which asks for memory and none is available; it's not a very memory-expensive operation, though).
Is there any way to avoid this in order to scale? Either by replicating behaviour using some other directives to reach the same output in a more scalable way, or to be able to hand to Spark the values for merging (just as I used to do with MR)...
It's inefficient to do the condition check inside the flatMap operation. You should check the condition outside, to create 2 distinct RDDs and deal with them separately.
rddPaired.cache();
// groupFilterFunc will filter which items need grouping
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.filter(groupFilterFunc).groupByKey();
// processGroupedValuesFunction should call `operation` on group of all values with the same key and return the result
rddGrouped.mapValues(processGroupedValuesFunction);
// nogroupFilterFunc will filter which items don't need grouping
JavaPairRDD<Key, Iterable<String>> rddNoGrouped = rddPaired.filter(nogroupFilterFunc);
// processNoGroupedValuesFunction2 should call `operation` on a single value and return the result
rddNoGrouped.mapValues(processNoGroupedValuesFunction2);
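For reference, the two filters might look like this with Java 8 lambdas (a sketch, assuming the someConditionOnKey check from the question can be expressed as a predicate over the key alone, as its name suggests):
// Function here is org.apache.spark.api.java.function.Function
Function<Tuple2<Key, String>, Boolean> groupFilterFunc =
        pair -> someConditionOnKey(pair._1());
Function<Tuple2<Key, String>, Boolean> nogroupFilterFunc =
        pair -> !someConditionOnKey(pair._1());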
