Let's say I have this piece of code. As far as I know, with the code below, if I have 10 queries and run them at the same time, each returning 10M results, I have to wait for all 100M rows to be fetched from the database before the group function starts.
My problem: the cardinality of the Country and City cartesian product is low, while the number of rows I have to fetch from the database is huge. I want to update the group result immediately as each row is fetched from the database. How can I do that using Java Streams?
myqueries
.parallelStream()
.map(m -> {
//queryResult is a Stream that returns database rows
return queryResult;
})
.flatMap(fm -> fm)
.collect(Collectors.groupingBy(g-> {
List<Object> objects = Arrays.<Object>asList(
g.getCountry(),
g.getCity());
return objects;
}, Collectors.toList()))
.entrySet().stream().map(m-> {
MyResultClass item = new MyResultClass();
item.setCountry((String) m.getKey().get(0));
item.setCity((String) m.getKey().get(1));
item.setSumField1(m.getValue().stream().mapToDouble(m2-> m2.getSumField1()).sum());
item.setSumField2(m.getValue().stream().mapToDouble(m2-> m2.getSumField2()).sum());
item.setSumField3(m.getValue().stream().mapToDouble(m2-> m2.getSumField3()).sum());
return item;
}).forEach(f-> {
//print the MyResultClass fields
});
The problem with your solution is that you are collecting all the data into a list just to do a further reduction, so it accumulates all the data in memory. You can combine both reductions into a single one using toMap, like this:
myqueries
.parallelStream()
.flatMap(m -> {
//queryResult is a Stream that returns database rows
return queryResult;
})
.collect(Collectors.toMap(
g-> Arrays.<Object>asList(g.getCountry(), g.getCity()),
v -> {
MyResultClass item = new MyResultClass();
item.setCountry(v.getCountry());
item.setCity(v.getCity());
// seed the sums from the first row so single-row groups are correct
item.setSumField1(v.getSumField1());
item.setSumField2(v.getSumField2());
item.setSumField3(v.getSumField3());
return item;
},
(t, u) -> {
t.setSumField1(t.getSumField1() + u.getSumField1());
t.setSumField2(t.getSumField2() + u.getSumField2());
t.setSumField3(t.getSumField3() + u.getSumField3());
return t;
}
))
.values().forEach(f-> {
//print the MyResultClass fields
});
Also, note that using parallelStream here does not mean all queries will run in parallel. Parallelism depends on the number of queries, the number of cores on your machine, and the runtime environment. If you want to control the concurrent query behaviour, better to use an ExecutorService.
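For example, a minimal sketch with a fixed pool (runQuery and Row are placeholder names for your query-execution code, not names from the snippets above):
ExecutorService executor = Executors.newFixedThreadPool(10); // at most 10 queries in flight
List<Future<Stream<Row>>> futures = myqueries.stream()
.map(q -> executor.submit(() -> runQuery(q))) // each query runs as its own task
.collect(Collectors.toList());
// consume each Future's stream as it completes, then release the pool
executor.shutdown();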
Another point to note: execution also depends on how you create the Stream from the query result in the first place. If you wait until you have all the results and only then create the Stream, you defeat the purpose of the question itself.
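For instance, one way to expose rows lazily as they arrive (a sketch, assuming an open JDBC ResultSet named resultSet and a placeholder mapRow that converts the current row into your row object):
Stream<Row> rows = StreamSupport.stream(
new Spliterators.AbstractSpliterator<Row>(Long.MAX_VALUE, Spliterator.ORDERED) {
@Override
public boolean tryAdvance(Consumer<? super Row> action) {
try {
if (!resultSet.next()) return false; // no more rows: end the stream
action.accept(mapRow(resultSet)); // push one row downstream immediately
return true;
} catch (SQLException e) {
throw new RuntimeException(e);
}
}
}, false);
This way each row flows into the collector as soon as it is fetched, instead of being buffered into a collection first.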
Related
I have the following stream code:
List<Data> results = items.stream()
.map(item -> requestDataForItem(item))
.filter(data -> data.isValid())
.collect(Collectors.toList());
Data requestDataForItem(Item item) {
// call another service here
}
The problem is that I want to call
requestDataForItem only when all elements in the stream are valid.
For example,
if the first item is invalid I don't want to make the call for any element in the stream.
There is .allMatch in the stream API,
but it returns a boolean.
I want to do the same as .allMatch and then
.collect the result when everything matched.
Also, I want to process the stream only once;
with two loops it would be easy.
Is this possible with the Java Streams API?
This would be a job for Java 9:
List<Data> results = items.stream()
.map(item -> requestDataForItem(item))
.takeWhile(data -> data.isValid())
.collect(Collectors.toList());
This operation will stop at the first invalid element. In a sequential execution, this implies that no subsequent requestDataForItem calls are made. In a parallel execution, some additional elements might get processed concurrently, before the operation stops, but that’s the price for efficient parallel processing.
In either case, the result list will only contain the elements before the first encountered invalid element and you can easily check using results.size() == items.size() whether all elements were valid.
In Java 8, there is no such simple method, and using an additional library or rolling your own implementation of takeWhile wouldn't pay off considering how simple the non-stream solution would be:
List<Data> results = new ArrayList<>();
for(Item item: items) {
Data data = requestDataForItem(item);
if(!data.isValid()) break;
results.add(data);
}
You could theoretically use .allMatch then collect if .allMatch returns true, but then you'd be processing the collection twice. There's no way to do what you're trying to do with the streams API directly.
You could create a method to do this for you and simply pass your collection to it as opposed to using the stream API. This is slightly less elegant than using the stream API but more efficient as it processes the collection only once.
List<Data> results = getAllIfValid(
items.stream()
.map(item -> requestDataForItem(item))
.collect(Collectors.toList())
);
public List<Data> getAllIfValid(List<Data> items) {
List<Data> results = new ArrayList<>();
for (Data d : items) {
if (!d.isValid()) {
return new ArrayList<>();
}
results.add(d);
}
return results;
}
This will return all the results if every element passes and only processes the items collection once. If any fail the isValid() check, it'll return an empty list as you want all or nothing. Simply check to see if the returned collection is empty to see whether or not all items passed the isValid() check.
Implement a two-step process:
test if allMatch returns true.
If it does, do the collect with a second stream.
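A sketch of that approach (note that requestDataForItem is invoked twice per element, which is the price of the two passes):
boolean allValid = items.stream()
.map(item -> requestDataForItem(item))
.allMatch(data -> data.isValid());
List<Data> results = allValid
? items.stream().map(item -> requestDataForItem(item)).collect(Collectors.toList())
: Collections.emptyList();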
Try this.
List<Data> result = new ArrayList<>();
boolean allValid = items.stream()
.map(item -> requestDataForItem(item))
.allMatch(data -> data.isValid() && result.add(data)); // relies on side effects; only safe on a sequential stream
if (!allValid)
result.clear();
_logger.info("data size : " + saleData.size());
saleData.parallelStream().forEach(data -> {
SaleAggrData saleAggrData = new SaleAggrData() {
{
setCatId(data.getCatId());
setRevenue(RoundUpUtil.roundUpDouble(data.getRevenue()));
setMargin(RoundUpUtil.roundUpDouble(data.getMargin()));
setUnits(data.getUnits());
setMarginRate(ComputeUtil.marginRate(data.getRevenue(), data.getMargin()));
setOtd(ComputeUtil.OTD(data.getRevenue(), data.getUnits()));
setSaleDate(data.getSaleDate());
setDiscountDepth(ComputeUtil.discountDepth(data.getRegularPrice(), data.getRevenue()));
setTransactions(data.getTransactions());
setUpt(ComputeUtil.UPT(data.getUnits(), data.getTransactions()));
}
};
salesAggrData.addSaleAggrData(saleAggrData);
});
The issue with the code is that when I get a response from the DB and iterate over it using a parallel stream, the data size is different every time, whereas with a sequential stream it works fine.
I can't use a sequential stream because the data is huge and it takes too long.
Any lead would be helpful.
You are adding elements in parallel to salesAggrData which I'm assuming is some Collection. If it's not a thread-safe Collection, no wonder you get inconsistent results.
Instead of forEach, why don't you use map() and then collect the result into some Collection?
List<SaleAggrData> salesAggrData =
saleData.parallelStream()
.map(data -> {
SaleAggrData saleAggrData = new SaleAggrData() {
{
setCatId(data.getCatId());
setRevenue(RoundUpUtil.roundUpDouble(data.getRevenue()));
setMargin(RoundUpUtil.roundUpDouble(data.getMargin()));
setUnits(data.getUnits());
setMarginRate(ComputeUtil.marginRate(data.getRevenue(), data.getMargin()));
setOtd(ComputeUtil.OTD(data.getRevenue(), data.getUnits()));
setSaleDate(data.getSaleDate());
setDiscountDepth(ComputeUtil.discountDepth(data.getRegularPrice(), data.getRevenue()));
setTransactions(data.getTransactions());
setUpt(ComputeUtil.UPT(data.getUnits(), data.getTransactions()));
}
};
return saleAggrData;
})
.collect(Collectors.toList());
BTW, I'd probably change that anonymous class instance creation, and use a constructor of a named class to create the SaleAggrData instances.
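For instance, something like this (a sketch; SaleData as the element type and the exact field list and types are assumptions based on the setters above):
public class SaleAggrData {
private final int catId; // field types are assumed
private final double revenue;
private final double margin;
// ... remaining fields, mirroring the setters above ...
public SaleAggrData(SaleData data) {
this.catId = data.getCatId();
this.revenue = RoundUpUtil.roundUpDouble(data.getRevenue());
this.margin = RoundUpUtil.roundUpDouble(data.getMargin());
// ... initialize the remaining fields the same way ...
}
}
The map step then shrinks to .map(SaleAggrData::new), and you avoid the extra anonymous class (and captured enclosing reference) that double-brace initialization creates.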
I've found a lot of answers regarding RxJava, but I want to understand how it works in Reactor.
My current understanding is very vague: I tend to think of map as being synchronous and flatMap as asynchronous, but I can't really get my head around it.
Here is an example:
files.flatMap { it ->
Mono.just(Paths.get(UPLOAD_ROOT, it.filename()).toFile())
.map {destFile ->
destFile.createNewFile()
destFile
}
.flatMap(it::transferTo)
}.then()
I have files (a Flux<FilePart>) and I want to copy them to some UPLOAD_ROOT on the server.
This example is taken from a book.
I can change all the .map to .flatMap and vice versa and everything still works. I wonder what the difference is.
map is for synchronous, non-blocking, 1-to-1 transformations
flatMap is for asynchronous (non-blocking) 1-to-N transformations
The difference is visible in the method signature:
map takes a Function<T, U> and returns a Flux<U>
flatMap takes a Function<T, Publisher<V>> and returns a Flux<V>
That's the major hint: you can pass a Function<T, Publisher<V>> to a map, but it wouldn't know what to do with the Publishers, and that would result in a Flux<Publisher<V>>, a sequence of inert publishers.
On the other hand, flatMap expects a Publisher<V> for each T. It knows what to do with it: subscribe to it and propagate its elements in the output sequence. As a result, the return type is Flux<V>: flatMap will flatten each inner Publisher<V> into the output sequence of all the Vs.
About the 1-N aspect:
for each <T> input element, flatMap maps it to a Publisher<V>. In some cases (e.g. an HTTP request), that publisher will emit only one item, in which case we're pretty close to an async map.
But that's the degenerate case. The generic case is that a Publisher can emit multiple elements, and flatMap works just as well.
For an example, imagine you have a reactive database and you flatMap from a sequence of user IDs, with a request that returns a user's set of Badge. You end up with a single Flux<Badge> of all the badges of all these users.
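A sketch of that shape (the repository names are made up for illustration):
Flux<Badge> allBadges = userIds // a Flux<String> of user IDs
.flatMap(id -> badgeRepository.findBadgesByUserId(id)); // each call returns a Flux<Badge>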
Is map really synchronous and non-blocking?
Yes: it is synchronous in the way the operator applies it (a simple method call, and then the operator emits the result) and non-blocking in the sense that the function itself shouldn't block the operator calling it. In other terms it shouldn't introduce latency. That's because a Flux is still asynchronous as a whole. If it blocks mid-sequence, it will impact the rest of the Flux processing, or even other Flux.
If your map function is blocking or introduces latency but cannot be converted to return a Publisher, consider publishOn/subscribeOn to offload that blocking work to a separate thread.
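For example (a sketch; blockingTransform stands in for your blocking call, and boundedElastic may be elastic on older Reactor versions):
Flux<Result> results = source
.publishOn(Schedulers.boundedElastic()) // downstream operators run off the source's thread
.map(value -> blockingTransform(value)); // the blocking work no longer stalls the source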
The flatMap method is similar to the map method with the key difference that the supplier you provide to it should return a Mono<T> or Flux<T>.
Using the map method would result in a Mono<Mono<T>>
whereas using flatMap results in a Mono<T>.
For example, it is useful when you have to make a network call to retrieve data, with a java API that returns a Mono, and then another network call that needs the result of the first one.
// Signature of the HttpClient.get method
Mono<JsonObject> get(String url);
// The two urls to call
String firstUserUrl = "my-api/first-user";
String userDetailsUrl = "my-api/users/details/"; // needs the id at the end
// Example with map
Mono<Mono<JsonObject>> result = HttpClient.get(firstUserUrl).
map(user -> HttpClient.get(userDetailsUrl + user.getId()));
// This results with a Mono<Mono<...>> because HttpClient.get(...)
// returns a Mono
// Same example with flatMap
Mono<JsonObject> bestResult = HttpClient.get(firstUserUrl).
flatMap(user -> HttpClient.get(userDetailsUrl + user.getId()));
// Now the result has the type we expected
Also, it allows for handling errors precisely:
public class UserApi {
private HttpClient httpClient;
Mono<User> findUser(String username) {
String queryUrl = "http://my-api-address/users/" + username;
return Mono.fromCallable(() -> httpClient.get(queryUrl)).
flatMap(response -> {
if (response.statusCode == 404) return Mono.error(new NotFoundException("User " + username + " not found"));
else if (response.statusCode == 500) return Mono.error(new InternalServerErrorException());
else if (response.statusCode != 200) return Mono.error(new Exception("Unknown error calling my-api"));
return Mono.just(response.data);
});
}
}
How map works internally in Reactor.
Creating a Player class.
@Data // Lombok: generates getters, setters, equals/hashCode
@AllArgsConstructor
public class Player {
String firstName;
String lastName;
}
Now, creating some instances of the Player class:
Flux<Player> players = Flux.just(
"Zahid Khan",
"Arif Khan",
"Obaid Sheikh")
.map(fullname -> {
String[] split = fullname.split("\\s");
return new Player(split[0], split[1]);
});
StepVerifier.create(players)
.expectNext(new Player("Zahid", "Khan"))
.expectNext(new Player("Arif", "Khan"))
.expectNext(new Player("Obaid", "Sheikh"))
.verifyComplete();
What’s important to understand about the map() is that the mapping is
performed synchronously, as each item is published by the source Flux.
If you want to perform the mapping asynchronously, you should consider
the flatMap() operation.
How flatMap works internally.
Flux<Player> players = Flux.just(
"Zahid Khan",
"Arif Khan",
"Obaid Sheikh")
.flatMap(
fullname ->
Mono.just(fullname).map(p -> {
String[] split = p.split("\\s");
return new Player(split[0], split[1]);
}).subscribeOn(Schedulers.parallel()));
List<Player> playerList = Arrays.asList(
new Player("Zahid", "Khan"),
new Player("Arif", "Khan"),
new Player("Obaid", "Sheikh"));
StepVerifier.create(players)
.expectNextMatches(player ->
playerList.contains(player))
.expectNextMatches(player ->
playerList.contains(player))
.expectNextMatches(player ->
playerList.contains(player))
.verifyComplete();
Internally in flatMap(), a map() operation is performed on the Mono to transform the String into a Player. Furthermore, subscribeOn() indicates that each subscription should take place on a parallel thread. In the absence of subscribeOn(), flatMap() effectively behaves synchronously.
map is for synchronous, non-blocking, one-to-one transformations,
while flatMap is for asynchronous (non-blocking) one-to-many transformations.
I am using Flink 1.4.0.
Suppose I have a POJO as follows:
public class Rating {
public String name;
public String labelA;
public String labelB;
public String labelC;
...
}
and a JOIN function:
public class SetLabelA implements JoinFunction<Tuple2<String, Rating>, Tuple2<String, String>, Tuple2<String, Rating>> {
@Override
public Tuple2<String, Rating> join(Tuple2<String, Rating> rating, Tuple2<String, String> labelA) {
rating.f1.setLabelA(labelA.f1); // labelA.f1 carries the label value
return rating;
}
}
and suppose I want to apply a JOIN operation to set the values of each field in a DataSet<Tuple2<String, Rating>>, which I can do as follows:
DataSet<Tuple2<String, Rating>> ratings = // [...]
DataSet<Tuple2<String, String>> aLabels = // [...]
DataSet<Tuple2<String, String>> bLabels = // [...]
DataSet<Tuple2<String, String>> cLabels = // [...]
...
DataSet<Tuple2<String, Rating>>
newRatings =
ratings.leftOuterJoin(aLabels, JoinOperatorBase.JoinHint.REPARTITION_SORT_MERGE)
// key of the first input
.where("f0")
// key of the second input
.equalTo("f0")
// applying the JoinFunction on joining pairs
.with(new SetLabelA());
Unfortunately, this is necessary, as both ratings and all the xlabels are very big DataSets and I am forced to look into each of the xlabels to find the field values I require, while at the same time not all rating keys exist in every xlabels.
This practically means that I have to perform a leftOuterJoin per xlabel, for which I also need to create the respective JoinFunction implementation that uses the correct setter from the Rating POJO.
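(To make the repetition concrete: the only generic variant I can see would be something like the sketch below, where the setter is passed in. The serializable consumer interface is an assumption, needed because Flink serialises user functions.)
// hypothetical generic setter-based join function, one instance per label
interface RatingSetter extends BiConsumer<Rating, String>, Serializable {}
public class SetLabel implements JoinFunction<Tuple2<String, Rating>, Tuple2<String, String>, Tuple2<String, Rating>> {
private final RatingSetter setter;
public SetLabel(RatingSetter setter) { this.setter = setter; }
@Override
public Tuple2<String, Rating> join(Tuple2<String, Rating> rating, Tuple2<String, String> label) {
if (label != null) setter.accept(rating.f1, label.f1); // left outer join: label may be null
return rating;
}
}
// usage: .with(new SetLabel(Rating::setLabelA))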
Is there a more efficient way to solve this that anyone can think of?
As far as the partitioning strategy goes, I have made sure to sort the DataSet<Tuple2<String, Rating>> ratings with:
DataSet<Tuple2<String, Rating>> sorted_ratings = ratings.sortPartition(0, Order.ASCENDING).setParallelism(1);
By setting parallelism to 1 I can be sure that the whole dataset will be ordered. I then use .partitionByRange:
DataSet<Tuple2<String, Rating>> partitioned_ratings = sorted_ratings.partitionByRange(0).setParallelism(N);
where N is the number of cores I have on my VM. Another side question I have here is whether the first .setParallelism, which is set to 1, is restrictive in terms of how the rest of the pipeline is executed, i.e. can the follow-up .setParallelism(N) change how the DataSet is processed?
Finally, I did all these so that when partitioned_ratings is joined with a xlabels DataSet, the JOIN operation will be done with JoinOperatorBase.JoinHint.REPARTITION_SORT_MERGE. According to Flink docs for v.1.4.0:
REPARTITION_SORT_MERGE: The system partitions (shuffles) each input (unless the input is already partitioned) and sorts each input (unless it is already sorted). The inputs are joined by a streamed merge of the sorted inputs. This strategy is good if one or both of the inputs are already sorted.
So in my case, ratings is sorted (I think) and each of the xlabels DataSets is not, hence it makes sense that this is the most efficient strategy. Anything wrong with this? Any alternative approaches?
So far I haven't been able to make this strategy work. It seems that relying on JOINs is too troublesome, as they are expensive operations and one should avoid them unless they are really necessary.
For instance, JOINs should be used if both DataSets are very big in size. If they are not, a convenient alternative is broadcast variables, by which the smaller of the two DataSets is broadcast across workers for whatever purpose it is used. An example appears below (copied from this link for convenience):
DataSet<Point> points = env.readCsvFile(...); // simplified
DataSet<Centroid> centroids = ... ; // some computation
points.map(new RichMapFunction<Point, Integer>() {
private List<Centroid> centroids;
@Override
public void open(Configuration parameters) {
this.centroids = getRuntimeContext().getBroadcastVariable("centroids");
}
@Override
public Integer map(Point p) {
return selectCentroid(centroids, p);
}
}).withBroadcastSet(centroids, "centroids");
Also, since populating the fields of a POJO implies that quite similar code will be leveraged repeatedly, one should definitely use jlens to avoid code repetition and write a more concise and easy-to-follow solution.
I am migrating some map-reduce code into Spark, and having problems when constructing an Iterable to return in the function.
In the MR code, I had a reduce function that grouped by key and then (using multipleOutputs) would iterate the values and use write (to multiple outputs, but that's unimportant) with some code like this (simplified):
reduce(Key key, Iterable<Text> values) {
// ... some code
for (Text xml: values) {
multipleOutputs.write(key, xml, directory);
}
}
However, in Spark I have translated a map and this reduce into a sequence of:
mapToPair -> groupByKey -> flatMap
as recommended... in some book.
mapToPair basically adds a Key via functionMap, which, based on some values of the record, creates a Key for that record. Sometimes a key may have very high cardinality.
JavaPairRDD<Key, String> rddPaired = inputRDD.mapToPair(new PairFunction<String, Key, String>() {
public Tuple2<Key, String> call(String value) {
//...
return functionMap.call(value);
}
});
rddPaired then has RDD.groupByKey() applied to it to obtain the RDD that feeds the flatMap function:
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.groupByKey();
Once grouped, a flatMap call does the reduce. Here, operation is a transformation:
public Iterable<String> call (Tuple2<Key, Iterable<String>> keyValue) {
// some code...
List<String> out = new ArrayList<String>();
if (someConditionOnKey) {
// do a logic
Grouper grouper = new Grouper();
for (String xml : keyValue._2()) {
// group in a separate class
grouper.add(xml);
}
// operation is now performed on the whole group
out.add(operation(grouper));
} else {
for (String xml : keyValue._2()) {
out.add(operation(xml));
}
}
return out;
}
It works fine... with keys that don't have too many records. It actually breaks with an OutOfMemoryError when a key with a lot of values enters the "else" branch of the reduce.
Note: I have included the "if" part to explain the logic I want to produce, but the failure happens when entering the "else"... because when data enters the "else", it normally means there will be many more values for that key, due to the nature of the data.
It is clear that, having to keep all of the grouped values in the "out" list, it won't scale if a key has millions of records, because they are kept in memory. I have reached the point where the OOM happens (yes, it's when performing the "operation" above, which asks for memory and none is given; it's not a very expensive memory operation, though).
Is there any way to avoid this in order to scale? Either by replicating the behaviour with some other directives to reach the same output in a more scalable way, or by being able to hand Spark the values for merging (just as I used to do with MR)...
It's inefficient to do the condition inside the flatMap operation. You should check the condition outside, create two distinct RDDs, and deal with them separately.
rddPaired.cache();
// groupFilterFunc will filter which items need grouping
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.filter(groupFilterFunc).groupByKey();
// processGroupedValuesFunction should call `operation` on the group of all values with the same key and return the result
JavaPairRDD<Key, String> groupedResults = rddGrouped.mapValues(processGroupedValuesFunction);
// nogroupFilterFunc will filter which items don't need grouping
JavaPairRDD<Key, String> rddNoGrouped = rddPaired.filter(nogroupFilterFunc);
// processNoGroupedValuesFunction2 should call `operation` on a single value and return the result
JavaPairRDD<Key, String> noGroupedResults = rddNoGrouped.mapValues(processNoGroupedValuesFunction2);
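Assuming `operation` returns a String in both branches, the two results can then be combined and written out (outputPath is a placeholder):
JavaPairRDD<Key, String> allResults = groupedResults.union(noGroupedResults);
allResults.saveAsTextFile(outputPath);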