Make parallel API calls using Spring Boot/Java

I am new to spring boot and learning things as they come by. I have a quick question about making parallel API calls.
I have an array of ids; for each id I append it to a third-party API endpoint, make a GET request, and then aggregate the data into a file once all the calls have completed.
The catch here is that the array will be of size 3000, i.e. I am expected to make 3000 calls. I feel that iterating over it 3000 times with a plain for loop doesn't make sense and is inefficient.
Can anyone suggest the best and most efficient way to do this?
Thank you

You can do something like this:
List<Integer> ids = IntStream.range(1, 13).boxed().collect(Collectors.toList());
int partitionSize = 5;
for (int i = 0; i < ids.size(); i += partitionSize) {
    List<Integer> currPartition = ids.subList(i, Math.min(i + partitionSize, ids.size()));
    String partitionStr = currPartition.parallelStream()
            .map(id -> callAPI(id)) // change the aggregation according to your needs
            .collect(Collectors.joining(", "));
    System.out.println(partitionStr);
}
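If you want tighter control over the degree of parallelism than parallelStream() gives you (it runs on the shared common ForkJoinPool), a CompletableFuture variant with an explicit executor is another common option. This is only a sketch: callAPI is a stand-in for your real GET request (e.g. via RestTemplate or WebClient), and the pool size of 20 is an arbitrary example you should tune against the third party's rate limits.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelCalls {

    // Stand-in for the real GET request against the 3rd-party endpoint
    static String callAPI(int id) {
        return "response-" + id;
    }

    public static void main(String[] args) {
        List<Integer> ids = IntStream.rangeClosed(1, 3000).boxed().collect(Collectors.toList());

        // Bounded pool: caps how many requests are in flight at once
        ExecutorService pool = Executors.newFixedThreadPool(20);

        // Fire off all calls asynchronously
        List<CompletableFuture<String>> futures = ids.stream()
                .map(id -> CompletableFuture.supplyAsync(() -> callAPI(id), pool))
                .collect(Collectors.toList());

        // join() blocks until each call completes; results come back in id order
        List<String> results = futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());

        pool.shutdown();
        System.out.println("Aggregated " + results.size() + " responses");
        // write `results` to your file here
    }
}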

Related

Java streams with 2d Array

Good afternoon. I've recently learned about streams in my Java class and would like to expand my knowledge.
I am trying to stream through a 2D array and add the sum of the elements in the previous row to each element in the next row.
I also want to make this stream parallel, but this raises an issue: if the stream begins working on the second row before the first is finished, the data integrity will be questionable.
Is there any way to do this in Java?
If I understood you correctly, this piece of code does what you are asking for:
int[][] matrix = new int[][]{{1, 2, 3}, {3, 2, 1}, {1, 2, 3}};
BiConsumer<int[], int[]> intArraysConsumer = (ints, ints2) -> {
    for (int i = 0; i < ints.length; i++) {
        ints[i] = ints[i] + ints2[i];
    }
};
int[] collect = Arrays.stream(matrix)
        .collect(() -> new int[matrix[0].length], intArraysConsumer, intArraysConsumer);
System.out.println(Arrays.toString(collect));
This outputs: [5, 6, 7]
From what I understand of the Streams API, this collect(supplier, accumulator, combiner) overload works whether or not the stream runs in parallel; that's why you need to provide a supplier that creates an empty container, a way to accumulate elements into it, and a way to combine two of the partial accumulation containers. For this particular case the accumulate and combine operations are the same.
Please refer to: https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#collect-java.util.function.Supplier-java.util.function.BiConsumer-java.util.function.BiConsumer-
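Since you specifically wanted the stream to be parallel: the same collector works unchanged with a parallel stream, because each thread accumulates into its own array obtained from the supplier and the combiner merges the partial results. A minimal sketch, reusing intArraysConsumer from above:

int[] collect = Arrays.stream(matrix)
        .parallel() // each thread gets its own int[] from the supplier; the combiner merges them
        .collect(() -> new int[matrix[0].length], intArraysConsumer, intArraysConsumer);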

Is it better to use arrays or a queue when merging two sorted arrays?

I'm working on a programming practice site that asked to implement a method that merges two sorted arrays. This is my solution:
public static int[] merge(int[] arrLeft, int[] arrRight){
    int[] merged = new int[arrRight.length + arrLeft.length];
    Queue<Integer> leftQueue = new LinkedList<>();
    Queue<Integer> rightQueue = new LinkedList<>();
    for (int i = 0; i < arrLeft.length; i++) {
        leftQueue.add(arrLeft[i]);
    }
    for (int i = 0; i < arrRight.length; i++) {
        rightQueue.add(arrRight[i]);
    }
    int index = 0;
    while (!leftQueue.isEmpty() || !rightQueue.isEmpty()) {
        int largerLeft = leftQueue.isEmpty() ? Integer.MAX_VALUE : leftQueue.peek();
        int largerRight = rightQueue.isEmpty() ? Integer.MAX_VALUE : rightQueue.peek();
        if (largerLeft > largerRight) {
            merged[index] = largerRight;
            rightQueue.poll();
        } else {
            merged[index] = largerLeft;
            leftQueue.poll();
        }
        index++;
    }
    return merged;
}
But this is the official solution:
public static int[] merge(int[] arrLeft, int[] arrRight){
    // Grab the lengths of the left and right arrays
    int lenLeft = arrLeft.length;
    int lenRight = arrRight.length;
    // Create a new output array with the size = sum of the lengths of left and right arrays
    int[] arrMerged = new int[lenLeft + lenRight];
    // Maintain 3 indices, one for the left array, one for the right and one for the merged array
    int indLeft = 0, indRight = 0, indMerged = 0;
    // While neither array is empty, run a while loop to merge
    // the smaller of the two elements, starting at the leftmost position of both arrays
    while (indLeft < lenLeft && indRight < lenRight) {
        if (arrLeft[indLeft] < arrRight[indRight])
            arrMerged[indMerged++] = arrLeft[indLeft++];
        else
            arrMerged[indMerged++] = arrRight[indRight++];
    }
    // Another while loop for when the left array still has elements left
    while (indLeft < lenLeft) {
        arrMerged[indMerged++] = arrLeft[indLeft++];
    }
    // Another while loop for when the right array still has elements left
    while (indRight < lenRight) {
        arrMerged[indMerged++] = arrRight[indRight++];
    }
    return arrMerged;
}
Apparently, none of the other user solutions on the site made use of a queue either. I'm wondering: is using a queue less efficient? Could I be penalized for using a queue in an interview, for example?
As the question already states that the left and right input arrays are sorted, this gives you a hint that you should be able to solve the problem without requiring a data structure other than an array for the output.
In a real interview, it is likely that the interviewer will ask you to talk through your thought process while you are coding the solution. They may state that they want the solution implemented with certain constraints. It is very important to make sure that the problem is well defined before you start your coding. Ask as many questions as you can think of to constrain the problem as much as possible before starting.
When you are done implementing your solution, you could mention the time and space complexity of your implementation and suggest an alternative, more efficient solution.
For example, when describing your implementation you could talk about the following:
There is overhead when creating the queues
The big O notation / time and space complexity of your solution
You are unnecessarily iterating over every element of the left and right input arrays to create the queues before you do any merging
etc...
These types of interview questions are common when applying for positions at companies like Google, Microsoft, Amazon, and some tech startups. To prepare for such questions, I recommend you work through problems in books such as Cracking the Coding Interview. The book covers how to approach such problems, and the interview process for these kinds of companies.
Sorry to say, but your solution with queues is horrible.
You are copying all elements into auxiliary dynamic data structures (which can be very costly because of memory allocations), then back into the destination array.
A big "disadvantage" of merging is that it requires twice the storage space, as it cannot be done in-place (at least not the straightforward way). But you are making this much worse by adding extra copies and overhead, unnecessarily.
The true solution is to copy directly from source to destination, leading to simpler and much more efficient code.
Also note that using a sentinel value (Integer.MAX_VALUE) when one of the queues is exhausted looks like a good idea but isn't: it adds extra comparisons whose outcome you already know in advance. It is much better to split into three loops, as in the reference code.
Lastly, your solution can fail when the data happens to contain Integer.MAX_VALUE.
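To make that last point concrete, here are hypothetical driver calls against the queue-based merge above. When the left queue drains while the right queue's head is Integer.MAX_VALUE, the else branch keeps taking the sentinel from the empty left side, the real element is never polled, and the loop overruns the output array:

// Both calls end in ArrayIndexOutOfBoundsException with the queue-based merge
merge(new int[]{}, new int[]{Integer.MAX_VALUE});
merge(new int[]{1}, new int[]{2, Integer.MAX_VALUE});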

java multithread loop with collecting results

Sorry for the limited code; I have almost no idea how to do this, and parts of the code are not real code, just an explanation of what I need. The base is:
List<Double> resultTopTen = new ArrayList<>();
List<Double> conditions = new ArrayList<>(); // this list can be of a very large size, a million+; gets filled by different code
double loopResult = 0;
for (int i = 0; i < conditions.size(); i++) { // multithread this
    loopResult = conditions.get(i) + 5;
    if (resultTopTen.size() < 10) {
        resultTopTen.add(loopResult);
    } else {
        // this part I don't know: if this loopResult belongs to the top 10 loopResults so far,
        // replace the smallest one with it, so resultTopTen is up to date at this point of the loop
    }
}
The loopResult = conditions.get(i) + 5; part is just an example; the real calculation is different, and in fact the values are not even doubles, so it is not possible to simply sort conditions and go from there.
The for loop means I have to iterate through the input condition list and compute a result for every condition in it; the order doesn't matter at all.
The multithreading part is the thing I really have no idea how to do. Since the conditions list is very large, I would like to compute it in parallel; a plain single-threaded loop like the one above won't utilize my computing resources fully. The trick is how to split the conditions and then collect the results. For simplicity, with 2 threads I would split conditions in half, have one thread run the loop over the first half and another over the second, get two resultTopTen lists, and merge them afterwards. Even better would be to split the work across as many threads as system resources allow (for example, while CPU < 90% and RAM < 90%). Is that possible?
Use a parallel stream from Java 8.
static class TopN<T> {
    final TreeSet<T> max;
    final int size;

    TopN(int size, Comparator<T> comparator) {
        this.max = new TreeSet<>(comparator);
        this.size = size;
    }

    void add(T n) {
        max.add(n);
        if (max.size() > size)
            max.remove(max.last()); // drop the element that sorts last, keeping n items
    }

    void combine(TopN<T> o) {
        for (T e : o.max)
            add(e);
    }
}

public static void main(String[] args) {
    List<Double> conditions = new ArrayList<>();
    // add elements to conditions
    TopN<Double> maxN = conditions.parallelStream()
            .map(d -> d + 5) // some calculation
            .collect(() -> new TopN<Double>(10, (a, b) -> Double.compare(a, b)),
                     TopN::add, TopN::combine);
    System.out.println(maxN.max);
}
Class TopN holds the top n items of T.
As written (ascending comparator, removing last()), this code prints the 10 smallest results in conditions (after adding 5 to each element); reverse the comparator to keep the 10 largest instead.
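For example, to keep the 10 largest results, a quick sketch against the TopN class above (the input values are made up):

List<Double> conditions = Arrays.asList(9.0, 1.0, 7.0, 3.0, 5.0, 2.0, 8.0, 4.0, 6.0, 0.0, 10.0, 11.0);
TopN<Double> top10 = conditions.parallelStream()
        .map(d -> d + 5)
        .collect(() -> new TopN<Double>(10, Comparator.reverseOrder()),
                 TopN::add, TopN::combine);
System.out.println(top10.max); // prints the 10 largest results in descending order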
Let me simplify your question; from what I understand (please confirm or correct):
Requirement: you want to find the top 10 results from the list called conditions.
Procedure: you want multiple threads to run your logic and accumulate their results into a single top 10.
Please also share the logic you want to use to rank the top 10 elements, or whether it is simply the 10 largest values of the list in descending order.

Poor performance processing a pair RDD with very skewed data

I have a pair RDD with millions of key-value pairs, where every value is a list that may contain a single element or billions of elements. This leads to poor performance, since the large groups block the nodes of the cluster for hours while groups that would take a few seconds cannot be processed in parallel because the whole cluster is already busy.
Is there any way to improve this?
EDIT:
The operation that is giving me problems is a flatMap in which the whole list for a given key is analyzed. The key is not touched, and the operation compares every element in the list to the rest of the list, which takes a huge amount of time, but unfortunately it has to be done. This means that the WHOLE list needs to be in the same node at the same time. The resulting RDD will contain a sublist depending on a value calculated in the flatMap.
I cannot use broadcast variables in this scenario, as no common data is shared between the different key-value pairs. As for a partitioner, according to the O'Reilly Learning Spark book, this kind of operation will not benefit from one since no shuffle is involved (although I am not sure whether that is true). Can a partitioner help in this situation?
SECOND EDIT:
This is an example of my code:
public class MyFunction implements FlatMapFunction<Tuple2<String, Iterable<Bean>>, ProcessedBean> {

    public Iterable<ProcessedBean> call(Tuple2<String, Iterable<Bean>> input) throws Exception {
        List<ProcessedBean> output = new ArrayList<ProcessedBean>();
        List<Bean> listToProcess = CollectionsUtil.makeList(input._2());
        // In some cases size == 2, in others size > 100,000
        for (int i = 0; i < listToProcess.size() - 1; i++) {
            for (int j = i + 1; j < listToProcess.size(); j++) {
                ProcessedBean processed = processData(listToProcess.get(i), listToProcess.get(j));
                if (processed != null) {
                    output.add(processed);
                }
            }
        }
        return output;
    }
}
The double for will loop n(n-1)/2 times, but this cannot be avoided.
The order in which the keys get processed has no effect on the total computation time. The only issue from variance (some values are small, others are large) I can imagine is at the end of processing: one large task is still running while all other nodes are already finished.
If this is what you are seeing, you could try increasing the number of partitions. This would reduce the size of tasks, so a super large task at the end is less likely.
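A minimal sketch of that tweak (assuming the groups come from groupByKey on a JavaPairRDD; the name `pairs` and the partition count 1000 are illustrative):

// Ask for more, smaller partitions when grouping, so one huge
// task is less likely to be the last thing still running
JavaPairRDD<String, Iterable<Bean>> grouped = pairs.groupByKey(1000);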
Broadcast variables and partitioners will not help with the performance. I think you should focus on making the everything-to-everything comparison step as efficient as possible. (Or better yet, avoid it. I don't think quadratic algorithms are really sustainable in big data.)
If 'processData' is expensive, it's possible that you could parallelize that step and pick up some gains there.
In pseudo-code, it would be something like:
def processData(bean1: Bean, bean2: Bean): Option[ProcessedBean] = { ... }

val rdd: RDD[(Key, List[Bean])] = ...
val pairs: RDD[(Bean, Bean)] = rdd.flatMap { case (key, beans) =>
  val output = mutable.ListBuffer[(Bean, Bean)]()
  val len = beans.length
  for (i <- 0 until len - 1; j <- i + 1 until len) {
    output += ((beans(i), beans(j)))
  }
  output
}.repartition(someNumber)

val result: RDD[ProcessedBean] = pairs
  .map(beans => processData(beans._1, beans._2))
  .filter(_.isDefined)
  .map(_.get)
The flatMap step will still be bounded by your biggest list, and you'll incur a shuffle when you repartition, but moving the processData step outside of that N^2 step could gain you some parallelism.
Skew like this is often domain specific. You could create your value data as an RDD and join on it. Or you could try using broadcast variables. Or you could write a custom partitioner that might help split the data differently.
But, ultimately, it is going to depend on the computation and specifics of the data.
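If you do try a custom partitioner, the skeleton looks roughly like this (a sketch only: hotKeys is a hypothetical list of keys you have identified as oversized, numPartitions must exceed hotKeys.size(), and note this spreads whole groups across the cluster but cannot split a single huge group):

import java.util.List;
import org.apache.spark.Partitioner;

public class SkewAwarePartitioner extends Partitioner {
    private final int numPartitions;
    private final List<String> hotKeys; // keys known to carry huge value lists

    public SkewAwarePartitioner(int numPartitions, List<String> hotKeys) {
        this.numPartitions = numPartitions;
        this.hotKeys = hotKeys;
    }

    @Override
    public int numPartitions() {
        return numPartitions;
    }

    @Override
    public int getPartition(Object key) {
        int hot = hotKeys.indexOf(key);
        if (hot >= 0) {
            return hot; // each hot key gets a dedicated partition
        }
        // hash the remaining keys over the leftover partitions
        int rest = numPartitions - hotKeys.size();
        return hotKeys.size() + Math.floorMod(key.hashCode(), rest);
    }
}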

Java ArrayList looking for multiple strings

Let's say I have two identical strings inside an ArrayList... is there a way to check for that? Also, is there a way to count how many times the exact same string occurs in the ArrayList?
So let's say I have an ArrayList of strings, each built like this:
os.println(itemIDdropped + "|" + spawnX + "|" + spawnY + "|" + currentMap + "|drop|" + me.getUsername());
1. 1|3|5|1|drop|Dan
2. 2|5|7|2|drop|Luke
3. 1|3|5|2|drop|Dan
4. 3|3|5|1|drop|Sally
Here is what the numbers/letters mean in strings 1-4:
item ID, X pos, Y pos, map it's on, the drop command, and the user who dropped it
Then let's say I split it up doing this:
String[] itemGrnd = serverItems.get(i).split("\\|");
Now, let's say I have a for-loop like this one:
for (int i = 0; i < serverItems.size(); i++) {
    System.out.println(serverItems.get(i));
}
I want to find where X, Y, and Map (in this case itemGrnd[1], itemGrnd[2], and itemGrnd[3]) are the same in ANY other string in the ArrayList, found via serverItems.get(i).
And if we DO find any (in the example above, X, Y, and map are the same for strings 1 and 4), then run an if statement ONCE, because I don't want to do this:
(BTW, I do keep track of my variables client-side, so don't worry about that. Yes, I know they're named spawnX and spawnY; that's just the current X,Y.)
if (spawnX.equals(Integer.parseInt(itemGrnd[1])) &&
        spawnY.equals(Integer.parseInt(itemGrnd[2])) &&
        currentMap.equals(Integer.parseInt(itemGrnd[3]))) {
}
Now, if I were to do THIS, both strings 1 and 4 would be processed here. I only want to do it ONCE whenever we find multiple matching strings (like 1 and 4).
Thanks
Well, it's pretty simple to find out how many occurrences of a string an ArrayList holds:
public class ArrayListExample {
    private static final String TO_FIND = "cdt";

    public static void main(String[] args) {
        ArrayList<String> al = new ArrayList<String>();
        al.add("abc");
        al.add("dde");
        // 4 times
        al.add(TO_FIND);
        al.add(TO_FIND);
        al.add(TO_FIND);
        al.add(TO_FIND);
        ArrayList<String> al1 = (ArrayList<String>) al.clone();
        int count = 0;
        while (al1.contains(TO_FIND)) {
            al1.remove(TO_FIND);
            count++;
        }
        System.out.println(count);
    }
}
A faster way would be to sort the ArrayList and then count the occurrences of each distinct element in a single pass.
Denis_k's solution is O(n^2), while with sorting it is O(n log(n)).
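A sketch of that idea, reusing `al` from above: sort a copy, then count runs of equal neighbors in one pass.

List<String> sorted = new ArrayList<>(al);
Collections.sort(sorted);
// after sorting, equal strings sit next to each other, so each run is counted once
int runLength = 1;
for (int i = 1; i <= sorted.size(); i++) {
    if (i < sorted.size() && sorted.get(i).equals(sorted.get(i - 1))) {
        runLength++;
    } else {
        System.out.println(sorted.get(i - 1) + " occurs " + runLength + " time(s)");
        runLength = 1;
    }
}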
If you want that kind of behavior, you might want to check the Apache Commons Collections project, which extends the default Java collections with some useful interfaces, including the Bag interface (JavaDoc), which seems to be what you are looking for: a collection that knows how many elements of one kind it contains. The downside: there is no support for generics yet, so you are working with raw collections.
Or Google Guava, which is a more modern library and has a similar concept, called Multiset, which is probably more user-friendly. Guava, however, does not yet have a release version, although it is already widely used in production.
Oh, and there's also a standard JavaSE function: Collections.frequency(Collection, Object), although it supposedly performs badly.
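And for the asker's actual goal of matching on the X, Y, and map fields rather than whole strings, here is a plain-JDK sketch (serverItems and the field layout are taken from the question) that groups entries by a composite key, so each duplicate location is handled exactly once:

Map<String, List<String>> byLocation = new HashMap<>();
for (String item : serverItems) {
    String[] f = item.split("\\|");
    String key = f[1] + "|" + f[2] + "|" + f[3]; // X|Y|Map
    byLocation.computeIfAbsent(key, k -> new ArrayList<>()).add(item);
}
for (Map.Entry<String, List<String>> e : byLocation.entrySet()) {
    if (e.getValue().size() > 1) {
        // strings 1 and 4 from the example land here together;
        // handle the duplicate location exactly once
    }
}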
