Good afternoon. I've recently learned about streams in my Java class and would like to expand my knowledge.
I am trying to stream through a 2d array, and add the sum of the elements in the previous row to each element in the next row.
I also want to make this stream parallel, but this brings up an issue: if the stream begins working on the second row before the first is finished, the data integrity will be questionable.
Is there any way to do this in Java?
If I understood you correctly, this piece of code does what you are asking for:
int[][] matrix = new int[][]{{1, 2, 3}, {3, 2, 1}, {1, 2, 3}};
BiConsumer<int[], int[]> intArraysConsumer = (ints, ints2) -> {
    for (int i = 0; i < ints.length; i++) {
        ints[i] = ints[i] + ints2[i];
    }
};
int[] collect = Arrays.stream(matrix)
        .collect(() -> new int[matrix[0].length], intArraysConsumer, intArraysConsumer);
System.out.println(Arrays.toString(collect));
This outputs: [5, 6, 7]
From what I understand of the Streams API, this form of collect is what lets the stream accumulate safely in parallel (note that a stream only actually runs in parallel if you make it a parallel stream); that's why you need to provide a supplier that creates an empty result object, a way to accumulate elements into it, and a way to combine two partial accumulation objects. For this particular case the accumulate and combine operations are the same.
Please refer to: https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#collect-java.util.function.Supplier-java.util.function.BiConsumer-java.util.function.BiConsumer-
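If you want to force parallel execution explicitly, the same supplier/accumulator/combiner works on a parallel stream. Here is a minimal, self-contained sketch of the example above (the class name is just for illustration):
import java.util.Arrays;
import java.util.function.BiConsumer;

public class ParallelColumnSum {
    public static void main(String[] args) {
        int[][] matrix = new int[][]{{1, 2, 3}, {3, 2, 1}, {1, 2, 3}};
        // Element-wise addition, used both to accumulate a row into the result
        // and to combine two partial results.
        BiConsumer<int[], int[]> intArraysConsumer = (ints, ints2) -> {
            for (int i = 0; i < ints.length; i++) {
                ints[i] += ints2[i];
            }
        };
        int[] collect = Arrays.stream(matrix)
                .parallel() // explicitly request parallel execution
                .collect(() -> new int[matrix[0].length], intArraysConsumer, intArraysConsumer);
        System.out.println(Arrays.toString(collect)); // [5, 6, 7]
    }
}
Each worker thread gets its own int[] from the supplier and the partial arrays are merged by the combiner, so the result is the same regardless of how the rows are split.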
I am new to spring boot and learning things as they come by. I have a quick question about making parallel API calls.
I have an array of ids that I will append to a 3rd-party API endpoint to make GET requests; once all 3000 calls complete, I aggregate the data and write it out to a file.
The catch is that the array will be of size 3000, i.e. I am expected to make 3000 calls. I feel that using a for loop and iterating 3000 times doesn't make sense and is inefficient.
Can anyone please suggest the most efficient way to do this?
Thank you
You can do something like this:
List<Integer> ids = IntStream.range(1, 13).boxed().collect(Collectors.toList());
int partitionSize = 5;

for (int i = 0; i < ids.size(); i += partitionSize) {
    List<Integer> currPartition = ids.subList(i, Math.min(i + partitionSize, ids.size()));
    String partitionStr = currPartition.parallelStream()
            .map(id -> callAPI(id)) // change the aggregation according to your needs
            .collect(Collectors.joining(", "));
    System.out.println(partitionStr);
}
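callAPI is left to you. A minimal sketch of what it might look like, assuming Spring's RestTemplate; the class name, the endpoint URL, and the String response type are placeholders, not part of the code above:
import org.springframework.web.client.RestTemplate;

public class ApiClient {
    // Reuse one RestTemplate instance; creating a new one per call is wasteful.
    private static final RestTemplate restTemplate = new RestTemplate();

    static String callAPI(int id) {
        // GET a hypothetical endpoint and return the response body as a String
        return restTemplate.getForObject("https://example.com/api/items/{id}", String.class, id);
    }
}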
I'm currently working on a project where I need to work with a .csv file that is around 3 million lines long and different .xlsx files that range in size between 10 lines and over 1000 lines. I'm trying to find commonalities between different cells in my .xlsx file and my .csv file.
To do this, I've read in my .csv file and .xlsx file and stored both in ArrayLists.
I have what I want working, but the method I'm using is O(n^3): three nested for loops that search between each.
//This is our .xlsx file stored in an ArrayList
for (int i = 1; i < finalKnowledgeGraph.size(); i += 3) {
    //loop through our knowledgeGraph again
    for (int j = 1; j < finalKnowledgeGraph.size(); j += 3) {
        //loop through .csv file which is stored in an ArrayList
        for (int k = 1; k + 1 < storeAsserions.size(); k++) {
            if (finalKnowledgeGraph.get(i).equals(storeAsserions.get(k)) && finalKnowledgeGraph.get(j + 1).equals(storeAsserions.get(k + 1))) {
                System.out.println("Do Something");
            } else if (finalKnowledgeGraph.get(i + 1).equals(storeAsserions.get(k)) && finalKnowledgeGraph.get(j).equals(storeAsserions.get(k + 1))) {
                System.out.println("Do something else");
            }
        }
    }
}
At the moment in my actual code, my System.out.println("Do something") is just writing specific parts of each file to a new .csv file.
Now, with what I'm doing out of the way, my problem is optimization. Obviously, if I'm running three nested for loops over millions of inputs, it won't finish in my lifetime, so I'm wondering how I can optimize the code.
One of my friends suggested storing the files in memory so reads/writes will be several times quicker. Another friend suggested storing the files in hash tables instead of ArrayLists to help speed up the process, but since I'm essentially searching EVERY element in said hash table, I don't see how that's going to speed things up; it just seems like it transfers the searching from one data structure to another. However, I said I'd also post the question here and see if people had any tips/suggestions on how I'd go about optimizing this code. Thanks.
Note: I have literally no knowledge of optimization myself, and I found other questions on S/O too specific for my knowledge of the field, so if this seems like a duplicate, I've probably already seen the question you're talking about and couldn't understand the content.
Edit: Everything stored in both ArrayLists is verb:noun:noun pairs where I'm trying to compare nouns between each ArrayList. Since I'm not concerned with verbs, I start searching at index 1. (Just for some context)
One possible solution would be using a database, which -- given a proper index -- could do the search pretty fast. Assuming the data fit in memory, you can be even faster.
The principle
For problems like
for (X x : xList) {
    for (Y y : yList) {
        if (x.someAttr() == y.someAttr()) doSomething(x, y);
    }
}
you simply partition one list into buckets according to the attribute like
Map<A, List<Y>> yBuckets = new HashMap<>();
yList.forEach(y -> yBuckets.compute(y.someAttr(), (k, v) -> {
    List<Y> list = (v == null) ? new ArrayList<>() : v;
    list.add(y);
    return list;
}));
Now, you iterate the other list and only look at the elements in the proper bucket like
for (X x : xList) {
    List<Y> smallList = yBuckets.get(x.someAttr());
    if (smallList != null) {
        for (Y y : smallList) {
            if (x.someAttr() == y.someAttr()) doSomething(x, y);
        }
    }
}
The comparison can actually be left out, as it's always true, but that's not the point. The speed comes from never looking at cases where equals would return false.
The complexity gets reduced from quadratic to linear plus the number of calls to doSomething.
Your case
Your data structure obviously does not fit. You're flattening your triplets into one list and this is wrong. You surely can work around it somehow, but creating a class Triplet {String verb, noun1, noun2} makes everything simpler. For storeAsserions, it looks like you're working with pairs. They seem to overlap, but that may be a typo, anyway it doesn't matter. Let's use Triplets and Pairs.
Let me also rename your lists, so that the code fits better in this tiny window:
for (Triplet x : fList) {
    for (Triplet y : fList) {
        for (Pair z : sList) {
            if (x.noun1.equals(z.noun1) && y.noun2.equals(z.noun2)) {
                doSomething();
            } else if (x.noun2.equals(z.noun1) && y.noun1.equals(z.noun2)) {
                doSomethingElse();
            }
        }
    }
}
Now, we need some loops over buckets, so that at least one of the equals tests is always true, which saves us from dealing with non-matching data. Let's concentrate on the first condition
x.noun1.equals(z.noun1) && y.noun2.equals(z.noun2)
I suggest a loop like
for (Pair z : sList) {
    for (Triplet x : smallListOfTripletsHavingNoun1SameAsZ) {
        for (Triplet y : smallListOfTripletsHavingNoun2SameAsZ) {
            doSomething();
        }
    }
}
where the small lists get computed like in the first section.
No non-matching entries ever get compared, so the complexity gets reduced from cubic to the number of matches (= the number of lines your code would print).
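Concretely, the small lists could be built as one bucket map per noun position. A minimal sketch, assuming the Triplet and Pair classes above and the usual java.util imports (the map and field names are mine):
Map<String, List<Triplet>> byNoun1 = new HashMap<>();
Map<String, List<Triplet>> byNoun2 = new HashMap<>();
for (Triplet t : fList) {
    byNoun1.computeIfAbsent(t.noun1, k -> new ArrayList<>()).add(t);
    byNoun2.computeIfAbsent(t.noun2, k -> new ArrayList<>()).add(t);
}

for (Pair z : sList) {
    // Only triplets that can satisfy the first condition are ever touched.
    for (Triplet x : byNoun1.getOrDefault(z.noun1, Collections.emptyList())) {
        for (Triplet y : byNoun2.getOrDefault(z.noun2, Collections.emptyList())) {
            doSomething();
        }
    }
}
The second condition would get its own pair of bucket lookups in the same way.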
Addendum - yBuckets
Let's assume xList looks like
[
    {id: 1, someAttr: "a"},
    {id: 2, someAttr: "a"},
    {id: 3, someAttr: "b"},
]
Then yBuckets should be
{
    "a": [
        {id: 1, someAttr: "a"},
        {id: 2, someAttr: "a"},
    ],
    "b": [
        {id: 3, someAttr: "b"},
    ],
}
One simple way to create such a Map is
yList.forEach(y -> yBuckets.compute(y.someAttr(), (k, v) -> {
    List<Y> list = (v == null) ? new ArrayList<>() : v;
    list.add(y);
    return list;
}));
In plaintext:
For each y from yList,
get a corresponding map entry in the form of (k, v),
when v is null, then create a new List
otherwise work with the List v
In any case, add y to it
and store it back to the Map (which is a no-op unless a new List was created in the third step).
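A more compact alternative (not the code above, but standard Map API) does the same thing with computeIfAbsent:
yList.forEach(y -> yBuckets.computeIfAbsent(y.someAttr(), k -> new ArrayList<>()).add(y));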
I'm working on a programming practice site that asked me to implement a method that merges two sorted arrays. This is my solution:
public static int[] merge(int[] arrLeft, int[] arrRight) {
    int[] merged = new int[arrRight.length + arrLeft.length];
    Queue<Integer> leftQueue = new LinkedList<>();
    Queue<Integer> rightQueue = new LinkedList<>();
    for (int i = 0; i < arrLeft.length; i++) {
        leftQueue.add(arrLeft[i]);
    }
    for (int i = 0; i < arrRight.length; i++) {
        rightQueue.add(arrRight[i]);
    }
    int index = 0;
    while (!leftQueue.isEmpty() || !rightQueue.isEmpty()) {
        int largerLeft = leftQueue.isEmpty() ? Integer.MAX_VALUE : leftQueue.peek();
        int largerRight = rightQueue.isEmpty() ? Integer.MAX_VALUE : rightQueue.peek();
        if (largerLeft > largerRight) {
            merged[index] = largerRight;
            rightQueue.poll();
        } else {
            merged[index] = largerLeft;
            leftQueue.poll();
        }
        index++;
    }
    return merged;
}
But this is the official solution:
public static int[] merge(int[] arrLeft, int[] arrRight) {
    // Grab the lengths of the left and right arrays
    int lenLeft = arrLeft.length;
    int lenRight = arrRight.length;
    // Create a new output array with the size = sum of the lengths of left and right
    // arrays
    int[] arrMerged = new int[lenLeft + lenRight];
    // Maintain 3 indices, one for the left array, one for the right and one for
    // the merged array
    int indLeft = 0, indRight = 0, indMerged = 0;
    // While neither array is empty, run a while loop to merge
    // the smaller of the two elements, starting at the leftmost position of
    // both arrays
    while (indLeft < lenLeft && indRight < lenRight) {
        if (arrLeft[indLeft] < arrRight[indRight])
            arrMerged[indMerged++] = arrLeft[indLeft++];
        else
            arrMerged[indMerged++] = arrRight[indRight++];
    }
    // Another while loop for when the left array still has elements left
    while (indLeft < lenLeft) {
        arrMerged[indMerged++] = arrLeft[indLeft++];
    }
    // Another while loop for when the right array still has elements left
    while (indRight < lenRight) {
        arrMerged[indMerged++] = arrRight[indRight++];
    }
    return arrMerged;
}
Apparently, all the other solutions by users on the site did not make use of a queue as well. I'm wondering if using a Queue is less efficient? Could I be penalized for using a queue in an interview for example?
As the question already states that the left and right input arrays are sorted, this gives you a hint that you should be able to solve the problem without requiring a data structure other than an array for the output.
In a real interview, it is likely that the interviewer will ask you to talk through your thought process while you are coding the solution. They may state that they want the solution implemented with certain constraints. It is very important to make sure that the problem is well defined before you start your coding. Ask as many questions as you can think of to constrain the problem as much as possible before starting.
When you are done implementing your solution, you could mention the time and space complexity of your implementation and suggest an alternative, more efficient solution.
For example, when describing your implementation you could talk about the following:
There is overhead when creating the queues
The big O notation / time and space complexity of your solution
You are unnecessarily iterating over every element of the left and right input array to create the queues before you do any merging
etc...
These types of interview questions are common when applying for positions at companies like Google, Microsoft, Amazon, and some tech startups. To prepare for such questions, I recommend you work through problems in books such as Cracking the Coding Interview. The book covers how to approach such problems, and the interview process for these kinds of companies.
Sorry to say but your solution with queues is horrible.
You are copying all elements to auxiliary dynamic data structures (which can be highly costly because of memory allocations), then back to the destination array.
A big "disadvantage" of merging is that it requires twice the storage space as it cannot be done in-place (or at least no the straightforward way). But you are spoiling things to a much larger extent by adding extra copies and overhead, unnecessarily.
The true solution is to copy directly from source to destination, leading to simpler and much more efficient code.
Also note that using a sentinel value (Integer.MAX_VALUE) when one of the queues is exhausted only looks like a good idea: it adds extra comparisons when you already know the outcome. It is much better to split into three loops as in the reference code.
Lastly, your solution can fail when the data happens to contain Integer.MAX_VALUE (for example, if the left queue empties while the right array still holds Integer.MAX_VALUE, the tie goes to the left branch, which polls the already-empty queue and lets the loop overrun the output array).
Sorry for the limited code; I have pretty much no idea how to do it, and parts of the code are not code, just an explanation of what I need. The base is:
ArrayList<Double> resultTopTen = new ArrayList<Double>();
ArrayList<Double> conditions = new ArrayList<Double>(); // this ArrayList can be very large (a million+ elements); it gets filled by different code
double loopResult = 0;

for (int i = 0; i < conditions.size(); i++) { // multithread this
    loopResult = conditions.get(i) + 5;
    if (resultTopTen.size() < 10) {
        resultTopTen.add(loopResult);
    } else {
        // this part I don't know: if this loopResult belongs to the top 10 loopResults so far,
        // just by size, replace the smallest one with the current one, so that I have an
        // updated resultTopTen at this point of the loop.
    }
}
The loopResult = conditions.get(i) + 5; part is just an example; the real calculation is different, and in fact the value is not even a double, so it is not possible to simply sort conditions and go from there.
The for (int i = 0; i < conditions.size(); i++) part means I have to iterate through the input condition list and execute the calculation to get a result for every condition in the list. It does not have to be in order at all.
The multithreading part is the thing I really have no idea how to do, but since the conditions ArrayList is really large, I would like to compute it in parallel; if I run it as a simple single-threaded loop as in the code above, I won't utilize my computing resources fully. The trick is how to split the conditions and then collect the results. For simplicity, if I wanted to do it in two threads, I would split conditions in half, have one thread run the loop over the first half and the other over the second half, and get two resultTopTen lists that I could merge afterwards. But it would be much better to split the work across as many threads as system resources allow (for example, until CPU utilization < 90% and RAM < 90%). Is that possible?
Use a Java 8 parallel stream.
static class TopN<T> {
    final TreeSet<T> max;
    final int size;

    TopN(int size, Comparator<T> comparator) {
        this.max = new TreeSet<>(comparator);
        this.size = size;
    }

    void add(T n) {
        max.add(n);
        if (max.size() > size)
            max.remove(max.last());
    }

    void combine(TopN<T> o) {
        for (T e : o.max)
            add(e);
    }
}

public static void main(String[] args) {
    List<Double> conditions = new ArrayList<>();
    // add elements to conditions
    TopN<Double> maxN = conditions.parallelStream()
            .map(d -> d + 5) // some calculation
            .collect(() -> new TopN<Double>(10, (a, b) -> Double.compare(a, b)),
                     TopN::add, TopN::combine);
    System.out.println(maxN.max);
}
The class TopN holds the top n items of type T.
This code prints the 10 smallest values in conditions (after adding 5 to each element).
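One caveat of this approach is that a TreeSet discards duplicate values, so ties beyond the first occurrence are dropped. A possible variant (my own sketch, not the code above; assumes java.util.PriorityQueue and java.util.Comparator) keeps duplicates by using a max-heap:
static class TopNQueue {
    // Max-heap: the head is the largest of the kept values, so it is the one to evict.
    final PriorityQueue<Double> heap = new PriorityQueue<>(Comparator.reverseOrder());
    final int size;

    TopNQueue(int size) { this.size = size; }

    void add(double d) {
        heap.add(d);
        if (heap.size() > size)
            heap.poll(); // drop the current largest, keeping the smallest `size` values
    }

    void combine(TopNQueue o) {
        for (double d : o.heap) add(d);
    }
}
It plugs into the same three-argument collect call: .collect(() -> new TopNQueue(10), TopNQueue::add, TopNQueue::combine).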
Let me simplify your question based on what I understand; please confirm or add to it:
Requirement: You want to find the top 10 results from the list called conditions.
Procedure: You want multiple threads to run your logic for finding the top 10 results and to accumulate them into a final top 10.
Please also share the logic you want to implement to get the top 10 elements, or whether it is simply the list sorted in descending order and its first 10 elements.
I have a pair RDD with millions of key-value pairs, where every value is a list which may contain a single element or billions of elements. This leads to poor performance, since the large groups block the nodes of the cluster for hours, while groups that would take a few seconds cannot be processed in parallel because the whole cluster is already busy.
Is there any way to improve this?
EDIT:
The operation that is giving me problems is a flatMap where the whole list for a given key is analyzed. The key is not touched, and the operation compares every element in the list to the rest of the list, which takes a huge amount of time but unfortunately it has to be done. This means that the WHOLE list needs to be in the same node at the same time. The resulting RDD will contain a sublist depending on a value calculated in the flatMap.
I cannot use broadcast variables in this scenario, as no common data will be used between the different key-value pairs. As for a partitioner, according to the O'Reilly Learning Spark book, this kind of operation will not benefit from a partitioner since no shuffle is involved (although I am not sure if this is true). Can a partitioner help in this situation?
SECOND EDIT:
This is an example of my code:
public class MyFunction implements FlatMapFunction<Tuple2<String, Iterable<Bean>>, ProcessedBean> {

    public Iterable<ProcessedBean> call(Tuple2<String, Iterable<Bean>> input) throws Exception {
        List<ProcessedBean> output = new ArrayList<ProcessedBean>();
        List<Bean> listToProcess = CollectionsUtil.makeList(input._2());

        // In some cases size == 2, in others size > 100.000
        for (int i = 0; i < listToProcess.size() - 1; i++) {
            for (int j = i + 1; j < listToProcess.size(); j++) {
                ProcessedBean processed = processData(listToProcess.get(i), listToProcess.get(j));
                if (processed != null) {
                    output.add(processed);
                }
            }
        }
        return output;
    }
}
The double for loop will run n(n-1)/2 times, but this cannot be avoided.
The order in which the keys get processed has no effect on the total computation time. The only issue from variance (some values are small, others are large) I can imagine is at the end of processing: one large task is still running while all other nodes are already finished.
If this is what you are seeing, you could try increasing the number of partitions. This would reduce the size of tasks, so a super large task at the end is less likely.
Broadcast variables and partitioners will not help with the performance. I think you should focus on making the everything-to-everything comparison step as efficient as possible. (Or better yet, avoid it. I don't think quadratic algorithms are really sustainable in big data.)
If 'processData' is expensive, it's possible that you could parallelize that step and pick up some gains there.
In pseudo-code, it would be something like:
def processData(bean1: Bean, bean2: Bean): Option[ProcessedBean] = { ... }

val rdd: RDD[(Key, List[Bean])] = ...

val pairs: RDD[(Bean, Bean)] = rdd.flatMap { case (key, beans) =>
    val output = mutable.ListBuffer[(Bean, Bean)]()
    val len = beans.length
    for (i <- 0 until len - 1) {
        for (j <- i + 1 until len) {
            output += ((beans(i), beans(j)))
        }
    }
    output
}.repartition(someNumber)

val result: RDD[ProcessedBean] = pairs
    .map(beans => processData(beans._1, beans._2))
    .filter(_.isDefined)
    .map(_.get)
The flatMap step will still be bounded by your biggest list, and you'll incur a shuffle when you repartition, but moving the processData step outside of that N^2 step could gain you some parallelism.
Skew like this is often domain specific. You could create your value data as an RDD and join on it. Or you could try using broadcast variables. Or you could write a custom partitioner that might help split the data differently.
But, ultimately, it is going to depend on the computation and specifics of the data.
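For the custom-partitioner idea, a minimal Java sketch (the class name and hashing scheme are mine, and whether it helps depends entirely on your data; note that a partitioner assigns whole keys to partitions, so it cannot split the values of a single heavy key):
import org.apache.spark.Partitioner;

// Spreads keys across partitions by hash; a real skew-aware partitioner would
// route known heavy keys to dedicated partitions instead.
public class SkewAwarePartitioner extends Partitioner {
    private final int numPartitions;

    public SkewAwarePartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    @Override
    public int numPartitions() {
        return numPartitions;
    }

    @Override
    public int getPartition(Object key) {
        // Non-negative modulo of the key's hash code
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
It would be applied with something like pairRdd.partitionBy(new SkewAwarePartitioner(n)) before the expensive step.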