I have a problem with Java 8 streams where the data is processed in sudden bulks rather than as it is requested. I have a rather complex stream flow which has to be parallelised because I use concat to merge two streams.
My issue stems from the fact that data seems to be processed in large bulks, minutes - and sometimes even hours - apart. I would expect this processing to happen as soon as the Stream reads incoming data, to spread the workload. Bulk processing seems counterintuitive in almost every way.
So, the question is why this bulk-collection occurs and how I can avoid it.
My input is a Spliterator of unknown size and I use a forEach as the terminal operation.
It’s a fundamental principle of parallel streams that the encounter order doesn’t have to match the processing order. This enables concurrent processing of items of sublists or subtrees while assembling a correctly ordered result, if necessary. This explicitly allows bulk processing and even makes it mandatory for the parallel processing of ordered streams.
This behavior is determined by the particular Spliterator's trySplit implementation. The specification says:
If this Spliterator is ORDERED, the returned Spliterator must cover a strict prefix of the elements
…
API Note:
An ideal trySplit method efficiently (without traversal) divides its elements exactly in half, allowing balanced parallel computation.
Why was this strategy fixed in the specification and not, e.g. an even/odd split?
Well, consider a simple use case. A list will be filtered and collected into a new list, so the encounter order must be retained. With the prefix rule, it's rather easy to implement: split off a prefix, filter both chunks concurrently, and afterwards add the result of the prefix filtering to the new list, followed by the filtered suffix.
With an even/odd strategy, that's impossible. You may filter both parts concurrently, but afterwards you don't know how to join the results correctly unless you track each item's position throughout the entire operation.
Even then, joining these interleaved items would be much more complicated than performing an addAll per chunk.
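To make that concrete, here is a minimal sketch (prefixChunk and suffixChunk are made-up stand-ins for the two halves a trySplit might have produced; the join is shown sequentially for brevity — the point is that the two chunks can be filtered independently and joined by simple concatenation):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical chunks standing in for the prefix/suffix of a split.
List<Integer> prefixChunk = Arrays.asList(1, 2, 3, 4);
List<Integer> suffixChunk = Arrays.asList(5, 6, 7, 8);

// With the prefix rule, the ordered result is simply the filtered prefix
// followed by the filtered suffix; no per-item position tracking is needed.
List<Integer> result = Stream.concat(
        prefixChunk.stream().filter(n -> n % 2 == 0),
        suffixChunk.stream().filter(n -> n % 2 == 0))
    .collect(Collectors.toList());
// result: [2, 4, 6, 8]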
You might have noticed that this all applies only if you have an encounter order that might have to be retained. If your spliterator doesn't report an ORDERED characteristic, it is not required to return a prefix. Nevertheless, the default implementation you might have inherited from AbstractSpliterator is designed to be compatible with ordered spliterators. Thus, if you want a different strategy, you have to implement the split operation yourself.
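For instance, a hand-rolled unordered source might look like the following sketch (the class and its contents are illustrative, not from the original post):

import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.StreamSupport;

// A spliterator that deliberately does not report ORDERED, leaving the
// framework free to process chunks without preserving encounter order.
class UnorderedSource extends Spliterators.AbstractSpliterator<String> {
    private int remaining = 100;

    UnorderedSource() {
        super(Long.MAX_VALUE, 0); // unknown size, no ORDERED characteristic
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        if (remaining-- <= 0) {
            return false;
        }
        action.accept("item-" + remaining);
        return true;
    }

    public static void main(String[] args) {
        // The inherited trySplit still splits off batches, but since the
        // stream is unordered, no particular result order must be preserved.
        StreamSupport.stream(new UnorderedSource(), true)
                .forEach(System.out::println);
    }
}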
Or you use a different way of implementing an unordered stream, e.g.
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;
import java.util.stream.Stream;

Stream.generate(() -> {
    LockSupport.parkNanos(TimeUnit.SECONDS.toNanos(1));
    return Thread.currentThread().getName();
}).parallel().forEach(System.out::println);
might be closer to what you expected.
Suppose you have a lazily-populated Iterator which you would like to convert to a Stream. It can be assumed that the number of elements in the Iterator is finite.
It is possible to instruct the Iterator to skip elements (via a skip method on the implementation) to avoid unnecessary element generation, but all elements of the Iterator must have been either skipped or generated once a terminal operation has been called on the Stream.
I am aware of:
StreamSupport.stream(
Spliterators.spliteratorUnknownSize(
theIterator, Spliterator.ORDERED), false);
for generating a Stream from an Iterator; however, this does not provide a means to skip elements or to ensure consumption of all of them.
Are there any basic Stream implementations which would allow for this via reasonably simple extension?
Delegation seems messy given the sheer number of methods on the Stream interface, with no promise as to which will call which others as part of their internal implementation. All the JDK implementations one could use as a starting point (at least in the JDK I'm using) are package-private, and I assume there's good reason for that (e.g. they don't want me extending them...).
Well, I took a swing at doing this. The idea was to create a TypeAdapter for Streams when reading JSON arrays through Gson, without having to convert every element:
https://github.com/BeUndead/gson-stream
From my quick round of testing it does everything I wanted it to, but there are almost certainly issues I didn't come across while quickly testing it.
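Independent of the Gson specifics, the general shape of such a wrapper might look like this sketch; theIterator and its assumed skip capability come from the question, and the drain only runs if the caller actually closes the stream (e.g. via try-with-resources):

import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Wrap the lazily-populated iterator so that closing the stream consumes
// (or, on the question's implementation, skips) whatever was not processed.
static <T> Stream<T> toDrainingStream(Iterator<T> theIterator) {
    return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(theIterator, Spliterator.ORDERED),
            false)
        .onClose(() -> {
            while (theIterator.hasNext()) {
                theIterator.next(); // a skip() call could avoid full generation
            }
        });
}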
As far as I'm aware, in parallel streams, methods such as findFirst, skip, limit, etc. keep their behaviour as long as the stream is ordered (which it is by default), whether it's parallel or not. So I was wondering why the forEach method is different. I gave it some thought, but I just could not understand the necessity of defining a forEachOrdered method, when it could have been easier and less surprising to make forEach ordered by default; then one could simply call unordered on the stream instance, with no need to define a new method.
Unfortunately my practical experience with Java 8 is quite limited at this point, so I would really appreciate it if someone could explain the reasons for this architectural decision to me, maybe with some simple examples/use-cases showing what could go wrong otherwise.
Just to make it clear, I'm not asking about this: forEach vs forEachOrdered in Java 8 Stream. I'm perfectly aware of how those methods work and the differences between them. What I'm asking about is the practical reasons for the architectural decision made by Oracle.
Defining a forEach that would preserve order and an unordered that would break it would complicate things, IMO; simply because unordered does nothing more than set a flag in the stream API internals, and the flag check would have to be performed or enforced based on some conditions.
So let's say you would do:
someStream()
    .unordered()
    .forEach(System.out::println);
In this case, your proposal is to not print elements in any order, thus enforcing unordered here. But what if we did:
someSet().stream()
    .unordered()
    .forEach(System.out::println);
In this case, would you want unordered to be enforced? After all, the source of the stream is a Set, which has no order, so enforcing unordered here is just useless; but honouring that would mean additional tests on the source of the stream internally. This can get quite tricky and complicated (as it already is, btw).
To make it simpler, two methods were defined that clearly stipulate what they will do; this is on par with, for example, findFirst vs findAny, or Optional::isPresent and Optional::isEmpty (added in java-11).
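A quick illustration of the resulting contract (the forEach output order varies from run to run):

import java.util.stream.IntStream;

// On a parallel stream, forEach may process elements in any order,
// while forEachOrdered respects the encounter order.
IntStream.rangeClosed(1, 5).parallel()
        .forEach(System.out::println);        // e.g. 3 5 1 2 4
IntStream.rangeClosed(1, 5).parallel()
        .forEachOrdered(System.out::println); // always 1 2 3 4 5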
When you process elements of a Stream in parallel, you simply should not expect any guarantees on order.
The whole idea is that multiple threads work on different elements of that stream. They progress individually, therefore the order of processing is not predictable. It is nondeterministic and, in effect, appears random.
I could imagine that the people implementing that interface purposely give you random order, to make it really clear that you shall not expect any distinct order when using parallel streams.
Methods such as findFirst, limit and skip require the order of the input, so their behaviour doesn't change whether we use a parallel or a serial stream. However, forEach as a method does not need any order, and thus its behaviour is different.
For parallel stream pipelines, forEach operation does not guarantee to respect the encounter order of the stream, as doing so would sacrifice the benefit of parallelism.
I would also suggest not using findFirst, limit and skip with parallel streams, as that would reduce performance because of the overhead required to order parallel streams.
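For illustration, a small made-up example of the ordering guarantee and its cheaper, order-free alternative:

import java.util.Arrays;
import java.util.Optional;

// Even in parallel, findFirst honours encounter order (at a coordination
// cost), whereas findAny may return whichever matching element is found.
Optional<Integer> first = Arrays.asList(1, 2, 3, 4, 5).parallelStream()
        .filter(n -> n > 2)
        .findFirst(); // always Optional[3]
Optional<Integer> any = Arrays.asList(1, 2, 3, 4, 5).parallelStream()
        .filter(n -> n > 2)
        .findAny();   // any of 3, 4 or 5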
I have a set of private methods that are used in a main public method (one that receives a list) of a class. These methods mainly use classic Java 8 stream operations such as filter, map, count, etc. I am wondering whether creating the stream a single time in the public API and passing it to the other methods, instead of passing the list, has any performance benefits or considerations, given that .stream() is then called a single time.
Calling stream() or any intermediate operation by itself actually does nothing, as streams are driven by the terminal operation.
So passing a Stream internally from one method to another is not bad IMO, and might make the code cleaner. But don't return a Stream externally from your public methods; return a List instead (please read the supplied comments, this might not hold for all cases).
Also think of the case of applying filter, for example, then collecting with toList, and then streaming that filtered List again only to map later; that is obviously a bad choice. You are collecting too soon, so don't chain methods like this even internally.
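A sketch of that anti-pattern next to the fused pipeline (the source list is made up for the example):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

List<String> source = Arrays.asList("apple", "avocado", "banana");

// Anti-pattern: collecting after filter only to stream again for map.
List<String> bad = source.stream()
        .filter(s -> s.startsWith("a"))
        .collect(Collectors.toList()) // materialized too soon
        .stream()
        .map(String::toUpperCase)
        .collect(Collectors.toList());

// Better: one fused pipeline with a single terminal operation.
List<String> good = source.stream()
        .filter(s -> s.startsWith("a"))
        .map(String::toUpperCase)
        .collect(Collectors.toList());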
In general it's best to ask for what is actually needed by the method. If every time you receive a list you make it a stream, ask for the stream instead (streams can come from things other than lists). This enhances portability and reusability, and lets consumers of your API know the requirements upfront.
Ask for exactly what you need, nothing more, nothing less.
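As a sketch of that principle (User and isActive are hypothetical names invented for the example):

import java.util.stream.Stream;

// Hypothetical domain type for illustration.
class User {
    private final boolean active;
    User(boolean active) { this.active = active; }
    boolean isActive() { return active; }
}

// The method asks for exactly what it needs: a Stream, not a List. Callers
// may stream from a collection, a file, a generator, or anything else.
static long countActive(Stream<User> users) {
    return users.filter(User::isActive).count();
}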
We know Java has fail-fast and fail-safe iteration design.
I want to know why we need these, basically. Is there a fundamental reason for this separation? What is the benefit of this design?
Other than collection iteration, does Java follow this design anywhere else?
Thanks
For a programmer it is useful if code that contains a bug fails as fast as possible. The best would be at compile time, but that is sometimes not possible.
Failing fast means an error can be detected earlier. Otherwise an error could remain undetected and sit in your software for years without anybody noticing the bug.
So the goal is to detect errors and bugs as early as possible. Methods that accept parameters should therefore immediately check that the arguments are within the allowed range; this is a step towards fail-fast.
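A minimal sketch of such a check (the method and the valid range are made up):

// Fail-fast argument validation: a bad value fails right at the call site
// instead of corrupting state and surfacing much later.
static int checkPercentage(int value) {
    if (value < 0 || value > 100) {
        throw new IllegalArgumentException("percentage out of range: " + value);
    }
    return value;
}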
Those design goals are everywhere, not just in iteration, simply because failing fast is a good idea. But sometimes you want to make a trade-off with usability or other factors.
As an example, the classic Set implementations allow null values, which is the source of many bugs. The new collection factories in Java 9 (such as Set.of) do not allow this and immediately throw a NullPointerException when you try to create a collection containing null; this is fail-fast.
Your iterators are also a nice example. We have some fail-fast iterators that do not allow modifications to the collection while iterating it, mainly because that could easily be a source of errors. Thus they throw a ConcurrentModificationException.
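For illustration, a tiny made-up example that trips the fail-fast check:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Structurally modifying the list while a for-each loop iterates it makes
// the iterator fail fast on the next step.
List<Integer> list = new ArrayList<>(Arrays.asList(1, 2, 3, 4));
for (Integer i : list) {
    if (i == 2) {
        list.remove(i); // the next iteration throws ConcurrentModificationException
    }
}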
In the special case of iteration, someone needs to specify how things should work. How should the iteration behave when a new element arrives or one gets removed while iterating? Should it be included in/removed from the iteration, or should it be ignored?
As an example, ListIterator provides a very well-defined way to iterate a List and also manipulate it. There you are only allowed to manipulate the element the iterator is currently at, which keeps the behaviour well-defined.
On the other side there are fail-safe iteration approaches, such as the copy-on-write collections (e.g. CopyOnWriteArrayList; note that ConcurrentHashMap's iterators are weakly consistent rather than copy-based). These do not throw such exceptions and tolerate modifications. They iterate over a snapshot of the original collection, which also settles the "how" question: changes made during iteration are not reflected in the iterator; it completely ignores them. This comes at an increased cost, since a fresh copy must be created on every modification.
For me personally those fail-safe variants are not an option. I use the fail-fast variants and record the manipulations I want to make; after the iteration has finished, I apply those recorded manipulations.
This is less comfortable for the programmer to write, but you get the advantage of fail-fast. This is what I meant by a trade-off with usability.
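A sketch of that pattern (the data is made up):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Record the desired changes while iterating, then apply them afterwards.
List<String> names = new ArrayList<>(Arrays.asList("alice", "bob", "carol"));
List<String> toRemove = new ArrayList<>();
for (String name : names) {
    if (name.startsWith("b")) {
        toRemove.add(name); // removing here would trip the fail-fast iterator
    }
}
names.removeAll(toRemove); // apply the recorded manipulations after iterating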
Is it bad practice to directly manipulate data like:
Sorter.mergeSort(testData); //(testData is now sorted)
Or should I create a copy of the data and then manipulate and return that, like:
sortedData = Sorter.mergeSort(testData); // (sortedData is now sorted and testData remains unsorted)?
I have several sorting methods and I want to be consistent in the way they manipulate data. With my insertionSort method I can work directly on the unsorted data. However, if I want to leave the unsorted data untouched, I would have to create a copy of it in the insertionSort method and manipulate and return that (which seems rather unnecessary). On the other hand, in my mergeSort method I need to create a copy of the unsorted data one way or another, so I ended up doing something that also seems rather unnecessary as a workaround to returning a new sorted list:
List<Comparable> sorted = mergeSortHelper(target);
target.clear();
target.addAll(sorted);
Please let me know which is the better practice, thanks!
It depends whether you're optimising for performance or functional purity. Generally in Java functional purity is not emphasised; for example, Collections.sort sorts the list you give it (even though it's implemented by making an array copy first).
I would optimise for performance here, as that seems more like typical Java, and anyone who wants to can always copy the collection first, like Sorter.mergeSort(new ArrayList<>(testData));
The best practice is to be consistent.
Personally I prefer my methods to not modify the input parameters since it might not be appropriate in all situations (you're pushing the responsibility onto the end user to make a copy if they need to preserve the original ordering).
That being said, there are clear performance benefits of modifying the input (especially for large lists). So this might be appropriate for your application.
As long as the functionality is clear to the end user you're covered either way!
In Java I usually provide both options (when writing re-usable utility methods, anyway):
/** Return a sorted copy of the data from col. */
public <T extends Comparable<T>> List<T> mergeSort(Collection<T> col);

/** Sort the data in col in place. */
public <T extends Comparable<T>> void mergeSortIn(List<T> col);
I'm making some assumptions re the signatures and types here. That said, the Java norm is - or at least, has been* - generally to mutate state in place. This is often a dangerous thing, especially across API boundaries - e.g. changing a collection passed to your library by its 'client' code. Minimizing the overall state-space and mutable state in particular is often the sign of a well designed application/library.
It sounds like you want to re-use the same test data. To do that, I would write a method that builds the test data and returns it. That way, if you need the same test data again in a different test (e.g. to test your mergeSort() / insertionSort() implementations on the same data), you just build and return it again. I commonly do exactly this when writing unit tests (in JUnit, for example).
Either way, if your code is a library class/method for other people to use you should document its behaviour clearly.
Aside: in 'real' code there shouldn't really be any reason to specify that merge sort is the implementation used. The caller should care what it does, not how it does it - so the name wouldn't usually be mergeSort(), insertionSort(), etc.
(*) In some of the newer JVM languages there has been a conscious movement away from mutable data. Clojure, a functional programming language, has essentially no mutable state in normal, single-threaded application development; its core data structures are immutable. Scala provides a parallel set of collection libraries that do not mutate the state of collections. This has major advantages in multi-threaded, multi-processor applications, and it is not as time/space expensive as might be naively expected, due to the clever algorithms the collections use.
In your particular case, modifying the "actual" data is more efficient. You are sorting data, and it is observed that it's more efficient to work on sorted data than on unsorted data, so I don't see why you should keep the unsorted version. Check out Why is it faster to process a sorted array than an unsorted array?
Mutable objects can be manipulated in place by the functions, like Arrays#sort.
But immutable objects (like String) can only return "new" objects, like String#replace.
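To illustrate the contrast with two JDK examples:

import java.util.Arrays;

// Arrays.sort mutates its argument in place...
int[] numbers = {3, 1, 2};
Arrays.sort(numbers); // numbers is now {1, 2, 3}

// ...while String.replace leaves the original untouched and returns a new String.
String original = "foo";
String replaced = original.replace('o', 'a'); // replaced is "faa", original still "foo"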