Java 8 Stream string of map calls versus combining into one [duplicate] - java

This question already has answers here:
Using multiple map functions vs. a block statement in a map in a java stream
(2 answers)
Closed 3 years ago.
When using Java 8 Stream API, is there a benefit to combining multiple map calls into one, or does it not really affect performance?
For example:
stream.map(SomeClass::operation1).map(SomeClass::operation2);
versus
stream.map(o -> o.operation1().operation2());

The performance overhead here is negligible for most business-logic operations. You have two additional method calls in the pipeline (which may not be inlined by the JIT compiler in a real application). You also have a longer call stack (by one frame), so if an exception is thrown inside a stream operation, its creation will be slightly slower. These things might be significant if your stream performs really low-level operations like simple math. However, most real-world problems have a much higher computational cost, so the relative performance drop is unlikely to be noticeable. And if you actually are doing simple math and need the performance, it's better to stick with plain old for loops instead. Use the version you find more readable and do not optimize prematurely.
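If it helps to see the three shapes side by side, here is a minimal sketch; operation1 and operation2 are made-up placeholders doing trivial math, standing in for SomeClass::operation1 and ::operation2:
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapFusionSketch {
    // Hypothetical stand-ins for SomeClass::operation1 and ::operation2.
    static int operation1(int x) { return x + 1; }
    static int operation2(int x) { return x * 2; }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);

        // Two map stages: one extra pipeline stage and lambda call per element.
        List<Integer> twoStages = input.stream()
                .map(MapFusionSketch::operation1)
                .map(MapFusionSketch::operation2)
                .collect(Collectors.toList());

        // One fused map stage: a single lambda call per element.
        List<Integer> oneStage = input.stream()
                .map(x -> operation2(operation1(x)))
                .collect(Collectors.toList());

        // Plain old loop: what the answer recommends when the work really is
        // trivial math and performance matters.
        int[] loop = new int[input.size()];
        for (int i = 0; i < input.size(); i++) {
            loop[i] = operation2(operation1(input.get(i)));
        }

        System.out.println(twoStages + " " + oneStage + " " + Arrays.toString(loop));
    }
}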

What is the reason "forEach" in Java Streams API is unordered? [duplicate]

This question already has answers here:
forEach vs forEachOrdered in Java 8 Stream
(4 answers)
Closed 4 years ago.
As far as I'm aware, in parallel streams, methods such as findFirst, skip, and limit keep their behaviour as long as the stream is ordered (which it is by default), whether it's parallel or not. So I was wondering why the forEach method is different. I gave it some thought, but I just could not understand the necessity of defining a forEachOrdered method, when it would have been easier and less surprising to make forEach ordered by default; then you could call unordered on the stream instance and that's it, no need to define a new method.
Unfortunately my practical experience with Java 8 is quite limited at this point, so I would really appreciate it if someone could explain the reasons for this architectural decision to me, maybe with some simple examples/use-cases to show what could go wrong otherwise.
Just to make it clear, I'm not asking about this: forEach vs forEachOrdered in Java 8 Stream. I'm perfectly aware of how those methods work and the differences between them. What I'm asking about is the practical reasons for the architectural decision made by Oracle.
Defining a forEach method that would preserve order and an unordered method that would break it would complicate things, IMO; unordered does nothing more than set a flag in the stream API internals, and that flag would then have to be checked or enforced based on various conditions.
So let's say you would do:
someStream()
    .unordered()
    .forEach(System.out::println);
In this case, under your proposal, the elements would not be printed in any particular order; unordered would be enforced here. But what if we did:
someSet().stream()
    .unordered()
    .forEach(System.out::println);
In this case, would you want unordered to be enforced? After all, the source of the stream is a Set, which has no order, so enforcing unordered here is just useless; but that would mean additional checks on the stream's source internally. This can get quite tricky and complicated (as it already is, btw).
To keep it simpler, two methods were defined that clearly state what they will do; this is on par with, for example, findFirst vs findAny, or even Optional::isPresent and Optional::isEmpty (added in Java 11).
When you process elements of a Stream in parallel, you simply should not expect any guarantees on order.
The whole idea is that multiple threads work on different elements of that stream. They progress individually, therefore the order of processing is not predictable. It is nondeterministic, effectively random.
I could imagine that the people implementing that interface purposely give you a random order, to make it really clear that you should not expect any particular order when using parallel streams.
Methods such as findFirst, limit and skip require the encounter order of the input, so their behaviour doesn't change whether we use a parallel or a sequential stream. However, forEach does not need any order, and thus its behaviour is different.
For parallel stream pipelines, the forEach operation does not guarantee to respect the encounter order of the stream, as doing so would sacrifice the benefit of parallelism.
I would also suggest not using findFirst, limit and skip with parallel streams, as the overhead required to preserve the encounter order would reduce performance.
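As a quick illustration, a sketch contrasting the two terminal operations; the first line of output will vary between runs:
import java.util.stream.IntStream;

public class ForEachOrderingSketch {
    public static void main(String[] args) {
        // forEach on a parallel stream: the encounter order is not respected,
        // so the printed sequence can differ from run to run.
        IntStream.rangeClosed(1, 10).parallel()
                .forEach(i -> System.out.print(i + " "));
        System.out.println();

        // forEachOrdered on the same parallel stream: elements are still
        // processed by multiple threads, but the terminal action sees them
        // in encounter order (1 2 3 ... 10), at some cost to parallelism.
        IntStream.rangeClosed(1, 10).parallel()
                .forEachOrdered(i -> System.out.print(i + " "));
        System.out.println();
    }
}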

Why is there no Stream.flatMap(Collection) method? [duplicate]

This question already has an answer here:
Why can't Stream.flatMap accept a collection?
(1 answer)
Closed 5 years ago.
Currently, to convert a List<List<Foo>> to a Stream<Foo>, you have to use the following:
Stream<Foo> stream = list.stream().flatMap(fs -> fs.stream());
//or
Stream<Foo> stream = list.stream().flatMap(Collection::stream);
I think this is exactly what the method references were designed for, and it does improve readability quite a bit. Now consider this:
Stream<Bar> stream = list.stream().flatMap(fs -> fs.getBarList().stream());
With two chained method calls, no method reference is possible, and I've had this happen to me a few times. While it is not a big issue, it seems to drift away from the brevity that method references offer.
Having worked with JavaFX 8 a bit, I noticed that a constant of its APIs is the convenience methods. Java is a very verbose language, and it seemed to me that simple method overloads were a big selling point for JavaFX.
So my question is: why is there no convenience method Stream.flatMap(Collection) that could be called like:
Stream<Bar> stream = list.stream().flatMap(Foo::getBarList);
Is this an intentional omission by the folks at Oracle? Or could this cause any confusion?
Note: I'm aware of the "no-opinion-based-questions policy," and I'm not looking for opinions, I'm just wondering if there is a reason that such a method is not implemented.
Because Stream is already a pretty big interface and there's resistance to making it bigger.
Because there's also the workaround list.stream().map(Foo::getBarList).flatMap(List::stream).
You can also see the original discussion at http://mail.openjdk.java.net/pipermail/lambda-libs-spec-observers/2013-April/001690.html ; I'm not seeing that option specifically discussed; it may have been discarded already at that point?
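For reference, a minimal self-contained sketch of that workaround, with made-up Foo and Bar classes mirroring the question (getBarList is the assumed accessor):
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlatMapWorkaroundSketch {
    static class Bar { }

    static class Foo {
        private final List<Bar> bars;
        Foo(List<Bar> bars) { this.bars = bars; }
        List<Bar> getBarList() { return bars; }
    }

    public static void main(String[] args) {
        List<Foo> list = Arrays.asList(
                new Foo(Arrays.asList(new Bar(), new Bar())),
                new Foo(Arrays.asList(new Bar())));

        // The workaround: keep both stages as method references by splitting
        // the extraction (map) from the flattening (flatMap).
        Stream<Bar> bars = list.stream()
                .map(Foo::getBarList)
                .flatMap(List::stream);

        System.out.println(bars.collect(Collectors.toList()).size()); // prints 3
    }
}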

Do I have to explicitly use Dataframe's methods to take advantage of Dataset's optimization? [duplicate]

This question already has answers here:
Spark 2.0 Dataset vs DataFrame
(3 answers)
Closed 5 years ago.
To take advantage of Dataset's optimization, do I have to explicitly use DataFrame's methods (e.g. df.select(col("name"), col("age")), etc.), or would calling any of Dataset's methods, even the RDD-like ones (e.g. filter, map, etc.), also allow for optimization?
DataFrame optimization comes, in general, in three flavors:
Tungsten memory management
Catalyst query optimization
Whole-stage code generation
Tungsten memory management
When defining an RDD[myclass], Spark has no real understanding of what myclass is. This means that, in general, each row will contain an actual instance of the class.
This has two problems.
The first is the size of the object. A Java object has overhead. For example, take a case class containing two simple integers: turning a sequence of 1,000,000 instances into an RDD takes ~26MB, while doing the same with a Dataset/DataFrame takes ~2MB.
In addition, in a Dataset/DataFrame this memory is not managed by garbage collection (it is managed as unsafe memory internally by Spark), and so it puts less pressure on the GC.
A Dataset enjoys the same memory management advantages as a DataFrame. That said, when doing Dataset operations, converting the data from the internal (Row) representation to the case class has a performance overhead.
Catalyst query optimization
When using DataFrame functions, Spark knows what you are trying to do and can sometimes rewrite your query into an equivalent one that is more efficient.
Let's say for example that you are doing something like:
df.withColumn("a", lit(1)).filter($"b" < ($"a" + 1))
Basically, you are checking if (x < 1 + 1). Spark is smart enough to understand this and change it to x < 2.
These kinds of optimizations cannot be applied when using typed Dataset operations, as Spark has no insight into the internals of the functions you pass.
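To make the contrast concrete, here is a rough sketch of the same idea using Spark's Java API; the Rec bean and its field names are made up for illustration. The Column-based filter is visible to Catalyst, while the lambda passed to the typed filter is not:
import java.util.Arrays;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

public class CatalystVisibilitySketch {
    // Hypothetical bean with a single field b (the column a is added below).
    public static class Rec implements java.io.Serializable {
        private int b;
        public Rec() { }
        public Rec(int b) { this.b = b; }
        public int getB() { return b; }
        public void setB(int b) { this.b = b; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("catalyst-sketch").master("local[*]").getOrCreate();

        Dataset<Rec> ds = spark.createDataset(
                Arrays.asList(new Rec(1), new Rec(5)), Encoders.bean(Rec.class));
        Dataset<Row> df = ds.toDF();

        // DataFrame style: the predicate is a Column expression, so Catalyst
        // can see it and simplify b < (a + 1) with a = 1 down to b < 2.
        Dataset<Row> visible = df.withColumn("a", lit(1))
                .filter(col("b").lt(col("a").plus(1)));

        // Typed Dataset style: the lambda is a black box to Catalyst, so no
        // such rewrite is possible; it also pays the Row-to-Rec conversion.
        Dataset<Rec> opaque = ds.filter((FilterFunction<Rec>) r -> r.getB() < 1 + 1);

        visible.explain();  // optimized plan should show the simplified predicate
        opaque.explain();   // plan just shows an opaque filter function
        spark.stop();
    }
}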
Whole-stage code generation
When Spark knows what you are doing, it can actually generate more efficient code. This can improve performance by a factor of 10 in some cases.
This also cannot be done for typed Dataset functions, as Spark does not know the internals of those functions.

Double-checked locking as an anti-pattern [duplicate]

This question already has answers here:
Java double checked locking
(11 answers)
Closed 8 years ago.
There's a common belief, and multiple sources (including Wikipedia) claim, that this idiom is an anti-pattern.
What are the arguments against using it in production code, given that a correct implementation is used (for example, one using volatile)?
What are the appropriate alternatives for implementing lazy initialization in a multithreaded environment? Locking the whole method may become a bottleneck, and even though modern synchronization is relatively cheap, it's still much slower, especially under contention. The static holder seems to be a language-specific and somewhat ugly hack (at least to me). Atomics-based implementations don't seem that different from traditional DCL, while either allowing multiple calculations or requiring more complicated code. For example, Scala still uses DCL to implement lazy values, while the proposed alternative seems to be much more complicated.
Don't use double-checked locking. Ever. It does not work. Don't try to find a hack to make it work, because it may not work on a later JRE.
As far as I know, there is no other safe way to do lazy initialization than locking the whole object / synchronizing:
synchronized (lock) {
    // lookup
    // lazy init
}
For singletons, the static holder (as #trashgod mentioned) is nice, but it will not remain a singleton if you have multiple classloaders.
If you require a lazy singleton in a multi-classloader environment, use the ServiceLoader.
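For completeness, a minimal sketch of the static holder idiom mentioned above; Resource is a made-up placeholder for whatever is expensive to create:
public class LazySingletonSketch {
    // Made-up placeholder for something expensive to create.
    static class Resource {
        Resource() { System.out.println("initializing..."); }
    }

    // Initialization-on-demand holder: the Holder class is not initialized
    // until getInstance() is first called, and the JVM guarantees that class
    // initialization is thread-safe, so no locking or volatile is needed.
    private static class Holder {
        static final Resource INSTANCE = new Resource();
    }

    public static Resource getInstance() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        System.out.println(getInstance() == getInstance()); // true, same instance
    }
}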

Java - List or Array? [duplicate]

This question already has answers here:
What does it mean to "program to an interface"?
(33 answers)
When to use a List over an Array in Java?
(9 answers)
Closed 9 years ago.
I know Lists make things much easier in Java compared to working with fixed-size arrays (lists let you add/remove elements at will, and they automagically resize, etc.).
I've read some things suggesting that arrays should be avoided in Java whenever possible, since they are not flexible at all (and can sometimes impose awkward limitations, such as when you don't know in advance what size an array needs to be, etc.).
Is it "good practice" to stop using arrays entirely and only use List logic instead? I'm sure the List types consume more memory than an array and thus have higher overhead, but is this significant? Most Lists would be GC'ed at runtime if they are left lying around anyway, so maybe it isn't as big a deal as I'm thinking?
I don't like dogma. Know the rules; know when to break the rules.
"Never" is too strong, especially when it comes to software.
Arrays and Lists are both potential targets for the GC, so that's a wash.
Yes, you have to know the size of an array before you start. For the cases when you do, there's nothing wrong with it.
It's easy to go back and forth as needed using java.util.Collections and java.util.Arrays classes.
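For instance, a quick sketch of that round trip using only standard JDK methods:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArrayListRoundTrip {
    public static void main(String[] args) {
        String[] array = {"a", "b", "c"};

        // Array -> List: Arrays.asList gives a fixed-size view; wrap it in an
        // ArrayList if you need to add or remove elements afterwards.
        List<String> view = Arrays.asList(array);
        List<String> resizable = new ArrayList<>(Arrays.asList(array));
        resizable.add("d");

        // List -> Array.
        String[] back = resizable.toArray(new String[0]);

        System.out.println(view + " " + Arrays.toString(back));
    }
}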
I think a good rule of thumb would be to use Lists unless you NEED an Array (for memory/performance reasons). Otherwise Lists are typically easier to maintain and thus less likely to cause future bugs.
Lists provide more flexibility/functionality in terms of auto-expansion, so unless you are either pressed for memory (and can't afford the overhead that Lists create) or don't mind managing the array's size yourself as your data grows and shrinks, I would recommend Lists.
Try not to micromanage the code too much, and instead focus on more discernible and readable components.
It depends on the list. A LinkedList takes up space only as it's needed, while an ArrayList grows its backing array substantially whenever its capacity is reached. Internally, an ArrayList is implemented using an array, but it's an array that's always somewhat larger than what you currently need. However, since it stores references, not objects, the memory overhead is negligible in most cases, and the convenience is worth it, I believe.
I would have to say I follow this approach of using the Collections framework where I might otherwise have used an array. The collections offer you many benefits and conveniences over arrays, but yes, there is likely to be some performance hit.
It is better to write code that is easy to understand and hard to break; arrays require you to put in a lot of checking code to ensure you don't access parts of the array you shouldn't, or put too many things in it, etc. Given that the majority of the time performance is not a problem, it shouldn't be a concern.
