I watched a talk from JavaDays; the author said that this probability-based approach is very effective for storing Strings, as an analogue of the String.intern() method:
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

public class CHMDeduplicator<T> {
    private final int prob;
    private final Map<T, T> map;

    public CHMDeduplicator(double prob) {
        // Map the probability in [0.0, 1.0] onto an int threshold
        this.prob = (int) (Integer.MIN_VALUE + prob * (1L << 32));
        this.map = new ConcurrentHashMap<>();
    }

    public T dedup(T t) {
        if (ThreadLocalRandom.current().nextInt() > prob) {
            return t;
        }
        T exist = map.putIfAbsent(t, t);
        return (exist == null) ? t : exist;
    }
}
Please explain to me: what is the effect of the probability in this line?
if (ThreadLocalRandom.current().nextInt() > prob) return t;
This is the original presentation from Java Days: https://shipilev.net/talks/jpoint-April2015-string-catechism.pdf
(slide 56)
If you look at the next slide, which has a table of data for different probabilities, or listen to the talk, you will see/hear the rationale: probabilistic deduplicators balance the time spent deduplicating the Strings against the memory savings coming from the deduplication. This allows you to fine-tune the time spent processing Strings, or even to sprinkle low-probability deduplicators around the code, thus amortizing the deduplication costs.
(Source: these are my slides)
The double value passed to the constructor is intended to be a probability in the range 0.0 to 1.0. It is converted to an int threshold such that the proportion of int values below the threshold equals the double value.
The comparison nextInt() > prob therefore evaluates to true (skipping deduplication) with probability 1 - p, so the dedup path is taken with a probability equal to the constructor parameter. By using integer math, the check is slightly faster than it would be if the raw double value were compared.
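To make the conversion concrete, here is a small self-contained sketch (my own, not from the talk or the answer) that computes the threshold and verifies the observed frequency empirically:

import java.util.concurrent.ThreadLocalRandom;

public class ProbabilityThresholdDemo {
    public static void main(String[] args) {
        double p = 0.25; // requested dedup probability
        // Map p in [0.0, 1.0] onto a threshold within the int range
        int threshold = (int) (Integer.MIN_VALUE + p * (1L << 32));

        int trials = 10_000_000;
        int hits = 0;
        for (int i = 0; i < trials; i++) {
            // nextInt() is uniform over all 2^32 int values, so this branch
            // (the "do deduplicate" branch in dedup()) is taken with probability p
            if (ThreadLocalRandom.current().nextInt() <= threshold) {
                hits++;
            }
        }
        System.out.printf("requested %.3f, observed %.3f%n", p, (double) hits / trials);
    }
}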
The intention of the implementation is that sometimes it won't cache the String, instead just returning it. The reason for doing this is a CPU vs. memory performance trade-off: if the memory-saving caching process causes a CPU bottleneck, you can turn up the "do nothing" probability until you find a balance.
Related
I have encountered a bit of a paradox that I am trying to understand. Basically, I have two variants of an object in a threaded setting. The variants differ only in that the second one has an immutable array of immutable objects of fixed length, and yet this second variant is considerably slower than the first. Here is the setup:
final class Object {
    public Pair<Long, ImmutableThing> cache;

    public ImmutableThing getThing(long timestamp) {
        if (timestamp > cache.getKey()) {
            ImmutableThing newThing = doExpensiveComputation(timestamp);
            cache = new Pair<>(newThing.getLong(), newThing);
            return newThing;
        } else {
            return cache.getValue();
        }
    }
}
This first version shows much better performance for the getThing method: it looks up the cache; if the data is valid, it returns it; otherwise, it does a fairly expensive computation, updates the cache, and returns the new value. I understand this is not thread-safe as written, but here is the second variant:
final class SlowerObject {
    public Pair<Long, ImmutableThing> cache;
    public final ArrayList<ImmutableThing> timelineOfThings;

    public ImmutableThing getThing(long timestamp) {
        if (timestamp > cache.getKey()) {
            ImmutableThing newThing = findInTimelineOfThings(timestamp);
            cache = new Pair<>(newThing.getLong(), newThing);
            return newThing;
        } else {
            return cache.getValue();
        }
    }
}
In this second variant, we pre-compute an array which stores all the possible values of the things we want to return from getThing (there are only 4 possibilities in my case). Instead of doing a computation if the cache is invalid, we just look up entries in the array until we find the correct one, and the computation to figure out which one is correct is nearly instant: just comparing long values. The array is never rewritten, only read.
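For concreteness, the lookup might be sketched like this (findInTimelineOfThings is the question's name; the body is my guess at the "just comparing long values" scan described above):

// Hypothetical sketch: scan the handful of precomputed things, comparing long keys only.
private ImmutableThing findInTimelineOfThings(long timestamp) {
    for (ImmutableThing thing : timelineOfThings) {
        if (timestamp <= thing.getLong()) {
            return thing;
        }
    }
    // Fall back to the last entry if nothing matched
    return timelineOfThings.get(timelineOfThings.size() - 1);
}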
This is all occurring in a threaded environment. Why should the second one be slower?
public class BusinessLog {
    private Date logDate;
    private double prize;
    // getters and setters omitted
}
Given the BusinessLog object, I need the sum of all the prizes in a List<BusinessLog> list, and I also need to return the lowest date from the list. Is it possible with a lambda?
I can do it with forEach for sure, but how do I do it with a lambda?
What I have tried so far is:
BigDecimal balance = BigDecimal.ZERO;
if (list != null) {
list.forEach(businessLog -> {
balance.add(BigDecimal.valueOf(businessLog.getPrize()));
// how to get lowest date
});
}
A solution using streaming
To find the minimum date, you can do something like this:
Optional<Date> minDate = list.stream().map(BusinessLog::getLogDate).min(Date::compareTo);
And to calculate the sum:
double sum = list.stream().mapToDouble(BusinessLog::getPrize).sum();
I wouldn't worry about "optimizing" this and trying to do it in one loop unless this is provably a major bottleneck in your system (unlikely). Keeping the two ideas separate makes the code easier to understand and maintain.
Your use of BigDecimal
Your code for doing balance.add(...) to a BigDecimal won't actually work the way you've written it because the add method on BigDecimal returns a new instance rather than mutating the existing instance. BigDecimal instances are immutable. You can't assign a new value to balance because it's effectively final from the context of the lambda.
The idea of using BigDecimal is a good one though. You should avoid using double for anything where exact decimal places are important (e.g. money). If you change prize to a BigDecimal you can't use sum() but you can use reduce() to fulfil the same function.
BigDecimal sum = list.stream().map(BusinessLog::getPrize).reduce(BigDecimal.ZERO, BigDecimal::add);
To compute multiple custom aggregations, you should write a custom reducer.
For example, the following reducer calculates the stats in one BusinessLog object. It assumes a constructor public BusinessLog(double prize, Date logDate) is declared.
BusinessLog stats =
bl.stream()
.reduce((log1, log2) -> new BusinessLog(
log1.getPrize() + log2.getPrize(),
log1.getLogDate().before(log2.getLogDate()) ?
log1.getLogDate() : log2.getLogDate()
)).get();
Date lowestDate = stats.getLogDate();
double prizeSum = stats.getPrize();
Please note that using BusinessLog as a temporary stats holder is essentially a hack; you'd want to design a separate class for that purpose.
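A minimal sketch of such a dedicated class (all names below are mine, not from the original code; it assumes the getPrize()/getLogDate() getters used above):

import java.util.Date;

final class PrizeStats {
    final double prizeSum;
    final Date lowestDate; // null until the first log is folded in

    PrizeStats(double prizeSum, Date lowestDate) {
        this.prizeSum = prizeSum;
        this.lowestDate = lowestDate;
    }

    // Fold one log entry into the running stats
    PrizeStats accumulate(BusinessLog log) {
        Date d = (lowestDate == null || log.getLogDate().before(lowestDate))
                ? log.getLogDate() : lowestDate;
        return new PrizeStats(prizeSum + log.getPrize(), d);
    }

    // Merge two partial results (needed by the three-argument reduce)
    PrizeStats combine(PrizeStats other) {
        Date d;
        if (lowestDate == null) {
            d = other.lowestDate;
        } else if (other.lowestDate == null || lowestDate.before(other.lowestDate)) {
            d = lowestDate;
        } else {
            d = other.lowestDate;
        }
        return new PrizeStats(prizeSum + other.prizeSum, d);
    }
}

Usage would then be a single pass:

PrizeStats stats = list.stream()
        .reduce(new PrizeStats(0, null), PrizeStats::accumulate, PrizeStats::combine);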
I solved this problem using a graph, but unfortunately now I'm stuck with having to use a 2D array, and I have questions about the best way to go about this:
public class Data {
    int[][] structure;

    public Data(int x, int y) {
        structure = new int[x][y];
    }

    public <<TBD>> generateRandom() {
        // This is what my question is about
    }
}
I have a controller/event handler class:
public class Handler implements EventHandler {
#Override
public void onEvent(Event<T> e) {
this.dataInstance.generateRandom();
// ... other stuff
}
}
Here is what each method will do:
Data.generateRandom() will generate a random value at a random location in the 2D int array, provided there exists a value in the structure that is not initialized or is equal to zero.
If there is no available spot in the structure, the structure's state is final (i.e. in the literal sense, not the Java declaration)
This is what I'm wondering:
What is the most efficient way to check if the board is full? Using a graph, I was able to check whether the board was full in O(1) and get an available yet random location in worst-case O(n^2 - 1), best-case O(1). Obviously, with an array, improving on n^2 is tough, so I'm now just focusing on execution speed and LOC. Would the fastest way now be to check the entire 2D array using streams, like:
Arrays.stream(board).flatMapToInt(tile -> tile.getX()).map(x -> x > 0).count() > board.getWidth() * board.getHeight()
(1) You can definitely use a parallel stream to safely perform read-only operations on the array. You can also use an anyMatch call, since (for the isFull check) you only care whether there exists any space that hasn't been initialized. That could look like this:
Arrays.stream(structure)
    .parallel()
    .flatMapToInt(Arrays::stream)
    .anyMatch(i -> i == 0)
However, that is still an n^2 solution. What you could do, though, is keep a counter of the number of uninitialized spaces, which you decrement when you initialize a space for the first time. Then the isFull check would always be constant time (you're just comparing an int to 0).
import java.util.concurrent.ThreadLocalRandom;

public class Data {
    private int numUninitialized;
    private int[][] structure;

    public Data(int x, int y) {
        if (x <= 0 || y <= 0) {
            throw new IllegalArgumentException("You can't create a Data object with an argument that isn't a positive integer.");
        }
        structure = new int[x][y];
        numUninitialized = x * y; // note: no 'int' here, or the field would be shadowed
    }

    public void generateRandom() {
        if (isFull()) {
            // do whatever you want when the array is full
        } else {
            // Pick a random space (possibly one that is already initialized)
            int x = ThreadLocalRandom.current().nextInt(structure.length);
            int y = ThreadLocalRandom.current().nextInt(structure[0].length);
            if (structure[x][y] == 0) {
                // A new, uninitialized space
                numUninitialized--;
            }
            // Populate the space with a random non-zero value
            // (zero is reserved as the "uninitialized" marker)
            int value;
            do {
                value = ThreadLocalRandom.current().nextInt(Integer.MIN_VALUE, Integer.MAX_VALUE);
            } while (value == 0);
            structure[x][y] = value;
        }
    }

    public boolean isFull() {
        return 0 == numUninitialized;
    }
}
Now, this is with my understanding that each time you call generateRandom you take a random space (including ones already initialized). If you are supposed to ONLY choose a random uninitialized space each time it's called, then you'd do best to hold an auxiliary data structure of all the open grid locations, so that you can easily find the next random open space and tell whether the structure is full, as sketched below.
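Here is a minimal sketch of that auxiliary structure (my own, under the same zero-means-uninitialized assumption): keep every still-open cell in a list, pick one at random in O(1), and swap-remove it so later picks stay O(1):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class OpenCells {
    private final List<int[]> open = new ArrayList<>();

    OpenCells(int width, int height) {
        for (int x = 0; x < width; x++) {
            for (int y = 0; y < height; y++) {
                open.add(new int[] {x, y});
            }
        }
    }

    boolean isFull() {
        return open.isEmpty();
    }

    // Returns a random still-open cell as {x, y}, or null when the grid is full.
    int[] takeRandomOpenCell() {
        if (open.isEmpty()) {
            return null;
        }
        int i = ThreadLocalRandom.current().nextInt(open.size());
        int[] cell = open.get(i);
        // Swap-remove: move the last element into slot i, then drop the tail.
        open.set(i, open.get(open.size() - 1));
        open.remove(open.size() - 1);
        return cell;
    }
}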
(2) What notification method is appropriate for letting other classes know the array is now immutable? It's hard to say, as it depends on the use case and the architecture of the rest of the system this is being used in. If this is an MVC application with heavy use of notifications between the data model and a controller, then an observer/observable pattern makes a lot of sense (see the sketch below). But if your application doesn't use that anywhere else, then perhaps just having the classes that care check the isFull method would make more sense.
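A bare-bones version of the observer idea (the listener interface and class names here are mine, not part of the question) could look like this:

import java.util.ArrayList;
import java.util.List;

interface FullListener {
    void onFull();
}

class ObservableData {
    private final List<FullListener> listeners = new ArrayList<>();
    private int numUninitialized;

    ObservableData(int x, int y) {
        this.numUninitialized = x * y;
    }

    void addFullListener(FullListener l) {
        listeners.add(l);
    }

    // Call this whenever a space is initialized for the first time.
    void spaceInitialized() {
        if (--numUninitialized == 0) {
            listeners.forEach(FullListener::onFull);
        }
    }
}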
(3) Java is efficient at creating and freeing short-lived objects. However, since the arrays can be quite large, allocating a new array object (and copying the data) each time you alter the array seems inefficient at best. Java has the ability to do some functional styles of programming (especially with the inclusion of lambdas in Java 8), but using only immutable objects and a purely functional style is kind of like forcing a square peg into Java's round hole.
This question already has answers here:
How to compute average of multiple numbers in sequence using Java 8 lambda
(7 answers)
Closed 6 years ago.
In the following class, I want to get the average of foo and bar across a List<HelloWorld> helloWorldList:
@Data
public class HelloWorld {
private Long foo;
private Long bar;
}
OPTION 1: JAVA
Long fooSum = 0L, barSum = 0L;
for (HelloWorld hw : helloWorldList) {
    fooSum += hw.getFoo();
    barSum += hw.getBar();
}
Long fooAvg = fooSum / helloWorldList.size();
Long barAvg = barSum / helloWorldList.size();
OPTION 2 : JAVA 8
Double fooAvg = helloWorldList.stream().mapToLong(HelloWorld::getFoo).average().orElse(0);
Double barAvg = helloWorldList.stream().mapToLong(HelloWorld::getBar).average().orElse(0);
Which approach is better ?
Is there any better way to get these values ?
Answer edit: this question has been marked as a duplicate, but after reading the comments from bradimus I ended up implementing this:
import java.util.function.Consumer;
import lombok.Getter;

public class HelloWorldSummaryStatistics implements Consumer<HelloWorld> {
    @Getter
    private int fooTotal = 0;
    @Getter
    private int barTotal = 0;
    @Getter
    private int count = 0;
public HelloWorldSummaryStatistics() {
}
@Override
public void accept(HelloWorld helloWorld) {
fooTotal += helloWorld.getFoo();
barTotal += helloWorld.getBar();
count++;
}
public void combine(HelloWorldSummaryStatistics other) {
fooTotal += other.fooTotal;
barTotal += other.barTotal;
count += other.count;
}
public final double getFooAverage() {
return getCount() > 0 ? (double) getFooTotal() / getCount() : 0.0d;
}
public final double getBarAverage() {
return getCount() > 0 ? (double) getBarTotal() / getCount() : 0.0d;
}
@Override
public String toString() {
return String.format(
"%s{count=%d, fooAverage=%f, barAverage=%f}",
this.getClass().getSimpleName(),
getCount(),
getFooAverage(),
getBarAverage());
}
}
Main Class:
// assumes HelloWorld has an all-args constructor (e.g. Lombok @AllArgsConstructor)
HelloWorld a = new HelloWorld(5L, 1L);
HelloWorld b = new HelloWorld(5L, 2L);
HelloWorld c = new HelloWorld(5L, 4L);
List<HelloWorld> hwList = Arrays.asList(a, b, c);
HelloWorldSummaryStatistics helloWorldSummaryStatistics = hwList.stream()
.collect(HelloWorldSummaryStatistics::new, HelloWorldSummaryStatistics::accept, HelloWorldSummaryStatistics::combine);
System.out.println(helloWorldSummaryStatistics);
Note: as suggested by others, if you need high precision, BigInteger etc. can be used.
The answers/comments you got so far don't mention one advantage of a streams-based solution: just by changing stream() to parallelStream() you could turn the whole thing into a multi-threaded solution.
Try doing that with "option 1", and see how much work it would need.
But of course, that would mean even more "overhead" in terms of things going on behind the covers costing CPU cycles; if you are talking about large datasets, though, it might actually benefit you.
At least you could very easily see how turning on parallelStream() would influence execution time!
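For illustration, the switch really is a single call site; a sketch, reusing the getter assumed for Option 2 above:

double fooAvg = helloWorldList.parallelStream()
        .mapToLong(HelloWorld::getFoo)
        .average()
        .orElse(0);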
If you want to find the average value in a list of integers, it is better to use the classic approach with iteration.
Streams have some overhead, and the JVM has to load classes for stream usage. But the JVM also has a JIT with lots of optimizations.
Please beware of incorrect benchmarking; use JMH, as sketched below.
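For example, a bare-bones JMH harness comparing the two options could look like this (my own sketch; it assumes the HelloWorld class above plus a hypothetical all-args constructor):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class AverageBenchmark {
    private List<HelloWorld> helloWorldList;

    @Setup
    public void setup() {
        helloWorldList = new ArrayList<>();
        for (long i = 0; i < 1_000; i++) {
            helloWorldList.add(new HelloWorld(i, i * 2)); // hypothetical all-args constructor
        }
    }

    @Benchmark
    public long loopAverage() {
        long fooSum = 0;
        for (HelloWorld hw : helloWorldList) {
            fooSum += hw.getFoo();
        }
        return fooSum / helloWorldList.size();
    }

    @Benchmark
    public double streamAverage() {
        return helloWorldList.stream().mapToLong(HelloWorld::getFoo).average().orElse(0);
    }
}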
Streams are good and effective when your iteration operation is not as simple as summing two integers.
Also, streams allow you to parallelize code. There are no hard criteria for when parallelizing beats a single thread. For me: if a function call takes over 100 ms, you can parallelize it.
So, if your dataset processing takes >100 ms, try parallelStream.
If not, use iteration.
P.S. Doug Lea - "When to use parallel streams"
Which approach is better ?
When you say "better", do you mean "closer to the sample's true average" or "more efficient" or what? If efficiency is your goal, streams entail a fair amount of overhead that is often ignored. However, they provide readability and conciser code. It depends upon what you're trying to maximize, how large your datasets are, etc.
Perhaps rephrase the question?
Assume that we have a given interface:
public interface StateKeeper {
public abstract void negateWithoutCheck();
public abstract void negateWithCheck();
}
and following implementations:
class StateKeeperForPrimitives implements StateKeeper {
private boolean b = true;
public void negateWithCheck() {
if (b == true) {
this.b = false;
}
}
public void negateWithoutCheck() {
this.b = false;
}
}
class StateKeeperForObjects implements StateKeeper {
private Boolean b = true;
@Override
public void negateWithCheck() {
if (b == true) {
this.b = false;
}
}
@Override
public void negateWithoutCheck() {
this.b = false;
}
}
Moreover, assume that the negate*Check() methods can be called one or many times, and it is hard to say what the upper bound on the number of calls is.
The question is: which method in both implementations is 'better'
according to execution speed, garbage collection, memory allocation, etc. -
negateWithCheck or negateWithoutCheck?
Does the answer depend on which of the two proposed implementations we use, or does it not matter?
Does the answer depend on the estimated number of calls? For what number of calls is it better to use one method or the other?
There might be a slight performance benefit in using the one with the check. I highly doubt that it matters in any real life application.
premature optimization is the root of all evil (Donald Knuth)
You could measure the difference between the two. Let me emphasize that these kinds of things are notoriously difficult to measure reliably.
Here is a simple-minded way to do it. You can hope for performance benefits if the check recognizes that the value doesn't have to be changed, saving you an expensive write to memory. So I have changed your code accordingly:
interface StateKeeper {
public abstract void negateWithoutCheck();
public abstract void negateWithCheck();
}
class StateKeeperForPrimitives implements StateKeeper {
private boolean b = true;
public void negateWithCheck() {
if (b == false) {
this.b = true;
}
}
public void negateWithoutCheck() {
this.b = true;
}
}
class StateKeeperForObjects implements StateKeeper {
private Boolean b = true;
public void negateWithCheck() {
if (b == false) {
this.b = true;
}
}
public void negateWithoutCheck() {
this.b = true;
}
}
public class Main {
public static void main(String[] args) {
StateKeeper[] array = new StateKeeper[10_000_000];
for (int i=0; i<array.length; ++i)
//array[i] = new StateKeeperForObjects();
array[i] = new StateKeeperForPrimitives();
long start = System.nanoTime();
for (StateKeeper e : array)
e.negateWithCheck();
//e.negateWithoutCheck();
long end = System.nanoTime();
System.err.println("Time in milliseconds: "+((end-start)/1000000));
}
}
I get the following results:

            check    no check
primitive   17 ms    24 ms
Object      21 ms    24 ms
Going the other way around, when the check is always superfluous because the value always has to be changed, I didn't find any performance penalty for the check.
Two things: (1) these timings are unreliable. (2) This benchmark is far from any real-life application; I had to make an array of 10 million elements to actually see something.
I would simply pick the function with no check. I highly doubt that in any real application you would get any measurable performance benefit from the function that has the check, and that check is error-prone and harder to read.
Short answer: the version without the check will always be faster.
An assignment takes a lot less computation time than a comparison; therefore, an if statement is always slower than a plain assignment.
When comparing 2 variables, your CPU will fetch the first variable, fetch the second variable, compare the two, and store the result in a temporary register. That's 2 fetches, 1 compare, and 1 store.
When you assign a value, your CPU will fetch the value on the right-hand side of the '=' and store it into memory. That's 1 fetch and 1 store.
In general, if you need to set some state, just set the state. If, on the other hand, you have to do something more, like log the change, inform about the change, etc., then you should first inspect the old value.
But in the case where methods like the ones you provided are called very intensively, there may be some performance difference between checking and not checking (whether the new value is different). Possible outcomes are:
1-a) check returns false
1-b) check returns true, value is assigned
2) value is assigned without check
As far as I know, writing is always slower than reading (all the way down to the register level), so the fastest outcome is 1-a. If the most common case is that the value will not be changed ('more than 50%' logic is just not good enough; the exact percentage has to be figured out empirically), then you should go with checking, as this eliminates a redundant write (the value assignment). If, on the other hand, the value differs more often than not, assign it without checking.
You should test your concrete cases, do some profiling, and based on the results determine the best implementation. There is no general "best way" for this case (apart from "just set the state").
As for boolean vs Boolean here, I would say (off the top of my head) that there should be no performance difference.
Just today I've seen a few answers and comments repeating that
Premature optimization is the root of all evil
Well, obviously one more if statement is one more thing to do, but... it doesn't really matter.
And garbage collection and memory allocation are not an issue here.
I would generally consider negateWithCheck to be slightly slower, due to there always being a comparison. Also notice that in StateKeeperForObjects you are introducing some autoboxing; 'true' and 'false' are primitive boolean values.
Assuming you fix StateKeeperForObjects to use all objects, the difference is potential, but most likely not noticeable.
The speed will depend slightly on the number of calls, but in general the speed should be considered the same whether you call it once or many times (ignoring secondary effects such as caching, JIT, etc.).
It seems to me, a better question is whether or not the performance difference is noticeable. I work on a scientific project that involves millions of numerical computations done in parallel. We started off using Objects (e.g. Integer, Double) and had less than desirable performance, both in terms of memory and speed. When we switched all of our computations to primitives (e.g. int, double) and went over the code to make sure we were not introducing anything funky through autoboxing, we saw a huge performance increase (both memory and speed).
I am a huge fan of avoiding premature optimization, unless it is something that is "simple" to implement. Just be wary of the consequences. For example, do you have to represent null values in your data model? If so, how do you do that using a primitive? Doubles can be done easily with NaN, but what about Booleans?
negateWithoutCheck() is preferable because, if we consider the work per call, negateWithoutCheck() performs only one operation (this.b = false;), whereas negateWithCheck() performs an extra comparison on top of that.