Multithread - OutOfMemory - java

I am using a ThreadPoolExecutor with 5 active threads; the number of tasks is huge, about 20,000.
The queue is filled with instances of a Runnable task (pool.execute(new WorkingThreadTask())) almost immediately.
Each WorkingThreadTask has a HashMap:
Map<Integer, HashMap<Integer, String>> themap;
Each map can have up to 2000 entries, and each sub-map has 5 entries. There is also a shared BlockingQueue.
While the process is running I get an OutOfMemoryError. I'm running a 32-bit JVM with -Xms1024m -Xmx1024m.
How can I handle this problem? I don't think I have leaks in the HashMap... When the thread is finished, the HashMap is cleaned up, right?
Update:
After running a profiler and checking the memory, the biggest consumer is:
byte[]: 2,516,024 instances, 918 MB
I don't know where it is allocated or used.
Name                                                   Instance count    Size (bytes)
byte[]                                                 2519560           918117496
oracle.jdbc.ttc7.TTCItem                               2515402           120739296
char[]                                                 357882            15549280
java.lang.String                                       9677              232248
int[]                                                  2128              110976
short[]                                                2097              150024
java.lang.Class                                        1537              635704
java.util.concurrent.locks.ReentrantLock$NonfairSync   1489              35736
java.util.Hashtable$Entry                              1417              34008
java.util.concurrent.ConcurrentHashMap$HashEntry[]     1376              22312
java.util.concurrent.ConcurrentHashMap$Segment         1376              44032
java.lang.Object[]                                     1279              60216
java.util.TreeMap$Entry                                828               26496
oracle.jdbc.dbaccess.DBItem[]                          802               10419712
oracle.jdbc.ttc7.v8TTIoac                              732               52704

I'm not sure about the inner map, but I suspect the problem is that you are creating a large number of tasks that fill up memory. You should be using a bounded task queue and limiting the job producer.
Take a look at my answer here: Process Large File for HTTP Calls in Java
To summarize it, you should create your own bounded queue and then use a RejectedExecutionHandler to block the producer until there is space in the queue. Something like:
final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(100);
ThreadPoolExecutor threadPool =
        new ThreadPoolExecutor(nThreads, nThreads, 0L, TimeUnit.MILLISECONDS, queue);
// we need our RejectedExecutionHandler to block if the queue is full
threadPool.setRejectedExecutionHandler(new RejectedExecutionHandler() {
    @Override
    public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
        try {
            // this will block the producer until there's room in the queue
            executor.getQueue().put(task);
        } catch (InterruptedException e) {
            throw new RejectedExecutionException("Unexpected InterruptedException", e);
        }
    }
});
Edit:
I don't think I have leaks in the HashMap... when the thread is finished, the HashMap is cleaned up, right?
You might consider aggressively calling clear() on the work HashMap and other collections when the task completes. Although they should be reaped by the GC eventually, giving the GC some help may solve your problem if you have limited memory.
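For example, a minimal sketch of what that could look like at the end of the task (assuming the task keeps its map in a field named themap, as in the question):
public void run() {
    try {
        // ... the task's actual work ...
    } finally {
        // help the GC: drop the bulk of the data as soon as the task is done
        for (HashMap<Integer, String> sub : themap.values()) {
            sub.clear();
        }
        themap.clear();
    }
}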
If this doesn't work, a profiler is the way to go to help you identify where the memory is being held.
Edit:
After looking at the profiler output, the byte[] is interesting. Typically this indicates some sort of serialization or other IO. You may also be storing blobs in a database. The oracle.jdbc.ttc7.TTCItem is very interesting however. That indicates to me that you are not closing a database connection somewhere. Make sure to use proper try/finally blocks to close your connections.
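For example, a minimal try/finally sketch (the data source and query are placeholders, not from the original post); on Java 7+ the same can be written more compactly with try-with-resources:
Connection conn = null;
PreparedStatement ps = null;
ResultSet rs = null;
try {
    conn = dataSource.getConnection();        // placeholder: however you obtain connections
    ps = conn.prepareStatement("SELECT ..."); // placeholder query
    rs = ps.executeQuery();
    while (rs.next()) {
        // ... process the row ...
    }
} finally {
    if (rs != null) try { rs.close(); } catch (SQLException ignored) { }
    if (ps != null) try { ps.close(); } catch (SQLException ignored) { }
    if (conn != null) try { conn.close(); } catch (SQLException ignored) { }
}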

HashMap carries quite a lot of overhead in terms of memory usage: roughly 36 bytes minimum per entry, plus the size of the key/value themselves, each of which will be at least 32 bytes (those are about the typical values for a 32-bit Sun JVM). Doing some quick math:
20,000 tasks, each with a 2000-entry map, where each value is another map with 5 entries.
-> a 5-entry map is 1 Map + 5 Map.Entry objects + 5 keys + 5 values = 16 objects at 32 bytes => 512 bytes per sub-map.
-> a 2000-entry map is 1 Map + 2000 Map.Entry objects + 2000 keys + 2000 sub-maps (each 512 bytes) => 2000 * (512 + 32 + 32) + 32 => about 1.1 MB.
-> 20,000 tasks at 1.1 MB each -> about 23 GB.
So your overall footprint is on the order of 23 GB.
The logical solution is to restrict the depth of the blocking queue feeding the ExecutorService, and only create enough child tasks to keep it busy. Set a limit of about 64 entries in the queue, and you will never have more than 64 + 5 tasks instantiated at one time; when space becomes available in the executor's queue, you can create and add another task.
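One possible way to implement that throttling, besides the RejectedExecutionHandler shown earlier, is a Semaphore sized to the limit (a sketch, not from the original answer; acquire() throws InterruptedException, and WorkingThreadTask is the task type from the question):
ExecutorService pool = Executors.newFixedThreadPool(5);
// at most 64 queued + 5 running tasks may exist at any moment
Semaphore permits = new Semaphore(64 + 5);

for (int i = 0; i < 20_000; i++) {
    permits.acquire();                     // the producer blocks here once the limit is reached
    pool.execute(() -> {
        try {
            new WorkingThreadTask().run(); // the task from the question
        } finally {
            permits.release();             // free a slot when this task finishes
        }
    });
}
pool.shutdown();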

You can improve the efficiency by not adding so many tasks ahead of what is being processed. Try checking the queue and only adding to it if there are fewer than 1000 entries.
You can also make the data structures more efficient. A Map with an Integer key can often be reduced to an array of some kind.
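For example, if the keys turn out to be small, dense integers, each sub-map Map<Integer, String> could become a plain array (a sketch under that assumption, not something stated in the question):
// before: Map<Integer, HashMap<Integer, String>> themap;
// after, assuming outer keys are 0..1999 and inner keys are 0..4:
String[][] themap = new String[2000][5];

themap[42][3] = "value";   // replaces themap.get(42).put(3, "value")
String v = themap[42][3];  // replaces themap.get(42).get(3)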
Lastly, 1 GB isn't that much these days. My mobile phone has 2 GB. If you are going to process large amounts of data, I suggest getting a machine with 32-64 GB of memory and a 64-bit JVM.

From the large byte[]s, I'd suspect IO related issues (unless you are handling video/audio or something).
Things to look at:
DB: Are you trying to read a large amount of data at once? You can, for example, use a cursor to avoid that (see the sketch after this list).
File/Network: Are you trying to read large amounts of data from a file or from the network at once? You should "propagate the load" to whatever is reading and regulate the rate of reading.
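For the DB case with plain JDBC, the fetch size is a standard hint that lets the driver pull rows from the cursor in small batches instead of buffering everything (a sketch assuming an open Connection conn; the query and process method are placeholders):
try (PreparedStatement ps = conn.prepareStatement("SELECT id, payload FROM some_table")) {
    ps.setFetchSize(100);                          // fetch rows in batches of ~100
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            process(rs.getInt(1), rs.getBytes(2)); // placeholder: handle one row at a time
        }
    }
}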
UPDATE: OK, so you are using a cursor to read from the DB. Now you need to make sure that reading from the cursor only progresses as you finish processing (aka "propagate the load"). To do this, use a thread pool like this:
BlockingQueue<Runnable> queue = new LinkedBlockingQueue<Runnable>(queueSize);
ThreadPoolExecutor tpe = new ThreadPoolExecutor(
        threadNum,
        threadNum,
        1000,
        TimeUnit.HOURS,
        queue,
        new ThreadPoolExecutor.CallerRunsPolicy());
Now when you submit work to this executor from the code that reads from the DB, it will block when the queue is full (the calling thread is used to run tasks and hence blocks).

Related

Why does processing of this Flux hang indefinitely at size 256?

I need to process events coming from a Flux in groups (by id) so that within an individual group each event is processed sequentially, but groups are processed in parallel. As far as I know, this can be achieved with groupBy and concatMap.
When I implemented this my tests started to hang indefinitely on some big numbers of unique ids. I isolated the problem to the code below and found a specific number on which the code starts to hang - 256. I definitely don't get why this happens at all and where 256 comes from.
Here is the code which hangs:
@ParameterizedTest
@ValueSource(ints = {250, 251, 252, 253, 254, 255, 256})
void freezeTest(int uniqueStringsCount) {
    var scheduler = Schedulers
            .newBoundedElastic(
                    1000,
                    1000,
                    "really-big-scheduler"
            );
    Flux.range(0, uniqueStringsCount)
            .map(Object::toString)
            .repeat()
            // this represents "a lot of events"
            .take(50_000)
            .groupBy(x -> x)
            // this gets the same results
            // .parallel(400)
            .parallel()
            .flatMap(group ->
                    group.concatMap(e ->
                            // this represents a processing operation on each event
                            Mono.fromRunnable(() -> {
                                try {
                                    Thread.sleep(0);
                                } catch (InterruptedException ex) {
                                    throw new RuntimeException(ex);
                                }
                            })
                            // this also doesn't work
                            // Mono.delay(Duration.ofMillis(0))
                            // Mono.empty()
                            // big scheduler doesn't help either
                            // ).subscribeOn(scheduler)
                    )
                    // ).runOn(scheduler)
            ).runOn(Schedulers.parallel())
            .then()
            .block();
}
We first construct a Flux with a lot of Strings (50k, just as an example), but there is only a limited number of unique strings in that Flux, so it is split into that many groups, which are processed in parallel. Events within each group, however, are processed sequentially via concatMap. And this code hangs only at 256 unique strings.
Initially, I thought that some thread pool somewhere was exhausted, so I added a really-big-scheduler to test that, but it only runs more slowly and still hangs at 256.
Then I tried removing the blocking Thread.sleep (I started with that since my real implementation is possibly blocking), but it also hangs at 256.
Also, changing the parallelism (400 in the code above) doesn't change anything.
Flux.groupBy needs extra care when dealing with a large number of groups, as stated in its javadoc:
Note that groupBy works best with a low cardinality of groups, so chose your keyMapper function accordingly.
The groups need to be drained and consumed downstream for groupBy to work correctly. Notably when the criteria produces a large amount of groups, it can lead to hanging if the groups are not suitably consumed downstream (eg. due to a flatMap with a maxConcurrency parameter that is set too low).
Here the prefetch amount is set too low: by default it is Queues.SMALL_BUFFER_SIZE, which defaults to 256 (this can be changed with the property reactor.bufferSize.small). Flux.groupBy has an overload to set the prefetch amount manually, Flux.groupBy(Function, int), so I suggest replacing your operator with .groupBy(x -> x, 1024) or another suitably high amount.
The prefetch amount matters because it is the number of uncompleted items groupBy can have in flight. In your case, the first 255 items each trigger a Scheduler.createWorker() call; each item is handed to a Worker, and the item together with its created GroupedFlux sits in groupBy's internal queues waiting for the Worker to complete. When the 256th item arrives before any Worker has completed, groupBy is unable to put it in its queues, and the pipeline hangs.
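Applied to the test above, the only change is the extra prefetch argument; a sketch:
Flux.range(0, uniqueStringsCount)
        .map(Object::toString)
        .repeat()
        .take(50_000)
        .groupBy(x -> x, 1024)   // prefetch raised above the expected number of groups
        .parallel()
        // ... rest of the pipeline unchanged ...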

Is there a way to force parallelStream() to go parallel?

If the input size is too small, the library automatically serializes the execution of the maps in the stream, but this automation doesn't and can't take into account how heavy the map operation is. Is there a way to force parallelStream() to actually parallelize CPU-heavy maps?
There seems to be a fundamental misunderstanding. The linked Q&A discusses that the stream apparently doesn’t work in parallel, due to the OP not seeing the expected speedup. The conclusion is that there is no benefit in parallel processing if the workload is too small, not that there was an automatic fallback to sequential execution.
It’s actually the opposite. If you request parallel, you get parallel, even if it actually reduces the performance. The implementation does not switch to the potentially more efficient sequential execution in such cases.
So if you are confident that the per-element workload is high enough to justify the use of a parallel execution regardless of the small number of elements, you can simply request a parallel execution.
As can easily be demonstrated:
Stream.of(1, 2).parallel()
      .peek(x -> System.out.println("processing "+x+" in "+Thread.currentThread()))
      .forEach(System.out::println);
On Ideone, it prints
processing 2 in Thread[main,5,main]
2
processing 1 in Thread[ForkJoinPool.commonPool-worker-1,5,main]
1
but the order of messages and details may vary. It may even be possible that in some environments both tasks happen to get executed by the same thread, if it can steal the second task before another thread is started to pick it up. But of course, if the tasks are expensive enough, this won't happen. The important point is that the overall workload has been split and enqueued to be potentially picked up by other worker threads.
If execution by a single thread happens in your environment for the simple example above, you may insert simulated workload like this:
Stream.of(1, 2).parallel()
.peek(x -> System.out.println("processing "+x+" in "+Thread.currentThread()))
.map(x -> {
LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(3));
return x;
})
.forEach(System.out::println);
Then, you may also see that the overall execution time will be shorter than "number of elements" × "processing time per element" if the "processing time per element" is high enough.
Update: the misunderstanding might be caused by Brian Goetz's misleading statement: "In your case, your input set is simply too small to be decomposed".
It must be emphasized that this is not a general property of the Stream API, but of the Map that has been used. A HashMap has a backing array, and the entries are distributed within that array depending on their hash codes. It might be the case that splitting the array into n ranges doesn't lead to a balanced split of the contained elements, especially if there are only two. The implementors of HashMap's Spliterator considered searching the array for elements to get a perfectly balanced split to be too expensive, not that splitting two elements wasn't worth it.
Since the HashMap’s default capacity is 16 and the example had only two elements, we can say that the map was oversized. Simply fixing that would also fix the example:
long start = System.nanoTime();
Map<String, Supplier<String>> input = new HashMap<>(2);
input.put("1", () -> {
System.out.println(Thread.currentThread());
LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(2));
return "a";
});
input.put("2", () -> {
System.out.println(Thread.currentThread());
LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(2));
return "b";
});
Map<String, String> results = input.keySet()
.parallelStream().collect(Collectors.toConcurrentMap(
key -> key,
key -> input.get(key).get()));
System.out.println("Time: " + TimeUnit.NANOSECONDS.toMillis(System.nanoTime()- start));
on my machine, it prints
Thread[main,5,main]
Thread[ForkJoinPool.commonPool-worker-1,5,main]
Time: 2058
The conclusion is that the Stream implementation always tries to use parallel execution, if you request it, regardless of the input size. But it depends on the input’s structure how well the workload can be distributed to the worker threads. Things could be even worse, e.g. if you stream lines from a file.
If you think that the benefit of a balanced split is worth the cost of a copying step, you could also use new ArrayList<>(input.keySet()).parallelStream() instead of input.keySet().parallelStream(), as the distribution of elements within the ArrayList always allows a perfectly balanced split.
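Applied to the example above, that change looks like this (only the source of the stream differs); a sketch:
// copying into an ArrayList costs one pass, but its spliterator splits perfectly in half
Map<String, String> results = new ArrayList<>(input.keySet())
        .parallelStream().collect(Collectors.toConcurrentMap(
                key -> key,
                key -> input.get(key).get()));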

neo4j update properties on 10 million nodes

I use the Neo4j Java Core API and want to update 10 million nodes.
I thought it would be better to do it with multithreading, but the performance is not that good (35 minutes for setting the properties).
To explain: each "Person" node has at least one "POINTSREL" relationship to a "Point" node, which has the property "Points". I want to sum up the points from the "Point" nodes and set the result as a property on the "Person" node.
Here is my code:
Transaction transaction = service.beginTx();
ResourceIterator<Node> iterator = service.findNodes(Labels.person);
transaction.success();
transaction.close();

ExecutorService executor = Executors.newFixedThreadPool(5);

while (iterator.hasNext()) {
    executor.execute(new MyJob(iterator.next()));
}

// wait until all threads are done
executor.shutdown();
try {
    executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
} catch (InterruptedException e) {
    e.printStackTrace();
}
And here is the Runnable class:
private class MyJob implements Runnable {
    private Node node;

    /* collect useful parameters in the constructor */
    public MyJob(Node node) {
        this.node = node;
    }

    public void run() {
        Transaction transaction = service.beginTx();
        Iterable<org.neo4j.graphdb.Relationship> rel = this.node.getRelationships(RelationType.POINTSREL, Direction.OUTGOING);
        double sum = 0;
        for (org.neo4j.graphdb.Relationship entry : rel) {
            try {
                sum += (Double) entry.getEndNode().getProperty("Points");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        this.node.setProperty("Sum", sum);
        transaction.success();
        transaction.close();
    }
}
Is there a better (faster) way to do that?
About my setting:
AWS instance with 8 CPUs and 32 GB RAM
neo4j-wrapper.conf
# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=16000
wrapper.java.maxmemory=16000
neo4j.properties
# The type of cache to use for nodes and relationships.
cache_type=soft
cache.memory_ratio=30.0
neostore.nodestore.db.mapped_memory=2G
neostore.relationshipstore.db.mapped_memory=7G
neostore.propertystore.db.mapped_memory=2G
neostore.propertystore.db.strings.mapped_memory=2G
neostore.propertystore.db.arrays.mapped_memory=512M
From my perspective, there are a few things that can be improved.
Offtopic
If you are using Java 7 (or greater), consider using try-with-resources to handle transactions. It will protect you from errors.
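A minimal sketch, using the service field from the question (Transaction is AutoCloseable in the Neo4j 2.x embedded API):
try (Transaction tx = service.beginTx()) {
    // ... read and update nodes ...
    tx.success();   // mark success before the transaction is auto-closed
}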
Performance
First of all: batching. Currently, for every single node, you are:
Creating a job
Starting a thread (actually, the executor uses a pool)
Starting a transaction
You should consider making the updates in batches instead. This means that you should (see the sketch after this list):
Collect N nodes (e.g. N = 1000)
Create a single job for those N nodes
Create a single transaction in the job
Update the N nodes in that transaction
Close the transaction
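A minimal sketch of such a batched job, reusing the names from the question (service, RelationType.POINTSREL, the org.neo4j.graphdb types); the batch size and class name are assumptions:
private class MyBatchJob implements Runnable {
    private final List<Node> batch;   // e.g. 1000 nodes collected by the producer loop

    public MyBatchJob(List<Node> batch) {
        this.batch = batch;
    }

    public void run() {
        // one transaction per batch instead of one per node
        try (Transaction tx = service.beginTx()) {
            for (Node node : batch) {
                double sum = 0;
                for (Relationship rel : node.getRelationships(RelationType.POINTSREL, Direction.OUTGOING)) {
                    sum += (Double) rel.getEndNode().getProperty("Points");
                }
                node.setProperty("Sum", sum);
            }
            tx.success();
        }
    }
}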
Setup
You have 8 CPUs. That means you can create a bigger thread pool. I think Executors.newFixedThreadPool(16) will be OK.
Hacks
You have 32 GB RAM. I can suggest:
Decrease the Java heap size to 8 GB. From my experience, a large heap can lead to long GC pauses and performance degradation.
Increase the mapped memory sizes, just to make sure that more data can be kept in the cache.
Just for your case: if all your data can fit in RAM, you can change cache_type to hard (see the Neo4j caching documentation for details).
Configuration
As you said, you are using the Core API. Is this an embedded graph database or a server extension?
If this is an embedded graph database, you should verify that your database settings are actually applied to the created instance.
I found out that there was, amongst others, a problem with the property "cache_type=soft". I set it to "cache_type=none" and the duration of the execution decreased from 30 minutes to 2 minutes. After some updates there were always threads that were blocked for about 30 seconds; changing this property helps to avoid these blockings. I will search for a more detailed explanation.

Parallel stream from a HashSet doesn't run in parallel

I have a collection of elements that I want to process in parallel. When I use a List, the parallelism works. However, when I use a Set, it does not run in parallel.
I wrote a code sample that shows the problem:
public static void main(String[] args) {
    ParallelTest test = new ParallelTest();
    List<Integer> list = Arrays.asList(1, 2);
    Set<Integer> set = new HashSet<>(list);

    ForkJoinPool forkJoinPool = new ForkJoinPool(4);

    System.out.println("set print");
    try {
        forkJoinPool.submit(() ->
            set.parallelStream().forEach(test::print)
        ).get();
    } catch (Exception e) {
        return;
    }

    System.out.println("\n\nlist print");
    try {
        forkJoinPool.submit(() ->
            list.parallelStream().forEach(test::print)
        ).get();
    } catch (Exception e) {
        return;
    }
}

private void print(int i) {
    System.out.println("start: " + i);
    try {
        TimeUnit.SECONDS.sleep(1);
    } catch (InterruptedException e) {
    }
    System.out.println("end: " + i);
}
This is the output that I get on Windows 7:
set print
start: 1
end: 1
start: 2
end: 2
list print
start: 2
start: 1
end: 1
end: 2
We can see that the first element from the Set had to finish before the second element is processed. For the List, the second element starts before the first element finishes.
Can you tell me what causes this issue, and how to avoid it using a Set collection?
I can reproduce the behavior you see, where the parallelism doesn't match the fork-join pool parallelism you've specified. After setting the fork-join pool parallelism to 10 and increasing the number of elements in the collection to 50, I see the parallelism of the list-based stream rise only to 6, whereas the parallelism of the set-based stream never gets above 2.
Note, however, that this technique of submitting a task to a fork-join pool to run the parallel stream in that pool is an implementation "trick" and is not guaranteed to work. Indeed, the threads or thread pool that is used for execution of parallel streams is unspecified. By default, the common fork-join pool is used, but in different environments, different thread pools might end up being used. (Consider a container within an application server.)
In the java.util.stream.AbstractTask class, the LEAF_TARGET field determines the amount of splitting that is done, which in turn determines the amount of parallelism that can be achieved. The value of this field is based on ForkJoinPool.getCommonPoolParallelism() which of course uses the parallelism of the common pool, not whatever pool happens to be running the tasks.
Arguably this is a bug (see OpenJDK issue JDK-8190974), however, this entire area is unspecified anyway. However, this area of the system definitely needs development, for example in terms of splitting policy, the amount of parallelism available, dealing with blocking tasks, among other issues. A future release of the JDK may address some of these issues.
Meanwhile, it is possible to control the parallelism of the common fork-join pool through the use of system properties. If you add this line to your program,
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "10");
and you run the streams in the common pool (or if you submit them to your own pool that has a sufficiently high level of parallelism set) you will observe that many more tasks are run in parallel.
You can also set this property on the command line using the -D option.
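For example, launching the test class from the question (the class name and the value of 10 are just illustrative):
java -Djava.util.concurrent.ForkJoinPool.common.parallelism=10 ParallelTest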
Again, this is not guaranteed behavior, and it may change in the future. But this technique will probably work for JDK 8 implementations for the foreseeable future.
UPDATE 2019-06-12: The bug JDK-8190974 was fixed in JDK 10, and the fix has been backported to an upcoming JDK 8u release (8u222).

Extra bytes appearing when building file data using multiple threads

I am working on a large-scale dataset and, after building a model, I use multithreading (the whole project is in Java) as follows:
OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile));
int i = 0;
Collection<Track1Callable> callables = new ArrayList<Track1Callable>();

// For each entry in the test file, do whatever needs to be done.
// Track1Callable actually processes that entry and returns a double value.
for (Pair<PreferenceArray, long[]> tests : new DataFileIterable(
        KDDCupDataModel.getTestFile(dataFileDirectory))) {
    PreferenceArray userTest = tests.getFirst();
    callables.add(new Track1Callable(recommender, userTest));
    i++;
}

ExecutorService executor = Executors.newFixedThreadPool(cores); // 24 cores
List<Future<byte[]>> results = executor.invokeAll(callables);
executor.shutdown();

for (Future<byte[]> result : results) {
    for (byte estimate : result.get()) {
        out.write(estimate);
    }
}
out.flush();
out.close();
When I receive the result from each callable, I write it out to a file. Does the output appear in the exact same order as the list of initial callables was built, even though some complete before others? It seems like it should, but I'm not sure.
Also, I expect a total of 6.2 million bytes to be written to the output file, but I get an additional 2000 bytes (yeah, for free). That messes up my submission, and I think it is because of some concurrency issue. I tested this on a small dataset and it seems to work fine there (264 bytes expected and received).
Is there anything wrong in how I am using the Executor framework or Futures?
Q: Is the order the same as the one in which the tasks were specified? Yes.
From the API:
"Returns: A list of Futures representing the tasks, in the same sequential order as produced by the iterator for the given task list. If the operation did not time out, each task will have completed. If it did time out, some of these tasks will not have completed."
As for the "extra" bytes: have you tried doing all of this in sequential order (i.e., without using an executor) and checking if you obtain different results? It seems that your problem is outside the code provided (and probably is not due to concurrency).
The order in which the callables are executed doesn't matter for the code you have here. You write the results in the order you stored the futures in the list. Even if they were executed in reverse order, the file should come out the same, since your file writing is single-threaded.
I suspect your callables are interacting with each other and you get different results depending on the number of cores you use. For example, you might be sharing a SimpleDateFormat.
I suggest you run this twice in the same program with a dataset which completes in a short time: first with only one thread in the thread pool, and a second time with 24 threads. You should be able to compare the results from both runs with Arrays.equals(byte[], byte[]) and see whether you get exactly the same results.
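A minimal sketch of that comparison (runAndCollect is a hypothetical helper that runs the pipeline above with the given pool size and returns everything that would have been written as one byte array):
// runAndCollect(int threads) is hypothetical: run the same pipeline with the given
// pool size and return all produced bytes as a single byte[].
byte[] singleThreaded = runAndCollect(1);
byte[] multiThreaded = runAndCollect(24);
System.out.println("identical: " + Arrays.equals(singleThreaded, multiThreaded));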
