Increase CPU utilization using parallel stream for DB intensive task - java

I am using java 8 parallel stream to insert data into DB.
The following is the code:
customers.parallelStream().forEach(t -> {
    UserTransaction userTransaction = new UserTransactionImp();
    try {
        userTransaction.begin();
        userTransaction.setTransactionTimeout(300);
        // code to write data to DB for each customer in a global transaction using Atomikos and Hibernate
        userTransaction.commit();
    } catch (Exception e) {
        try {
            userTransaction.rollback();
        } catch (Exception rollbackFailure) {
            // rollback() declares checked exceptions, so it needs its own handler inside the lambda
        }
    }
});
It takes more than 2 hours to complete the task. I did the same test running in two different instances (two java main methods) and the time taken came down to 1 hour. Is there any other way to scale up within one java instance? I am using Atomikos and Hibernate for persistence. I have configured batching, insert ordering and update ordering; everything is batched properly and working fine.
But I observed that the CPU is not utilized at more than 30% during this run. Is there any way to utilize more processors and scale this up?

parallelStream() basically gives you a "default" implementation. I heard a guy once say: "whenever you use this construct, measure its effects".
In other words: when you are not happy with the default implementation, you might have to look into your own implementation. Not focused on that single operation but the "whole picture".
Example: what if you "batch" together 5, 10, 50 "users" per "shot" - meaning: you reduce the number of transactions, but you allow more content to go into each (see the sketch at the end of this answer).
Yes, that is a pretty generic answer - but this is a pretty generic question. We have absolutely no insights what your code is doing there - so nobody here can tell what would be the "perfect" way to reduce overall runtime.
Beyond that: you want to profile your whole setup. Maybe your problem is not the "java" part - but your database. Not enough memory, too much workload ... or network, or, or, or. In other words: first focus on gaining an understanding where your performance bottleneck truly exists.
(A good read about "performance" and bottlenecks: the old classic "Release It!" by Michael Nygard.)
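To make the "batching" idea concrete, here is a minimal sketch, assuming a generic customer list and a writePerCustomer callback that stands in for the existing Hibernate code (both are placeholders, not part of the question). Note that setTransactionTimeout is called before begin, since per JTA it applies to transactions started afterwards.
import com.atomikos.icatch.jta.UserTransactionImp;
import javax.transaction.UserTransaction;
import java.util.List;
import java.util.function.Consumer;

public class BatchedInsert {

    // One global transaction per batch of customers instead of one per customer.
    public static <T> void insertInBatches(List<T> customers, int batchSize,
                                           Consumer<T> writePerCustomer) throws Exception {
        for (int from = 0; from < customers.size(); from += batchSize) {
            List<T> batch = customers.subList(from, Math.min(from + batchSize, customers.size()));
            UserTransaction tx = new UserTransactionImp();
            tx.setTransactionTimeout(300);       // applies to the transaction begun below
            tx.begin();
            try {
                batch.forEach(writePerCustomer); // existing per-customer Hibernate writes
                tx.commit();                     // one commit for the whole batch
            } catch (Exception e) {
                tx.rollback();                   // the whole batch fails together
            }
        }
    }
}
The batches themselves could still be submitted from multiple threads if the database can keep up; the point is that the transaction count, not the customer count, drives the commit overhead.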

Related

apache beam pipeline ingesting "Big" input file (more than 1GB) doesn't create any output file

Regarding the dataflow model of computation, I'm doing a PoC to test a few concepts using apache beam with the direct-runner (and the java sdk). I'm having trouble creating a pipeline which reads a "big" csv file (about 1.25GB) and dumps it into an output file without any particular transformation, like in the following code (I'm mainly concerned with testing IO bottlenecks using this dataflow/beam model because that's of primary importance for me):
// Example 1: reading from and writing to a file
Pipeline pipeline = Pipeline.create();
PCollection<String> output = pipeline
        .apply(TextIO.read().from("BIG_CSV_FILE"));
output.apply(
        TextIO.write()
              .to("BIG_OUTPUT")
              .withSuffix("csv")
              .withNumShards(1));
pipeline.run();
The problem that I'm having is that only smaller files do work, but when the big file is used, no output file is being generated (but also no error/exception is shown either, which makes debugging harder).
I'm aware that on the runners page of the apache-beam project (https://beam.apache.org/documentation/runners/direct/), it is explicitly stated under the memory considerations point:
Local execution is limited by the memory available in your local environment. It is highly recommended that you run your pipeline with data sets small enough to fit in local memory. You can create a small in-memory data set using a Create transform, or you can use a Read transform to work with small local or remote files.
The above suggests I'm having a memory problem (but sadly it isn't explicitly stated on the console, so I'm just left wondering here). I'm also concerned by their suggestion that the dataset should fit into memory (why isn't it reading the file in parts instead of fitting the whole file/dataset into memory?).
A second consideration I'd like to add to this conversation (in case this is indeed a memory problem): how basic is the implementation of the direct runner? It isn't hard to implement a piece of code that reads from a big file in chunks and also outputs to a new file (also in chunks), so that at no point in time does memory usage become a problem (because neither file is completely loaded into memory - only the current "chunk"). Even if the "direct-runner" is more of a prototyping runner to test semantics, would it be too much to expect it to deal nicely with huge files? - considering that this is a unified model built from the ground up to deal with streaming, where window size is arbitrary and huge data accumulation/aggregation before sinking it is a standard use case.
So, more than a question, I'd deeply appreciate your feedback/comments regarding any of these points: have you noticed IO constraints using the direct-runner? Am I overlooking some aspect, or is the direct-runner really so naively implemented? Have you verified that by using a proper production runner like flink/spark/google cloud dataflow this constraint disappears?
I'll eventually test with other runners like the flink or the spark one, but it feels underwhelming that the direct-runner (even if it is intended only for prototyping purposes) is having trouble with this first test I'm running - considering that the whole dataflow idea is based around ingesting, processing, grouping and distributing huge amounts of data under the umbrella of a unified batch/streaming model.
EDIT (to reflect Kenn's feedback):
Kenn, thanks for those valuable points and feedback, they have been of great help in pointing me towards relevant documentation. Following your suggestion, I found out by profiling the application that the problem is indeed a java-heap-related one (which somehow is never shown on the normal console, and only seen in the profiler). Even though the file is "only" 1.25GB in size, internal usage goes beyond 4GB before dumping the heap, suggesting the direct-runner isn't "working by chunks" but is indeed loading everything into memory (as their doc says).
Regarding your points:
1 - I believe that serialization and shuffling can very well still be achieved through a "chunk by chunk" implementation. Maybe I had a false expectation of what the direct-runner should be capable of, or I didn't fully grasp its intended reach; for now I'll refrain from doing non-functional types of tests while using the direct-runner.
2 - Regarding sharding: I believe the NumOfShards controls the parallelism (and the number of output files) at the write stage (processing before that should still be fully parallel, and only at the time of writing will it use as many workers - and generate as many files - as explicitly provided). Two reasons to believe this are: first, the CPU profiler always shows 8 busy "direct-runner-workers" (mirroring the number of logical cores that my PC has), regardless of whether I set 1 shard or N shards. The second reason is what I understand from the documentation here (https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/io/WriteFiles.html):
By default, every bundle in the input PCollection will be processed by a FileBasedSink.WriteOperation, so the number of output will vary based on runner behavior, though at least 1 output will always be produced. The exact parallelism of the write stage can be controlled using withNumShards(int), typically used to control how many files are produced or to globally limit the number of workers connecting to an external service. However, this option can often hurt performance: it adds an additional GroupByKey to the pipeline.
One interesting thing here is that the "additional GroupByKey added to the pipeline" is kind of undesired in my use case (I only want the results in 1 file, without any regard for order or grouping), so probably adding an extra "flatten the files" step after the N sharded output files are generated is a better approach.
3 - your suggestion for profiling was spot on, thanks.
Final Edit: the direct runner is not intended for performance testing, only for prototyping and checking the well-formedness of the data. It doesn't have any mechanism for splitting and dividing work by partitions, and handles everything in memory.
There are a few issues or possibilities. I will answer in priority order.
The direct runner is for testing with very small data. It is engineered for maximum quality assurance, with performance not much of a priority. For example:
it randomly shuffles data to make sure you are not depending on ordering that will not exist in production
it serializes and deserializes data after each step, to make sure the data will be transmitted correctly (production runners will avoid serialization as much as possible)
it checks whether you have mutated elements in forbidden ways, which would cause you data loss in production
The data you are describing is not very big, and the DirectRunner can process it eventually in normal circumstances.
You have specified numShards(1) which explicitly eliminates all parallelism. It will cause all of the data to be combined and processed in a single thread, so it will be slower than it could be, even on the DirectRunner. In general, you will want to avoid artificially limiting parallelism.
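To illustrate that point, here is a minimal sketch of the same pipeline that lets the runner pick the sharding; it assumes multiple output shards are acceptable (they can always be concatenated afterwards). The file names are the placeholders from the question.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;

public class CopyCsv {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();
        pipeline
            .apply(TextIO.read().from("BIG_CSV_FILE"))
            .apply(TextIO.write()
                         .to("BIG_OUTPUT")
                         .withSuffix("csv")); // no withNumShards(1): sharding is runner-determined
        pipeline.run().waitUntilFinish();
    }
}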
If there is any out-of-memory error or other error preventing processing, you should see a log message. Otherwise, it will be helpful to look at profiling and CPU utilization to determine whether processing is active.
This question has been indirectly answered by Kenn Knowles above. The direct runner is not intended for performance testing, only for prototyping and checking the well-formedness of the data. It doesn't have any mechanism for splitting and dividing work by partitions, and handles every dataset in memory. Performance testing should be carried out using other runners (like the Flink runner) - those provide data splitting and the type of infrastructure needed to deal with high IO bottlenecks.
UPDATE: adding to the point addressed by this question, there is a related question here: How to deal with (Apache Beam) high IO bottlenecks?
Whereas the question here revolves around figuring out whether the direct runner can deal with huge datasets (which we have already established is not possible), the link above points to a discussion of whether production runners (like flink/spark/cloud dataflow) can natively deal with huge datasets out of the box (the short answer is yes, but please check the link for a deeper discussion).

The fastest way to populate an In-Memory Data Grid (Hazelcast)

What is the fastest way to populate a Hazelcast data grid? Reading through the documentation I can see a couple of variants:
Use multithreading and IMap.set
Use multithreading and IMap.putAll
Use a Distributed Execution in order to start populating the grid from all participants.
My performance benchmark shows that IMap.putAll is faster than IMap.set. But it is stated in the Hazelcast documentation that IMap.putAll does not come with a guarantee that everything will be inserted atomically.
Can someone clarify what would be the fastest way to populate a data grid with data?
Is variant number 3 good?
I would see the same three options. Anyhow, as you mentioned, option two does not guarantee that everything is put into the map atomically, but if you just load data and wait for all threads to finish loading data using IMap::putAll, you should be fine.
Apart from that, IMap::set would be the alternative. In any case you want to multithread the loading process. I would play around a bit with different thread counts, and loading data from a client is normally recommended to keep the nodes free for storage operations (see the sketch below).
I personally never benchmarked your third option; anyhow it would be possible as well. I'm just not sure it is worth the additional work.
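A minimal sketch of the multithreaded client-side load with IMap.putAll in batches; the map name, key/value types and batch size are illustrative placeholders, and the IMap import path assumes a recent (4.x/5.x) client (older releases use com.hazelcast.core.IMap).
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GridLoader {

    // Load from a client with several threads, each issuing IMap.putAll for a fixed-size batch.
    public static void load(Map<Long, String> records, int threads, int batchSize) throws InterruptedException {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<Long, String> map = client.getMap("customers");

        List<Map.Entry<Long, String>> entries = new ArrayList<>(records.entrySet());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int from = 0; from < entries.size(); from += batchSize) {
            Map<Long, String> batch = new HashMap<>();
            entries.subList(from, Math.min(from + batchSize, entries.size()))
                   .forEach(e -> batch.put(e.getKey(), e.getValue()));
            pool.submit(() -> map.putAll(batch));   // one bulk put per batch, batches run in parallel
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);   // wait until all batches are loaded
        client.shutdown();
    }
}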
How much data do you want to load that you're concerned it could be slow? Do you already know that loading is slow? Do you use Java serialization (this is a huge performance killer)? Do you use indexes (those have to be generated while putting data)?
There are normally a lot of optimizations you can apply to speed up not only data loading but also normal operation.

Is DB computation worth parallelizing?

I use DB2 9.7.5, 64-bit. The server has enough memory but no clustering.
I need to make huge computations: compute several (roughly 20) ratios in my db. Some of them can take as long as 25 seconds.
The results are stored in a result table.
Now I have several solutions (As a policy, we exclude Stored Proc).
I call each ratio, one at a time from a java client OR
I call several ratios in a multi threaded java client.
My assumption is that it is useless to call them from a multi-threaded client since my db is the bottleneck. But I'm not wholly sure that the db engine really gives 100% of the CPU to 1 query. I think that the engine must probably be able to share its CPU power between several queries.
I am currently reading the IBM Data manual but would like to have your feedback.
Many thanks.
I need to make huge computations: compute several (roughly 20) ratios in my db. Some of them can take as long as 25 seconds.
25 seconds is not necessarily a bad thing; maybe it's a wonderful result, depending on what you compute.
Now I have several solutions (As a policy, we exclude Stored Proc).
Stored procedures are not evil; you just need to know how to use them safely.
My assumption is that it is useless to call them from a multi-threaded client since my db is the bottleneck. But I'm not wholly sure that the db engine really gives 100% of the CPU to 1 query. I think that the engine must probably be able to share its CPU power between several queries.
Multithreading in java never hurts (as long as you keep the threads safe), and it is especially useful in your case where you are doing a lot of calculations.
I don't use db2, so I don't know how good it is at multithreading, but if it is single-threaded I doubt that it will ever reach 100% CPU usage. You should check your db2 configuration files to tweak it a little bit.
Also read the article about IBM DB2 clustering.
I also suggest using a data warehouse tool to analyze your script performance against the db2.
Good luck
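To make the multi-threaded client idea concrete, here is a minimal sketch that submits the ~20 ratio queries to a fixed-size thread pool, one JDBC connection per task. The JDBC URL, credentials and the ratioQueries list are placeholders, and writing the results into the result table is left as a comment.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelRatios {

    // Run the ratio queries concurrently; the database decides how to schedule them.
    public static void run(List<String> ratioQueries, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String sql : ratioQueries) {
            pool.submit(() -> {
                try (Connection con = DriverManager.getConnection(
                             "jdbc:db2://dbhost:50000/MYDB", "user", "password");
                     Statement st = con.createStatement();
                     ResultSet rs = st.executeQuery(sql)) {   // one ratio per task
                    while (rs.next()) {
                        // insert the computed ratio into the result table here
                    }
                } catch (Exception e) {
                    e.printStackTrace();                      // a failing ratio should not stop the others
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
Whether this helps in practice depends on whether DB2 is actually CPU-bound on a single query; profiling, as suggested above, settles that.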
Take a look at Materialized Query Tables. If what you are working with is reporting, and especially if it doesn't require absolutely up-to-date information, you can set up MQTs containing the parts that are heavy to calculate, refreshed for instance hourly.

How to resolve plenty of concurrent write operations in Oracle?

I am maintaining a lottery website with millions of users. Some active users (perhaps more than 30,000) will buy more than 1000 lottery tickets within 1 second.
The current logic uses select ... for update to check the account balance, but meanwhile the database server is overloaded and very slow to respond. We have to process these purchases in real time.
Have anyone met the similar scene before?
First, you need to design a transactional system that satisfies your business rules. For the moment, forget about disk and memory, and what goes where. Try to design a system that is as lightweight as possible, that does the minimum required amount of locking, that satisfies your business rules.
Now, run the system, what happens? If performance is acceptable, congratulations, you're done.
If performance is not acceptable, avoid the temptation to guess at the problem and start making adjustments. You need to profile the system. You need to understand where the most time is being spent, so that you know which areas to focus your tuning efforts on.
The easiest way to do this is to trace it, using SQL_TRACE. You've not made any mention of Oracle edition, version, or platform, so I'll assume you're at least on some version of 10gR2. So, use DBMS_MONITOR to start/end traces. Now, scoping is important here. What I mean is, it's critically important that you start the trace, run the code that you want to profile, and then immediately shut off the trace. This way, you trace only what you're interested in, and the profile won't contain any extraneous information.
Once you have the trace file, you need to process it. There are several tools. The most common is TkProf, which is provided by Oracle but really doesn't do a very good job. The best free profiler that I'm aware of is OraSRP. Download a copy of OraSRP, and check your results. The data in the report should point you in the right direction.
Once you've done all that, if you still have questions, ask a new question here, and I'm sure we can help you interpret the output of OraSRP, to help you understand where your bottlenecks are.
Hope that helps.
Personally, I would lock/update the accounts in memory and update the database as a background task. Using this approach you can easily support thousands of updates and accounts.
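A minimal sketch of that in-memory approach, with hypothetical class and method names: balances live in a ConcurrentHashMap and are reserved with a compare-and-set loop, while a background task drains the queued changes to the database. Note the durability trade-off: changes still sitting in the queue are lost if the process crashes.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class InMemoryAccounts {

    private final Map<Long, AtomicLong> balances = new ConcurrentHashMap<>();
    private final LinkedBlockingQueue<long[]> pendingWrites = new LinkedBlockingQueue<>();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    public InMemoryAccounts() {
        // Persist accumulated balance changes in the background, e.g. once per second.
        flusher.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.SECONDS);
    }

    // Debit atomically in memory; refuse the purchase if the balance would go negative.
    public boolean buy(long accountId, long price) {
        AtomicLong balance = balances.computeIfAbsent(accountId, id -> new AtomicLong(0));
        while (true) {
            long current = balance.get();
            if (current < price) {
                return false;                              // not enough funds
            }
            if (balance.compareAndSet(current, current - price)) {
                pendingWrites.offer(new long[] {accountId, -price});
                return true;                               // reserved in memory, persisted later
            }
        }
    }

    private void flush() {
        long[] change;
        while ((change = pendingWrites.poll()) != null) {
            // writeToDatabase(change[0], change[1]);      // batch these updates in the real system
        }
    }
}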
A. Speed up things without modifying the code:
1 - You can keep the table entirely in memory (that is, in the SGA - it still exists on disk as well):
alter table t storage ( buffer_pool keep )
(discuss with your dba before doing this)
2 - If the table is too big and you update the same rows again and again, it is probably sufficient to use the cache attribute:
alter table t cache
This command puts the blocks of your table, when they are used, at the best-priority end of the LRU list, so there is less chance of them being aged out of the SGA.
Here is a discussion about the differences: Ask Tom
3 - Another, more advanced, solution that needs more analysis and resources is TimesTen.
B. Speed up your database operations:
Identify your top queries and:
create indexes where you update or select only one row or a small set of rows.
partition large tables scanned for only a segment of data.
Have you identified a top query?

Best way to store small, alternating, public data that updates every couple of hours?

The essence of my problem is that there are too many solutions, and I would like to find which one wins out in pros and cons before I build an infrastructure around it.
(Simplified for the purpose of this forum.) This is an auction site where five auctions are stored in ranks #1-5, #1 being the currently featured auction. The other four are simply "on deck." After either a couple of hours or the completion of that auction, #2-5 move up to #1-4 and a new one is chosen to be #5.
I'm using a dedicated server and I've been considering just storing the data in the servlet or maybe adding a column in the database as a boolean for each auction...like "isFeatured = 1"
Suffice it to say the data is read about 5 times+ more often than it is written, which is why I'm leaning towards good old SQL.
If you can retrieve the relevant auctions from the DB with a simple query using ORDER BY and TOP (or something similar), then try that first. If no performance issues occur, then KISS and you're done.
Otherwise, since these 5 auctions stay valid for a while, cache them in memory. Have a singleton holding these auctions and provide methods for updating them, for example; maybe you want to use a caching lib instead. Update the Top 5 whenever necessary, but serve them directly out of memory without hitting the DB or anything similarly expensive (see the sketch below).
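A minimal sketch of that singleton, assuming a hypothetical Auction entity and a loadTop5FromDatabase() method that wraps the simple ORDER BY query; the refresh interval is illustrative.
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public final class FeaturedAuctions {

    private static final FeaturedAuctions INSTANCE = new FeaturedAuctions();

    private final AtomicReference<List<Auction>> top5 = new AtomicReference<List<Auction>>(List.of());
    private final ScheduledExecutorService refresher = Executors.newSingleThreadScheduledExecutor();

    private FeaturedAuctions() {
        // Refresh the featured auctions from the database on a fixed schedule.
        refresher.scheduleAtFixedRate(
                () -> top5.set(loadTop5FromDatabase()),
                0, 10, TimeUnit.MINUTES);
    }

    public static FeaturedAuctions getInstance() {
        return INSTANCE;
    }

    // Served straight from memory; no DB hit per page view.
    public List<Auction> currentTop5() {
        return top5.get();
    }

    private List<Auction> loadTop5FromDatabase() {
        // run the simple ORDER BY query here and map the rows to Auction objects
        return List.of();
    }

    // Placeholder entity for the sketch.
    public static class Auction { }
}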
What kind of scale are you looking for? How many application servers need access to the data?
I think you're probably making this more complicated than it is. Just use a database, take a hit of ACID, and move onto whatever else you need to work on. :P
Have you taken a look at SQLite? It allows for "good old SQL" without all of the hassles of setting up a separate database server. As long as the data isn't too huge (to be fair, I haven't tested the size limits, but I've skimmed blog entries mentioning the use of SQLite to process files of several dozen MB in size quickly and with no problems), you should be fine.
It isn't a perfect solution for all needs (frankly, I sometimes find the dynamic typing to be a pain), but since it relies on locally stored files, reads will be much faster than firing up a network connection to talk to a more "traditional" RDBMS.
