I use DB2 9.7.5, 64-bit. The server has enough memory but no clustering.
I need to perform heavy computations: several (roughly 20) ratios in my DB, some of which can take as long as 25 seconds.
The results are stored in a result table.
Now I have several options (as a policy, we exclude stored procedures):
I call each ratio, one at a time, from a Java client, OR
I call several ratios concurrently from a multi-threaded Java client.
My assumption is that a multi-threaded client is useless since my DB is the bottleneck. But I'm not entirely sure that the DB engine really dedicates 100% of the CPU to a single query; I think the engine must be able to share its CPU power between several queries.
I am currently reading the IBM documentation but would like to have your feedback.
Many thanks.
I need to perform heavy computations: several (roughly 20) ratios in my DB, some of which can take as long as 25 seconds.
25 seconds is not necessarily a bad thing; it may even be a wonderful result, depending on what you compute.
Now I have several options (as a policy, we exclude stored procedures).
Stored procedures are not evil; you just need to know how to use them safely.
My assumption is that a multi-threaded client is useless since my DB is the bottleneck. But I'm not entirely sure that the DB engine really dedicates 100% of the CPU to a single query; I think the engine must be able to share its CPU power between several queries.
Multithreading in Java never hurts (as long as you keep the threads safe), and it is especially useful in your case, where you are doing a lot of calculations.
I don't use DB2, so I don't know how good it is at multithreading, but if it is single-threaded I doubt it will ever reach 100% CPU usage. You should check your DB2 configuration to tweak it a little bit.
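If you want to try the multi-threaded client, a minimal sketch along these lines might help; the JDBC URL, credentials, pool size and ratio SQL are placeholders, not taken from the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RatioRunner {

    public static void main(String[] args) {
        // Placeholder statements; in practice one SQL query per ratio.
        List<String> ratioQueries = Arrays.asList(
                "SELECT 1 FROM SYSIBM.SYSDUMMY1",   // ratio 1 (placeholder)
                "SELECT 2 FROM SYSIBM.SYSDUMMY1");  // ratio 2 (placeholder)

        // Try different pool sizes and measure; the DB may well handle a few queries in parallel.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String sql : ratioQueries) {
            pool.submit(() -> {
                try (Connection con = DriverManager.getConnection(
                             "jdbc:db2://dbhost:50000/MYDB", "user", "password"); // placeholder URL
                     Statement st = con.createStatement();
                     ResultSet rs = st.executeQuery(sql)) {
                    while (rs.next()) {
                        // read the ratio and write it to the result table
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}

Measure one thread against several: if throughput stops improving as you add threads, the database really is the bottleneck.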
Also, read up on IBM DB2 clustering.
I also suggest using a data warehousing tool to analyze how your scripts perform against DB2.
Good luck
Take a look at Materialized Query Tables (MQTs). If what you are working with is reporting, and especially if it doesn't require absolutely up-to-date information, you can set up MQTs that precompute the heavy parts, refreshed for instance hourly.
Related
I'm using Spark with the Java connector to process my data.
One of the essential operations I need to perform on the data is to count the number of records (rows) in a DataFrame.
I tried df.count(), but the execution time is extremely slow (30-40 seconds for 2-3M records).
Also, due to the system's requirements, I don't want to use the df.rdd().countApprox() API because we need the exact count.
Could somebody give me a suggestion of any alternatives that return exactly the same result as df.count() does, with faster execution time?
Highly appreciate your replies.
df.cache
df.count
It will be slow the first time, since the data is cached during the execution of the first count, but subsequent counts will give you good performance.
Leveraging df.cache depends on the use case.
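For reference, a minimal sketch of the cache-then-count idea using the Java API (the input path is a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CountExample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("count-example").getOrCreate();

        Dataset<Row> df = spark.read().parquet("path/to/data"); // placeholder input
        df.cache();                 // mark the DataFrame for caching
        long first = df.count();    // slow: materializes and caches the data
        long second = df.count();   // fast: served from the cache

        System.out.println(first + " " + second);
        spark.stop();
    }
}

This only pays off if you count (or otherwise reuse) the same DataFrame more than once.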
A simple way to check whether a DataFrame has rows is to do a Try(df.head). If it is a Success, there is at least one row in the DataFrame; if it is a Failure, the DataFrame is empty. Here's a Scala implementation of this.
Here is the reason why df.count() is a slow operation.
Count is very fast. You need to look at some of your other operations: the data loading and the transformations you do to generate the DataFrame you are counting. That is the part slowing you down, not the count itself.
If you can reduce the amount of data you load, or cut out any transformations that don't affect the count, you may be able to speed things up. If that's not an option, you may be able to write your transformations more efficiently. Without knowing your transformations, though, it's not possible to say what the bottleneck might be.
I just found out that loading the data into a Spark DataFrame for further queries and the count is unnecessary.
Instead, we can use the Aerospike client to do the job, and it's much faster than the above approach.
Here's a reference on how to use the Aerospike client:
http://www.aerospike.com/launchpad/query_multiple_filters.html
Thanks everyone
I am using a Java 8 parallel stream to insert data into a DB.
The following is the code:
customers.parallelStream().forEach(t -> {
    UserTransaction userTransaction = new UserTransactionImp();
    try {
        userTransaction.begin();
        userTransaction.setTransactionTimeout(300);
        // CODE to write data to the DB for each customer in a global transaction using Atomikos and Hibernate
        userTransaction.commit();
    } catch (Exception e) {
        try {
            userTransaction.rollback();
        } catch (Exception rollbackFailure) {
            // rollback() declares checked exceptions, so it needs its own handler inside the lambda
        }
    }
});
It takes more than 2 hours to complete the task. I did the same test running two different instances (two Java main methods), and the time taken came down to 1 hour. Is there any other way to scale up within one Java instance? I am using Atomikos and Hibernate for persistence. I have configured batching, insert ordering, and update ordering; everything is batched properly and working fine.
But I observed that CPU utilization does not go above 30% during this. Is there any way to utilize more processors and scale this up?
parallelStream() basically gives you a "default" implementation. I heard a guy once say: "whenever you use this construct, measure its effects".
In other words: when you are not happy with the default implementation, you might have to look into your own implementation, focused not on that single operation but on the "whole picture".
Example: what if you "batch" together 5, 10, or 50 "users" per "shot", meaning you reduce the number of transactions but allow more content to go into each?
Yes, that is a pretty generic answer - but this is a pretty generic question. We have absolutely no insights what your code is doing there - so nobody here can tell what would be the "perfect" way to reduce overall runtime.
Beyond that: you want to profile your whole setup. Maybe your problem is not the "java" part - but your database. Not enough memory, too much workload ... or network, or, or, or. In other words: first focus on gaining an understanding where your performance bottleneck truly exists.
(A good read about "performance" and bottlenecks: the old classic "Release It!" by Michael Nygard.)
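To make the batching idea concrete, here is a rough sketch against the code in the question; the batch size is an assumption to be measured, and Customer stands in for whatever type the customers list actually holds (the fragment also needs java.util.List and java.util.ArrayList imports):

int batchSize = 50; // assumption: tune by measuring
List<List<Customer>> batches = new ArrayList<>();
for (int i = 0; i < customers.size(); i += batchSize) {
    batches.add(customers.subList(i, Math.min(i + batchSize, customers.size())));
}

batches.parallelStream().forEach(batch -> {
    UserTransaction userTransaction = new UserTransactionImp();
    try {
        userTransaction.begin();
        userTransaction.setTransactionTimeout(300);
        for (Customer customer : batch) {
            // write each customer to the DB inside the same transaction
        }
        userTransaction.commit();
    } catch (Exception e) {
        try {
            userTransaction.rollback();
        } catch (Exception ignored) {
            // keep the original failure as the interesting one
        }
    }
});

Fewer, larger transactions usually cut commit overhead; it is also worth checking that the connection pool offers at least as many connections as there are worker threads, otherwise the extra threads just wait.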
What is the fastest way to populate a Hazelcast data grid? Reading through the documentation, I can see a couple of variants:
Use multithreading and IMap.set
Use multithreading and IMap.putAll
Use a Distributed Execution in order to start populating the grid from all participants.
My performance benchmark shows that IMap.putAll is faster than IMap.set. But it is stated in the Hazelcast documentation that IMap.putAll does not come with a guarantee that everything will be inserted atomically.
Can someone clarify a little bit what would be the fastest way to populate a data grid with data?
Is variant number 3 good?
I would see the same three options. Anyhow, as you mentioned, option two does not guarantee that everything is put into the map atomically, but if you just load data and wait for all threads to finish loading via IMap::putAll, you should be fine.
Apart from that, IMap::set would be the alternative. In any case, you want to multithread the loading process. I would play around a bit with different thread counts; also, loading data from a client is normally recommended, to keep the nodes free for storage operations.
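For illustration, a sketch of a multi-threaded, batched client-side load (assuming a Hazelcast 3.x client; the map name, key/value types, batch size and thread count are placeholders to tune):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GridLoader {

    public static void load(Map<Long, String> source) throws InterruptedException {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<Long, String> target = client.getMap("records"); // placeholder map name

        // Split the source data into batches so each putAll carries a reasonable chunk.
        int batchSize = 10_000;
        List<Map<Long, String>> batches = new ArrayList<>();
        Map<Long, String> current = new HashMap<>();
        for (Map.Entry<Long, String> e : source.entrySet()) {
            current.put(e.getKey(), e.getValue());
            if (current.size() == batchSize) {
                batches.add(current);
                current = new HashMap<>();
            }
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }

        // Push batches from several client threads in parallel.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (Map<Long, String> batch : batches) {
            pool.submit(() -> target.putAll(batch));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        client.shutdown();
    }
}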
I personally never benchmarked your third option, anyhow it would be possible as well. Just not sure it is worth the additional work.
How much data do you want to load that you're concerned it could be slow? Do you already know that loading is slow? Do you use Java Serialization (this is a huge performance killer)? Do you use indexes (those have to be generated while putting data)?
There are normally a lot of optimizations you can apply to speed up not only data loading but also normal operation.
The essence of my problem is that there are too many solutions, and I would like to find which one wins out in pros and cons before I build an infrastructure around it.
(Simplified for the purposes of this forum) This is an auction site where five auctions are stored in ranks #1-5, #1 being the currently featured auction. The other four are simply "on deck." After either a couple of hours or the completion of that auction, #2-5 move up to #1-4 and a new one is chosen to be #5.
I'm using a dedicated server, and I've been considering just storing the data in the servlet, or maybe adding a column to the database as a boolean for each auction, like "isFeatured = 1".
Suffice it to say the data is read about 5+ times more often than it is written, which is why I'm leaning towards good old SQL.
If you can retrieve the relevant auctions from the DB with a simple query using ORDER BY and TOP or something similar, then try that first. If no performance issues occur, then KISS and you're done.
Otherwise, since these 5 auctions stay valid for a while, cache them in memory. Have a singleton hold these auctions and provide methods for updating them, for example. Maybe you want to use a caching library instead. Update the top 5 whenever necessary, but serve them directly out of memory without hitting the DB or anything similarly expensive.
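A minimal sketch of that singleton idea, where Auction stands in for your own domain class and loading/refreshing is left to the caller:

import java.util.Collections;
import java.util.List;

public final class FeaturedAuctionCache {

    private static final FeaturedAuctionCache INSTANCE = new FeaturedAuctionCache();

    // volatile so readers always see the latest published list
    private volatile List<Auction> topFive = Collections.emptyList();

    private FeaturedAuctionCache() { }

    public static FeaturedAuctionCache getInstance() {
        return INSTANCE;
    }

    // Read path: served straight from memory, no DB hit.
    public List<Auction> getTopFive() {
        return topFive;
    }

    // Call when an auction finishes or the rotation interval elapses.
    public void refresh(List<Auction> freshTopFive) {
        this.topFive = Collections.unmodifiableList(freshTopFive);
    }
}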
What kind of scale are you looking for? How many application servers need access to the data?
I think you're probably making this more complicated than it is. Just use a database, take a hit of ACID, and move onto whatever else you need to work on. :P
Have you taken a look at SQLite? It allows for "good old SQL" without all of the hassles of setting up a separate database server. As long as the data isn't too huge (to be fair, I haven't tested the size limits, but I've skimmed blog entries mentioning the use of SQLite to process files of several dozen MB in size quickly and with no problems), you should be fine.
It isn't a perfect solution for all needs (frankly, I sometimes find the dynamic typing to be a pain), but since it relies on locally stored files, reads will be much faster than firing up a network connection to talk to a more "traditional" RDBMS.
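If you go the SQLite-from-Java route, a minimal sketch with the xerial sqlite-jdbc driver on the classpath might look like this (the file, table and column names are made up for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FeaturedAuctionsSqlite {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:sqlite:auctions.db");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT id, title FROM auctions ORDER BY featured_rank LIMIT 5")) {
            while (rs.next()) {
                System.out.println(rs.getInt("id") + " " + rs.getString("title"));
            }
        }
    }
}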
Alright. So I have a very large amount of binary data (let's say, 10GB) distributed over a bunch of files (let's say, 5000) of varying lengths.
I am writing a Java application to process this data, and I wish to institute a good design for the data access. Typically, what will happen is this:
One way or another, all the data will be read during the course of processing.
Each file is (typically) read sequentially, requiring only a few kilobytes at a time. However, it is often necessary to have, say, the first few kilobytes of each file simultaneously, or the middle few kilobytes of each file simultaneously, etc.
There are times when the application will want random access to a byte or two here and there.
Currently I am using the RandomAccessFile class to read into byte buffers (and ByteBuffers). My ultimate goal is to encapsulate the data access into some class such that it is fast and I never have to worry about it again. The basic functionality is that I will be asking it to read frames of data from specified files, and I wish to minimize the I/O operations given the considerations above.
Examples for typical access:
Give me the first 10 kilobytes of all my files!
Give me byte 0 through 999 of file F, then give me byte 1 through 1000, then give me 2 through 1001, etc, etc, ...
Give me a megabyte of data from file F starting at such and such byte!
Any suggestions for a good design?
Use Java NIO and MappedByteBuffers, and treat your files as a list of byte arrays. Then let the OS worry about the details of caching, reading, flushing, etc.
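A minimal sketch of that approach (the file name, offset and frame length are placeholders):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedFrameReader {

    public static byte[] readFrame(Path file, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map only the region we need; the OS page cache does the actual caching.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] frame = new byte[length];
            buffer.get(frame);
            return frame;
        }
    }

    public static void main(String[] args) throws IOException {
        // "Give me the first 10 kilobytes of file F", as in the question.
        byte[] head = readFrame(Paths.get("data/file-F.bin"), 0, 10 * 1024);
        System.out.println("Read " + head.length + " bytes");
    }
}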
#Will
Pretty good results. A quick comparison of reading a large binary file:
Test 1 - Basic sequential read with RandomAccessFile.
2656 ms
Test 2 - Basic sequential read with buffering.
47 ms
Test 3 - Basic sequential read with MappedByteBuffers and further frame buffering optimization.
16 ms
Wow. You are basically implementing a database from scratch. Is there any possibility of importing the data into an actual RDBMS and just using SQL?
If you do it yourself you will eventually want to implement some sort of caching mechanism, so the data you need comes out of RAM if it is there, and you are reading and writing the files in a lower layer.
Of course, this also entails a lot of complex transactional logic to make sure your data stays consistent.
I was going to suggest that you follow up on Eric's database idea and learn how databases manage their buffers—effectively implementing their own virtual memory management.
But as I thought about it more, I concluded that most operating systems already do a better job of implementing file system caching than you are likely to do without low-level access in Java.
There is one lesson from database buffer management that you might consider, though. Databases use an understanding of the query plan to optimize the management strategy.
In a relational database, it's often best to evict the most-recently-used block from the cache. For example, a "young" block holding a child record in a join won't be looked at again, while the block containing its parent record is still in use even though it's "older".
Operating system file caches, on the other hand, are optimized to reuse recently used data (and reading ahead of the most recently used data). If your application doesn't fit that pattern, it may be worth managing the cache yourself.
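If you do decide to manage a cache yourself, one possible starting point is a small block cache; the sketch below uses LRU eviction via LinkedHashMap, and the key format, block size and capacity are assumptions. Swap the eviction rule if your access pattern calls for something else (such as the MRU-style eviction described above).

import java.util.LinkedHashMap;
import java.util.Map;

public class BlockCache extends LinkedHashMap<String, byte[]> {

    private final int maxBlocks;

    public BlockCache(int maxBlocks) {
        super(16, 0.75f, true); // access-order: the eldest entry is the least recently used
        this.maxBlocks = maxBlocks;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxBlocks; // evict once capacity is exceeded
    }
}

Keyed by something like "file-F:block-42" and created with new BlockCache(256), it keeps the 256 most recently touched blocks in memory.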
You may want to take a look at an open source, simple object database called jdbm - it has a lot of this kind of thing developed, including ACID capabilities.
I've done a number of contributions to the project, and it would be worth a review of the source code if nothing else to see how we solved many of the same problems you might be working on.
Now, if your data files are not under your control (i.e. you are parsing text files generated by someone else, etc...) then the page-structured type of storage that jdbm uses may not be appropriate for you - but if all of these files are files that you are creating and working with, it may be worth a look.
#Eric
But my queries are going to be much, much simpler than anything I can do with SQL. And wouldn't a database access be much more expensive than a binary data read?
This is to answer the part about minimizing I/O traffic. On the Java side, all you can really do is wrap your streams in buffered streams (BufferedInputStream for binary data, BufferedReader for text). Aside from that, your operating system will handle other optimizations, like keeping recently read data in the page cache and doing read-ahead on files to speed up sequential reads. There's no point in doing additional buffering in Java (although you'll still need a byte buffer to return the data to the client).
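For example, a minimal sketch of buffered sequential reads (the file name and buffer sizes are placeholders):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedRead {

    public static void main(String[] args) throws IOException {
        // 64 KB of buffering on top of the raw stream; the OS page cache and read-ahead do the rest.
        try (InputStream in = new BufferedInputStream(new FileInputStream("data/file-F.bin"), 64 * 1024)) {
            byte[] frame = new byte[4096];
            int read = in.read(frame); // a few kilobytes at a time, as described in the question
            System.out.println("Read " + read + " bytes");
        }
    }
}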
I had someone recommend hadoop (http://hadoop.apache.org) to me just the other day. It looks like it could be pretty nice, and might have some marketplace traction.
I would step back and ask yourself why you are using files as your system of record, and what gains that gives you over using a database. A database certainly gives you the ability to structure your data. Given the SQL standard, it might be more maintainable in the long run.
On the other hand, your file data may not be structured so easily within the constraints of a database. The largest search company in the world :) doesn't use a database for their business processing. See here and here.