Apply Batch keyword to select statements

Is it possible to execute a batch of select statements using DSE Cassandra, or should I consider a design change?
The reason is that I have a lot of select queries I wish to execute against my DB cluster and I am not sure how to go about it. I have deleted all my secondary indexes, so I'm not using those anymore.

That won't work, and even if it did, it isn't advisable.
You won't receive the results in a form you can use: a batch does not return a result set.
Even if that worked, the batch would perform much worse than running the queries serially, because of the way Cassandra batching is implemented.
Batching only works well if the keys (write executions) are evenly distributed, and it is only worth it if you want all the updates applied as a transaction.
So, in summary, you should definitely consider a design change.
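One way to run many selects without a batch is to issue each one as its own statement and collect the results per key. The sketch below is only an illustration of that shape, not something the answer above prescribes: `selectOne` is a hypothetical stand-in for whatever single-row lookup your driver provides, and running the lookups concurrently on the common pool is an assumption of this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

class ConcurrentSelects {

    /**
     * Runs one independent lookup per key and returns the results in key order.
     * selectOne is a hypothetical placeholder for a single SELECT by key.
     */
    static <K, V> Map<K, V> selectAll(List<K> keys, Function<K, V> selectOne) {
        // Start every lookup; each one is its own query, not part of a batch.
        Map<K, CompletableFuture<V>> pending = new LinkedHashMap<>();
        for (K key : keys) {
            pending.put(key, CompletableFuture.supplyAsync(() -> selectOne.apply(key)));
        }
        // Wait for each lookup and keep its result next to its key.
        Map<K, V> results = new LinkedHashMap<>();
        pending.forEach((key, future) -> results.put(key, future.join()));
        return results;
    }
}
```

`supplyAsync` without an executor uses the common fork-join pool; for blocking database calls a dedicated executor is probably the better choice.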

Related

How to get data from Oracle table into java application concurrently

I have an Oracle table with ~10 million records that are not dependent on each other. An existing Java application executes the query and iterates through the returned Iterator, batching the records for further processing. The fetchSize is set to 250.
Is there any way to parallelize getting the data from the Oracle DB? One thing that comes to mind is to break down the query into chunks using "rowid" and then pass these chunks to separate threads.
I am wondering if there is some kind of standard approach in solving this issue.

A few approaches to achieve this:
Run alter session force parallel QUERY parallel 32; at the DB level, in PL/SQL code, just before executing the SELECT statement. You can adjust the value 32 depending on the number of nodes (RAC setup).
Use the ROWID-based approach you describe. The difficult part is how to return the chunks of the SELECT to Java and how to combine the results, so this approach is a bit harder.
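The chunking idea can be sketched in Java roughly as follows. This is a minimal sketch under assumptions: it splits a numeric id range instead of ROWID ranges, and `fetchChunk` is a hypothetical stand-in for a JDBC query over one range (for example, a SELECT with `id >= ? AND id < ?`). None of these names come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

class ChunkedReader {

    /** Splits [minId, maxId] into at most `chunks` contiguous half-open ranges {from, to}. */
    static List<long[]> split(long minId, long maxId, int chunks) {
        List<long[]> ranges = new ArrayList<>();
        long total = maxId - minId + 1;
        long size = (total + chunks - 1) / chunks; // ceiling division
        for (long from = minId; from <= maxId; from += size) {
            ranges.add(new long[] {from, Math.min(from + size, maxId + 1)});
        }
        return ranges;
    }

    /** Reads every chunk on its own thread and concatenates the results in range order. */
    static <T> List<T> readAll(long minId, long maxId, int threads,
                               BiFunction<Long, Long, List<T>> fetchChunk) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<T>>> futures = new ArrayList<>();
            for (long[] range : split(minId, maxId, threads)) {
                // fetchChunk is hypothetical: it would run one range-limited SELECT.
                futures.add(pool.submit(() -> fetchChunk.apply(range[0], range[1])));
            }
            List<T> all = new ArrayList<>();
            for (Future<List<T>> future : futures) {
                all.addAll(future.get()); // combine in submission order
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("chunk read failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Each thread should use its own connection. Combining the results is just concatenation here because the records are independent of each other.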

how strong consistency and eventual consistency work in datastore

I'm no expert in databases, so what I know about queries is that they are the way to read from or write to a database.
With eventual consistency, a read may return stale data.
On a write, the first data node is updated, but the other nodes need some time to be updated.
With strong consistency, a read is blocked until the data has been updated to its latest version (I'm really not sure about this, so please correct me if I got it wrong).
On a write, all read operations are blocked until the data node has been updated to its latest version.
So if I write data with eventual consistency and then use an ancestor query to get that data, will I get the latest version?
If I use an ancestor query to update, will all eventually consistent reads get the latest version?
Update:
I think transactions exist so that if there are multiple modification requests for the same data, one succeeds and the others fail. After that, the modified data takes some time to be replicated to all datacenters, so a successful transaction does not mean that all read queries return the latest version (correct me if I'm wrong).

If you use what you call an "ancestor query", you're working in a transaction: either the transaction terminates successfully, in which case all subsequent reads will get the values as updated by the transaction, or else the transaction fails, in which case none of the changes made by the transaction will be seen (this all-or-nothing property is often referred to as a transaction being "atomic"). In particular, you do get strong consistency this way, not just eventual consistency.
The cost can be large, in terms of performance and scalability. In particular, an application should not update an entity group (any and all entities descending from a common ancestor) more than once a second, which can be a very constraining limit for a highly scalable application.
The online docs include a large variety of tips, tricks and advice on how to deal with this -- you could start at https://cloud.google.com/datastore/docs/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore/ and continue with the "additional resources" this article lists at the end.
One simple idea that often suffices is that (differently from queries) getting a specific entity from its key is strongly consistent without needing transactions, and memcache is also strongly consistent; writing a modified entity gives you its new key, so you can stash that key into memcache and have other parts of your code fetch the modified entity from that key, rather than relying on queries. This has limits, of course, because memcache doesn't give you unbounded space -- but it's a useful idea to keep in mind, nevertheless, in many practical cases.

With GAE, the only way to be consistent is to use a transaction. Inside a transaction you can update and then query the latest update, but it's slower.
For me, ancestors just compose the primary key, and that's all.

SQL Joins vs Java code?

I have a query like this
Select Folder.name FROM FolderTable,ValidFolder, ValidFolderGroup, ValidUser,
ValidLocation, ValidDepartment where ValidUser.LocationCode *= ValidLocation.LocationCode
and ValidUser.DepartmentCode *= ValidDepartment.DepartmentCode and Folder.IssueUser =
ValidUser.UserId and ValidFolder.FolderType = Folder.FolderType and
ValidFolderGroup.FolderGroupCode = ValidFolder.FolderGroupCode and
ValidFolderGroup.GroupTypeCode = 13 and (ValidUser.UserId='User' OR
ValidUser.ManagerId='User') and ValidFolderGroup.GroupTypeCode = 13 and
Folder.IssueUser = 'User'
All the tables whose names start with Valid are cache tables, so they already contain data.
If someone is using jOOQ or Hibernate, which would be the better option:
Use the query as written above, with all the joins?
Or use Java code to fulfill the requirement instead of the joins, since with Hibernate or jOOQ the application already has Java classes for the tables and the Valid tables' data is already loaded?

Okay, you're probably not going to like this answer, but the best way to do this is not to keep Valid "cached".
The best solution in my opinion would be to use jOOQ (if you prefer DSL) or Hibernate (if you prefer OR mapping) and query the Database every time, and consistently use the DAO pattern.
The jOOQ and Hibernate guys are almost certainly better at SQL than you are. We've used jOOQ and Hibernate in really large enterprise projects, and they both perform exceptionally. Particularly with a good connection pool like BoneCP. If after you've got that setup running, and running well, but still think you may have performance issues, you can always add a cache (like EhCache) afterwards.
Ultimately tho', I'm making a lot of assumptions about your software, namely that
There are more people than you working on it, and
It has to be maintained.
If neither of these assumptions is true, then you can safely disregard this answer.

General answer:
Modern databases are incredibly good at optimising your query and choosing the best possible execution plan for you. Given your outer join notation using *=, you're most likely using SQL Server (or Sybase), so that's a pretty good database.
Even if you already have much of the "Valid" data in your application memory, chances are that your database also already has the same data in a buffer cache and thus the database doesn't need to hit the disk again for the various joins in your query.
In fact, depending on the nature of your data, the database might even assess that some of your joins are unneeded (if you have the right meta data, like constraints).
Specific answer:
In your particular case, it looks as though you can indeed strip most of your query yourself and query only the Folder table using search criteria from your application's "Valid" cache. I'm saying that it looks like it, because I don't fully understand the business logic behind those joins and whether they're all modelling 1:1 relationships, or whether removing them will change the semantics of the query.
So, technically, it's possible that you can remove the joins, but if you want to stay on the safe side, just keep things as they are as you migrate to jOOQ or Hibernate.
Alternative 3:
Of course, instead of tampering with this query, you might even be able to remove this query and fetch the Folder.name property already in your previous queries when you load the "Valid" content into memory.
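The "specific answer" above could look something like this sketch: derive the search criteria from the in-memory "Valid" data, then query only the Folder table. Everything here is an assumption for illustration: the two maps stand in for the cached ValidFolder and ValidFolderGroup tables, the column types are guessed, and, as the answer warns, dropping the joins may change the semantics of the original query.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class FolderQueryBuilder {

    /**
     * validFolder: FolderType -> FolderGroupCode (hypothetical shape of the ValidFolder cache).
     * validFolderGroup: FolderGroupCode -> GroupTypeCode (hypothetical shape of the ValidFolderGroup cache).
     * Returns the folder types whose group has the given group type code.
     */
    static List<String> folderTypesForGroupType(Map<String, String> validFolder,
                                                Map<String, Integer> validFolderGroup,
                                                int groupTypeCode) {
        return validFolder.entrySet().stream()
                .filter(e -> Integer.valueOf(groupTypeCode).equals(validFolderGroup.get(e.getValue())))
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    /** Builds a join-free query with one placeholder for the user and one per folder type. */
    static String buildSql(int folderTypeCount) {
        if (folderTypeCount < 1) {
            throw new IllegalArgumentException("need at least one folder type");
        }
        String placeholders = String.join(", ", Collections.nCopies(folderTypeCount, "?"));
        return "SELECT name FROM Folder WHERE IssueUser = ? AND FolderType IN (" + placeholders + ")";
    }
}
```

The user id and folder types are still bound as parameters, so nothing from the cache is concatenated into the SQL text.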

Ever heard of views? Look into them, you'll be amazed.
Apart from that, it's impossible to say what you should do, there's no "best" and you provide way too little information to even make an educated guess about your specific requirements.
But, I'd not hard code things like database IDs in a query that ends up inside any program, far too prone to cause problems in the (near) future.

Multi threaded insert using ORM?

I have one application where "persisting to database" is consuming 85% time of the entire application flow.
I was thinking of using multiple threads to do the insert because inserts are mostly independent here. Is there any way to achieve multi threaded insert using any of JPA implementation ? Or is it worth doing the mutli threaded insert, from improving the performance perspective ?
Note: Inserts are in the range of 10K to 100K records in a single run. Also performance is very very critical here.
Thanks.

Multi-threading insert statements on a database won't really make it perform any faster, because in most databases the table requires a lock for an insert. So your threads will just be waiting for the one before to finish and unlock the table before the next can insert, which really doesn't make it any more multi-threaded than a single thread. If you were to do it, it would most likely slow things down.
If you are inserting 10k-100k records, you should consider using either batch insert statements or the bulk insert commands that are native to the database you're using. The fastest way would be the native bulk insert commands, but that would require you to bypass JPA and work directly with JDBC calls for the inserts you want to run as bulk commands.
If you don't want to play around with native bulk commands, I recommend Spring's JdbcTemplate, which has templated batch insert commands. It is very fast: I use it to batch insert 10k-20k entities every 30 seconds on a high-transaction system and I am very pleased with the performance.
Lastly, make sure your database tables are optimized with the correct indexes, keys and options. Since your database is the bottleneck, this should be one of the first places you look to increase performance.
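A batch insert over plain JDBC might look roughly like this. It's a minimal sketch, not the JdbcTemplate API: the `person` table, its `name` column and the `BatchInserter` class are hypothetical, and whether the driver really sends each batch in one round trip depends on the database.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class BatchInserter {

    /** Inserts all names, sending them to the database in groups of batchSize. */
    static void insertNames(Connection conn, List<String> names, int batchSize) throws SQLException {
        // Hypothetical table and column; one parametrized statement is reused for every row.
        String sql = "INSERT INTO person (name) VALUES (?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (String name : names) {
                ps.setString(1, name);
                ps.addBatch();              // queue the row on the client side
                if (++pending == batchSize) {
                    ps.executeBatch();      // send the queued rows
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();          // send the remainder
            }
        }
    }
}
```

Turning auto-commit off and committing once per run (or per batch) usually matters as much as the batching itself.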

"Multi-threading insert statements on database won't really make it perform any faster because in most databases the table requires a lock for an insert. So your threads will just be waiting for the one before it to finish up and unlock the table before the next can insert - which really doesn't make it any more multi-threaded than with a single thread. If you were to do it, it would most likely slow it down."
Are you saying concurrent inserts from different db connections on the same table require exclusive locks to complete? I tested this on Oracle, and I didn't find this to be the case. Do you actually have a test case to back up what you wrote here?
Anyway, bulk insert is of course a lot faster than one insert at a time.

Are you periodically flushing your session when doing this? If not, you can hit nasty slowdowns that have nothing to do with the database. Generally, you want to "batch" the inserts by periodically calling flush() and then clear() on your session (assuming you are using some variant of JPA).
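The flush-then-clear loop looks roughly like this. To keep the sketch free of dependencies, `Session` is a hypothetical stand-in that declares only the three methods used here; in a real JPA application you would call the same three methods on your EntityManager. The batch size is something to tune, not a recommendation.

```java
import java.util.List;

class PeriodicFlush {

    /** Hypothetical stand-in for the three EntityManager methods this loop needs. */
    interface Session {
        void persist(Object entity);
        void flush();
        void clear();
    }

    /** Persists all entities, flushing and clearing the session every batchSize entities. */
    static void persistAll(Session session, List<?> entities, int batchSize) {
        int count = 0;
        for (Object entity : entities) {
            session.persist(entity);
            if (++count % batchSize == 0) {
                session.flush(); // push pending inserts to the database
                session.clear(); // detach them so the persistence context stays small
            }
        }
        // Flush whatever is left (harmless when nothing is pending).
        session.flush();
        session.clear();
    }
}
```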

This article has many tips to improve batch writing performance with JPA. I'll quote the two that should give you the best result for fast reference.
Optimization #6 - Sequence Pre-allocation
We have optimized the first part of the application, reading from the MySQL database. The second part is to optimize the writing to Oracle.
The biggest issue with the writing process is that the Id generation is using an allocation size of 1. This means that for every insert there will be an update and a select for the next sequence number. This is a major issue, as it is effectively doubling the amount of database access. By default JPA uses a pre-allocation size of 50 for TABLE and SEQUENCE Id generation, and 1 for IDENTITY Id generation (a very good reason to never use IDENTITY Id generation). But frequently applications are unnecessarily paranoid of holes in their Id values and set the pre-allocation value to 1. By changing the pre-allocation size from 1 to 500, we reduce about 1000 database accesses per page.
Optimization #8 - Batch Writing
Many databases provide an optimization that allows a batch of write operations to be performed as a single database access. There is both parametrized and dynamic batch writing. For parametrized batch writing a single parametrized SQL statement can be executed with a batch of parameter values instead of a single set of parameter values. This is very optimal as the SQL only needs to be executed once, and all of the data can be passed optimally to the database.
Dynamic batch writing requires dynamic (non-parametrized) SQL that is batched into a single big statement and sent to the database all at once. The database then needs to process this huge string and execute each statement. This requires the database do a lot of work parsing the statement, so is not always optimal. It does reduce the database access, so if the database is remote or poorly connected with the application, this can result in an improvement.
In general parametrized batch writing is much more optimal, and on Oracle it provides a huge benefit, whereas dynamic does not. JDBC defines the API for batch writing, but not all JDBC drivers support it, some support the API but then execute the statements one by one, so it is important to test that your database supports the optimization before using it. In EclipseLink batch writing is enabled using the persistence unit property "eclipselink.jdbc.batch-writing"="JDBC".
Another important aspect of using batch writing is that you must have the same SQL (DML actually) statement being executed in a grouped fashion in a single transaction. Some JPA providers do not order their DML, so you can end up ping-ponging between two statements such as the order insert and the order-line insert, making batch writing ineffective. Fortunately EclipseLink orders and groups its DML, so usage of batch writing reduces the database access from 500 order inserts and 5000 order-line inserts to 55 (default batch size is 100). We could increase the batch size using "eclipselink.jdbc.batch-writing.size", so increasing the batch size to 1000 reduces the database accesses to 6 per page.
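The figures of 55 and 6 in the quote can be checked with a few lines of arithmetic. This sketch assumes that each distinct statement (order insert, order-line insert) is batched separately and that every batch costs one database access; the class and method names are made up for the example.

```java
class BatchMath {

    /** Database accesses needed when each statement type is sent in batches of batchSize. */
    static long databaseAccesses(long batchSize, long... statementCounts) {
        long accesses = 0;
        for (long count : statementCounts) {
            accesses += (count + batchSize - 1) / batchSize; // ceiling division per statement type
        }
        return accesses;
    }

    public static void main(String[] args) {
        // 500 order inserts and 5000 order-line inserts, as in the quoted article.
        System.out.println(databaseAccesses(100, 500, 5000));  // prints 55 (5 + 50)
        System.out.println(databaseAccesses(1000, 500, 5000)); // prints 6 (1 + 5)
    }
}
```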

Is "LIKE ?" More efficient than LIKE '%'||?||'%'

Recently one of my colleagues made a comment that I should not use
LIKE '%'||?||'%'
rather use
LIKE ?
in the SQL and then replace the LIKE ? marker with LIKE '%'||?||'%' before I execute the SQL. He made the point that with a single parameter marker DB2 database will cache the statement always and thus cut down on the SQL prepare time.
However, I am not sure whether this is accurate. To me it should be the other way around, since we do more processing by running a string replace on the SQL every time the query is executed.
Does anyone know if a single marker really speeds up execution? Just FYI - I am using Spring 2.5 JDBC framework and the DB2 version is 9.2.
My question is - does DB2 treat "LIKE ?" differently from "LIKE '%'||?||'%'" as far as caching and preparation goes.

'LIKE ?' uses a parameter marker, so it is run as a PreparedStatement. Prepared statements are an optimization at the JDBC driver level. The thinking is that databases analyze queries to decide how to process them most efficiently. The DB can then cache the resulting query plan, keyed on the full statement text. Reusing an identical statement reuses the query plan. So basically, if you are running the same query multiple times with different comparison strings, and the query plan stays cached, then yes, using 'LIKE ?' will be faster.
Some useful (though somewhat dated) info on PreparedStatements:
Prepared Statements
More Prepared Statements
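Since the plan cache is keyed on the statement text, what matters is that the SQL string stays the same from call to call. One way to keep a plain 'LIKE ?' without rewriting the SQL at run time is to put the wildcards into the bound value instead. This is a sketch of that alternative, not something suggested in the question; the `customer` table and `name` column are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class LikeSearch {

    // Constant statement text: the same string is prepared on every call.
    // Table and column names are hypothetical.
    static final String SQL = "SELECT name FROM customer WHERE name LIKE ?";

    /** Wraps the search term in wildcards; any % or _ inside the term still acts as a wildcard. */
    static String containsPattern(String term) {
        return "%" + term + "%";
    }

    static List<String> findByName(Connection conn, String term) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            ps.setString(1, containsPattern(term)); // wildcards travel with the parameter
            try (ResultSet rs = ps.executeQuery()) {
                List<String> names = new ArrayList<>();
                while (rs.next()) {
                    names.add(rs.getString(1));
                }
                return names;
            }
        }
    }
}
```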

I haven't done too much DB2, not since the 90's, and I'm not really sure I understand what your underlying question is. Way back then I got a phone call from the head of the DBA team. "What are you doing different than every other programmer we've got!??" Mind you, this was early in my career, so tentatively I answered, "Nothing....", imagine it in kind of a whiny voice. "Well then, why do your queries take 50% of the CPU resources of any of the other guys???" I took a quick poll of all the other guys and found I was the only one using prepared statements. Now, under the covers Spring automatically makes prepared statements, and they've improved statement caching in the database a lot over the years, but if you make use of them properly, you can get the speedup there, AND it'll make the statement cache swap things out less often. It really depends on your use case: if you're only going to hit the query once, there would be no difference; if it's a few thousand times, obviously it would make a much greater difference.

in the SQL and then replace the LIKE ? marker with LIKE '%'||?||'%' before I execute the SQL. He made the point that with a single parameter marker DB2 database will cache the statement always and thus cut down on the SQL prepare time.
Unless DB2 is some sort of weird alien SQL database, or its driver does some crazy things, the database server will never see your prepared statement until you actually execute it. So you can swap clauses in and out of the PreparedStatement all day long, and it will have no effect until you actually send it to the server when you execute it.
