How to resolve java.sql.SQLException distributed transaction waiting for lock


We are using Oracle 11g with JDK 1.8.
Our application uses XAConnection and XAResource for DB transactions, i.e. distributed transactions.
On a few occasions we need to kill our Java process to stop the application.
If we restart the application after killing it, we get the exception below when running a DB transaction:
java.sql.SQLException: ORA-02049: timeout: distributed transaction waiting for lock
After this we are unable to use the application for a few hours, until the lock is released.
Can someone suggest a solution so that we can continue working instead of waiting for the lock to be released?
I have tried the following:
a) Fetched the SID and killed the session using the ALTER command. The table lock is still not released after this.
I am dealing with a very small amount of data.

I followed a similar topic that has tips about what to do with distributed connections.
The connection remains open until you end your local session or until the number of database links for your session exceeds the value of OPEN_LINKS. To reduce the network overhead associated with keeping a database link open, use the CLOSE DATABASE LINK clause of ALTER SESSION to close the link explicitly if you do not plan to use it again in your session.
I believe that if you close your connections and sessions after DDL execution, this issue should not happen.
Another possibility is given in this question:
One possible way might be to increase the INIT.ORA parameter for distributed_lock_timeout to a larger value. This would then give you a longer time to observe the v$lock table as the locks would last for longer.
To achieve automation of this, you can either
- Run an SQL job every 5-10 seconds that logs the values of v$lock or the query that sandos has given above into a table and then analyze it to see which session was causing the lock.
- Run a STATSPACK or an AWR Report. The sessions that got locked should show up with high elapsed time and hence can be identified.
v$session has 3 more columns, blocking_instance, blocking_session and blocking_session_status, that can be added to the query above to give a picture of what is getting locked.
I hope I helped you, my friend.

Related

Postgres RDS database DB connections increasing infinitely on Saturdays causing "JDBCConnectionException" in Spring Boot Java API app

UPDATE Added Read/Write Throughput, IOPS, and Queue-Depth graphs metrics and marked graph at time-position where errors I speak of started
NOTE: Hi, just looking for suggestions of what could possibly be causing this issue from experienced DBA or database developers (or anyone that would have knowledge for that matter). Some of the logs/data I have are sensitive, so I cannot repost here but I did my best to provide screen shots and data from my debugging so it would allow people to help me. Thank you.
Hello, I have a Postgres RDS database (version 12.7 engine) that is hosted on Amazon (AWS). This database is "hit" or called by an API client (Spring Boot/Web/Hibernate/JPA Java API) thousands of times per hour. The API executes only one Hibernate SQL query on the backend, against a Postgres view across 5 tables. The DB instance (class = db.m5.2xlarge) specs are:
8 vCPU
32 GB RAM
Provisioned IOPS SSD Storage Type
800 GiB Storage
15000 Provisioned IOPS
The issue I am seeing is that on Saturdays I wake up to many logged JDBCConnectionExceptions, and I noticed that my API Docker containers (defined as Service-Tasks on AWS Elastic Container Service, ECS) start failing and return an HTTP 503 error, e.g.
org.springframework.dao.DataAccessResourceFailureException: Unable to acquire JDBC Connection; nested exception is org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC Connection
Upon checking AWS RDS DB status, I can see also the sessions/connections increase dramatically, as seen in image below with ~600 connections. It will keep increasing, seeming to not stop.
Upon checking the postgres database pg_locks and pg_stat_activity tables when I started getting all these JDBCConnectionExceptions and the DB Connections jumped to around ~400 (at this specific time), I did indeed see many of my API queries logged with interesting statuses. I exported the data to CSV and have included an excerpt below:
wait_event_type  wait_event        state   times logged in pg_stat_activity  query
---------------  ----------------  ------  --------------------------------  -----
IO               DataFileRead      active  480                               SELECT * ... FROM ... query from API on postgres View
IO               DataFileRead      idle    13                                SELECT * ... FROM ... query from API on postgres View
IO               DataFilePreFetch  active  57                                SELECT * ... FROM ... query from API on postgres View
IO               DataFilePreFetch  idle    2                                 SELECT * ... FROM ... query from API on postgres View
Client           ClientRead        idle    196                               SELECT * ... FROM ... query from API on postgres View
Client           ClientRead        active  10                                SELECT * ... FROM ... query from API on postgres View
LWLock           BufferIO          idle    1                                 SELECT * ... FROM ... query from API on postgres View
LWLock           BufferIO          active  7                                 SELECT * ... FROM ... query from API on postgres View
If I look at my pg_stat_activity table when my API and DB are running and stable, the majority of the rows from the API query are simply in Client ClientRead idle status, so I feel something is wrong here.
You can see the "performance metrics" on the DB at the time this happened (i.e. roughly 19:55 UTC or 2:55PM CST) below: DataFileRead and DataFilePrefetch are astronomically high and keep increasing, which backs up the pg_stat_activity data I posted above. Also, as I stated above, during normal DB use when it is stable, the API queries are simply in Client ClientRead idle status in the pg_stat_activity table, so the numerous DataFileRead/Prefetch/IO waits and ExclusiveLocks confuse me.
I don't expect anyone to debug this for me, though I would appreciate it if a DBA or someone who has experienced something similar could help me narrow down the issue. I honestly wasn't sure if it was an API query taking too long (wouldn't make sense, because the API has been running stable for years), something running on the Postgres DB without my knowledge on Saturday (I really think something like this is going on), or a bad PostgreSQL query coming into the DB that LOCKS UP the resources and causes a deadlock (doesn't completely make sense to me, as I read Postgres resolves deadlocks on its own). Also, as I stated before, all the API calls that make an SQL query on the backend are just doing SELECT ... FROM ... on a Postgres VIEW, and from what I understand, you can do concurrent SELECTs with ExclusiveLocks, so.....
Would take any advice here or suggestions for possible causes of this issue
Read-Throughput (first JdbcConnectionException occurred around 2:58PM CST or 14:58, so I marked the graph where READ throughput starts to drop since the DB queries are timing out and API containers are failing)
Write-Throughput (API only READS so I'm assuming spikes here are for writing to Replica RDS to keep in-sync)
Total IOPS (IOPS gradually increasing from morning i.e. 8AM, but that is expected as API calls were increasing, but these total counts of API calls match other days when there are 0 issues so doesn't really point to cause of this issue)
Queue-Depth (you can see where I marked graph and where it spikes is exactly around 14:58 or 2:58PM where first JdbcConnectionExceptions start occurring, API queries start timing out, and DB connections start to increase exponentially)
EBS IO Balance (burst balance basically dropped to 0 at this time as well)
Performance Insights (DataFileRead, DataFilePrefetch, buffer_io, etc)

This just looks like your app server is getting more and more demanding and the database can't keep up. Most of the rest of your observations are just a natural consequence of that. Why it is happening is probably best investigated from the app server, not from the database server. Either it is making more and more requests, or each one takes more IO to fulfill. (You could maybe fix this on the database by making it more efficient, like adding a missing index, but that would require you to share the query and/or its execution plan.)
It looks like your app server is configured to maintain 200 connections at all times, even if almost all of them are idle. So, that is what it does.
And that is what ClientRead wait_event is, it is just sitting there idle trying to read the next request from the client but is not getting any. There are probably a handful of other connections which are actively receiving and processing requests, doing all the real work but occupying a small fraction of pg_stat_activity. All of those extra idle connections aren't doing any good. But they probably aren't doing any real harm either, other than making pg_stat_activity look untidy, and confusing you.
But once the app server starts generating requests faster than they can be serviced, the in-flight requests start piling up, and the app server is configured to keep adding more and more connections. But you can't bully the disk drives into delivering more throughput just by opening more connections (at least not once you have met a certain threshold where it is fully saturated). So the more active connections you have, the more they have to divide the same amount of IO between them, and the slower each one gets. Having these 700 extra connections all waiting isn't going to make the data arrive faster. Having more connections isn't doing any good, and is probably doing some harm as it creates contention and dealing with contention is itself a resource drain.
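One way to picture the point above is to cap the number of in-flight queries on the application side and fail fast once the cap is reached, instead of opening more connections. The sketch below is only an illustration: the class name BoundedQueryGate, the slot count and the wait time are hypothetical and not taken from the question's setup. In practice the maximum size setting of whatever connection pool the app uses plays the same role.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: limits how many queries may run at once.
final class BoundedQueryGate {
    private final Semaphore slots;
    private final long maxWaitMillis;

    BoundedQueryGate(int maxConcurrentQueries, long maxWaitMillis) {
        // Fair semaphore so waiting callers are served in arrival order.
        this.slots = new Semaphore(maxConcurrentQueries, true);
        this.maxWaitMillis = maxWaitMillis;
    }

    // Runs the query if a slot frees up in time, otherwise fails fast
    // instead of piling one more request onto a saturated database.
    <T> T run(Callable<T> query) throws Exception {
        if (!slots.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException(
                "database busy: no query slot within " + maxWaitMillis + " ms");
        }
        try {
            return query.call();
        } finally {
            slots.release();
        }
    }

    int availableSlots() {
        return slots.availablePermits();
    }
}
```

A caller would wrap each database call, for example gate.run(() -> repository.findSomething()), where repository is a placeholder, and map the IllegalStateException to an HTTP 503 right away rather than letting requests queue up behind a saturated disk.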
The ExclusiveLocks you mention are probably the locks each active session has on its own transaction ID. They wouldn't be a cause of problems, just an indication you have a lot of active sessions.
The BufferIO is what you get when two sessions want the exact same data at the same time. One asks for the data (DataFileRead) and the other asks to be notified when the first one is done (BufferIO).

Some things to investigate:
Query performance can degrade over time. The amount of data being requested can increase, especially for queries with date predicates. In Performance Insights you can see how many blocks are read (disk IO) versus hit (served from the buffer cache); you want as many hits as possible. The loss of burst balance is a real indicator that this is what is happening. It's not an issue during the week, as you have fewer requests.
The amount of shared buffers you have to service these queries. On RDS the default is roughly 25% of RAM; you could tweak this to be higher, some say 40%. It's a dark art, and you are unlikely to find an answer other than by tweaking and testing.
Vacuuming and analyzing your tables. Data comes from somewhere, right? With updates, deletes and inserts, tables grow and fill up with garbage. At a certain point the autovacuum processes aren't enough at their default settings. You can tweak them to be more aggressive, fire a manual vacuum at night, etc.
Index management, same as above.
See the PostgreSQL documentation on autovacuum and on resource consumption.

Based on what you've shared I would guess your connections are not being properly closed.

How to SET LOCK MODE in java application

I am working on a Java web application that uses Weblogic to connect to an Informix database. In the application we have multiple threads creating records in a table.
It happens pretty often that it fails and the following error is thrown:
java.sql.SQLException: Could not do a physical-order read to fetch next row....
Caused by: java.sql.SQLException: ISAM error: record is locked.
I am assuming that both threads are trying to insert or update when the record is locked.
I did some research and found that there is an option to tell the database to wait for the lock to be released instead of throwing an error:
SET LOCK MODE TO WAIT;
SET LOCK MODE TO WAIT 17;
I don't think that there is an option in JDBC to use this setting. How do I go about using this setting in my java web app?

You can always just send that SQL straight up, using createStatement(), and then send that exact SQL.
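A minimal sketch of that suggestion, assuming a plain java.sql.Connection to Informix is already available. The class and method names are hypothetical, and so is the convention that a negative value means "wait forever" and zero means "do not wait"; only the SET LOCK MODE statements themselves are Informix SQL.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper for sending SET LOCK MODE over JDBC.
final class InformixLockMode {

    // Builds the statement text. Assumed convention for this sketch:
    // negative = wait forever, 0 = do not wait, positive = wait that many seconds.
    static String lockModeSql(int waitSeconds) {
        if (waitSeconds < 0) {
            return "SET LOCK MODE TO WAIT";
        }
        if (waitSeconds == 0) {
            return "SET LOCK MODE TO NOT WAIT";
        }
        return "SET LOCK MODE TO WAIT " + waitSeconds;
    }

    // Sends the statement on the given connection. The setting applies to
    // that session only, so it has to be run once per physical connection.
    static void apply(Connection conn, int waitSeconds) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(lockModeSql(waitSeconds));
        }
    }
}
```

Calling InformixLockMode.apply(conn, 17) right after obtaining a connection corresponds to running SET LOCK MODE TO WAIT 17 by hand.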
The more 'normal' / modern approach to this problem is a combination of MVCC, the transaction level 'SERIALIZABLE', retry, and random backoff.
I have no idea if Informix is anywhere near that advanced, though. Modern DBs such as Postgres are (mysql does not count as modern for the purposes of MVCC/serializable/retry/backoff, and transactional safety).
Doing MVCC/Serializable/Retry/Backoff in raw JDBC is very complicated; use a library such as JDBI or JOOQ.
MVCC: A mechanism whereby transactions are shallow clones of the underlying data. 2 separate transactions can both read and write to the same records in the same table without getting in each other's way. Things aren't 'saved' until you commit the transaction.
SERIALIZABLE: A transaction level (also called isolation level), settable on a java.sql.Connection with conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE); - the safest level. If you know how version control systems work: You're asking the database to aggressively rebase everything so that the entire chain of commits is ordered into a single long line of events: Each transaction acts as if it was done after the previous transaction was completed. The simplest way to implement this level is to globally lock all the things. This is, of course, very detrimental to multithread performance. In practice, good DB engines (such as postgres) are smarter than that: Multiple threads can simultaneously run transactions without just being frozen and waiting for locks; the DB engine instead checks whether the things that the transaction did (not just writing, also reading) are conflict-free with simultaneous transactions. If yes, it's all allowed. If not, all but one simultaneous transaction throw a retry exception. This is the only level that lets you do this sequence of events safely:
1. Fetch the balance of isaace's bank account.
2. Fetch the balance of rzwitserloot's bank account.
3. Subtract €10,- from isaace's number, failing if the balance is insufficient.
4. Add €10,- to rzwitserloot's number.
5. Write isaace's new balance to the db.
6. Write rzwitserloot's new balance to the db.
7. Commit the transaction.
Any level less than SERIALIZABLE will silently fail the job; if multiple threads do the above simultaneously, no SQLExceptions occur but the sum of the balance of isaace and rzwitserloot will change over time (money is lost or created – in between steps 1 & 2 vs. step 5/6/7, another thread sets new balances, but these new balances are lost due to the update in 5/6/7). With serializable, that cannot happen.
RETRY: The way smart DBs solve the problem is by failing (with a 'retry' error) all but one transaction. At commit time the DB checks whether any SELECT done by the transaction is affected by transactions that have been committed to the db after this transaction was opened. If so (some selects would have gone differently), the transaction fails. The point of this error is to tell the code that ran the transaction to just.. start from the top and do it again. Most likely this time there won't be a conflict and it will work. The assumption is that conflicts CAN occur but usually do not occur, so it is better to assume 'fair weather' (no locks, just do your stuff), check afterwards, and try again in the exotic scenario that it conflicted, vs. trying to lock rows and tables. Note that for example ethernet works the same way (assume fair weather, recover errors afterwards).
BACKOFF: One problem with retry is that computers are too consistent: If 2 threads get in the way of each other, they can both fail, both try again, just to fail again, forever. The solution is that the threads twiddle their thumbs for a random amount of time, to guarantee that at some point, one of the two conflicting retriers 'wins'.
In other words, if you want to do it 'right' (see the bank account example), but also relatively 'fast' (not globally locking), get a DB that can do this, and use JDBI or JOOQ; otherwise, you'd have to write code to run all DB stuff in a lambda block, catch the SQLException, check the SqlState to see if it is indicating that you should retry (sqlstate codes are DB-engine specific), and if yes, rerun that lambda, after waiting an exponentially increasing amount of time that also includes a random factor. That's fairly complicated, which is why I strongly advise you rely on JOOQ or JDBI to take care of this for you.
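To give an idea of what that hand-written version involves, here is a rough sketch of the retry loop with exponential backoff and a random factor. It is not what JOOQ or JDBI do internally. The class and method names are hypothetical, and treating SQLSTATE 40001 as "retry" is an assumption that matches PostgreSQL's serialization failure code; other engines may report conflicts differently.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: reruns a whole transaction when the DB asks for a retry.
final class SerializableRetry {

    // Assumption: 40001 (serialization failure) means "run it again".
    static boolean isRetryable(SQLException e) {
        return "40001".equals(e.getSQLState());
    }

    // Random delay between 0 and min(cap, base * 2^attempt) milliseconds.
    static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        long ceiling = Math.min(capMillis, baseMillis << Math.min(attempt, 20));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    // The callable must contain the entire transaction (begin, work,
    // commit, and rollback on failure), because every retry starts over
    // from the top.
    static <T> T run(Callable<T> transaction, int maxAttempts,
                     long baseMillis, long capMillis) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return transaction.call();
            } catch (SQLException e) {
                boolean lastAttempt = attempt + 1 >= maxAttempts;
                if (!isRetryable(e) || lastAttempt) {
                    throw e;
                }
                Thread.sleep(backoffMillis(attempt, baseMillis, capMillis));
            }
        }
    }
}
```

The random factor is what keeps two conflicting threads from retrying in lockstep; the cap keeps the wait bounded when many attempts fail in a row.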
If you aren't ready for that level of DB usage, just make a statement and send "SET LOCK MODE TO WAIT 17;" as an SQL statement straight up, right after opening any connection. If you're using a connection pool there is usually a place where you can configure SQL statements to be run on connection start.

The Informix JDBC driver lets you set the lock wait mode automatically when you connect to the server.
Simply pass the following parameter via the DataSource or the connection URL:
IFX_LOCK_MODE_WAIT=17
The values for JDBC are
(-1) wait forever
(0) do not wait (default)
(> 0) wait this many seconds
See https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.jdbc.doc/ids_jdbc_040.htm

Connection conn = DriverManager.getConnection(
    "jdbc:informix-sqli://cleo:1550:IFXHOST=cleo;PORTNO=1550;user=rdtest;password=my_passwd;IFX_LOCK_MODE_WAIT=17");

What happens when you do not close an HBase table?

I am considering creating a HBase table when my application starts up and leaving it open as long as my application is running. My application may run indefinitely.
What happens if I never close the HBase table?
Is there a maximum time the connection can be open/idle before it needs to be reinitialized?
How is the connection closed if the system crashed?
I have HBase The Definitive Guide but I have not found the information I am looking for in there. If there are any online references for this then please provide them.

This was extracted from "HBase in Action" page 25:
"Closing the table when you’re finished with it allows the underlying
connection resources to be returned to the pool."

This blog post is about timeouts in HBase. Generally speaking, there are a lot of them:
ZK session timeout (zookeeper.session.timeout)
RPC timeout (hbase.rpc.timeout)
RecoverableZookeeper retry count and retry wait (zookeeper.recovery.retry, zookeeper.recovery.retry.intervalmill)
Client retry count and wait (hbase.client.retries.number, hbase.client.pause)
You may try to raise them a bit and set a really high value for retry count. This can make your sessions be alive for a very long period of time.
When the system running the HBase client crashes, the connection is closed by timeout.

DB2 JDBC Driver (Type 4) hangs on Execute()

I am executing a series of SQL statements using a JDBC connection on a DB2 server. On the last execute() of the simple SQL statement DELETE FROM MYTABLE, the thread hangs for a long period of time even if the table simply contains a single record.
The application server I am using is WAS. I wonder if this is an issue specific to WAS and DB2 combination as the same code works on other environments.
Does anybody have any idea what is going on here?

Have you issued the command directly from the CLP? It could be another problem, such as:
Transaction log problem: There are a lot of rows to delete, and this takes time. Also, if the transaction logs have reached their limit, the database does not roll back but waits for log space to be freed by other transactions.
Lock problem (concurrency): some of the rows you are trying to delete are locked by other transactions, and the application has to wait for them to be released (lock wait).
Also, try to do frequent commits.

Deleting rows in a database can be expensive work: don't forget that the database server logs all the deleted data in case of a ROLLBACK. So I assume the problem is coming from the database, especially if the table has many rows.
Have you tried to run manually all the SQL requests yourself in an interactive environment?

Thread dump showing Runnable state, but its hung for quite a long time

We are facing an unusual problem in our application: during the last month it reached an unrecoverable state, and it only recovered after an application restart.
Background: our application makes a DB query to fetch some information, and this database is hosted on a separate node.
Problematic case: when the thread dump was analyzed, we saw that all the threads were in the RUNNABLE state fetching data from the database, but they had not finished even after 20 minutes.
After the application restart all threads recovered, as expected, and the CPU usage was also normal.
Below is the thread dump
"ThreadPool:2:47" prio=3 tid=0x0000000007334000 nid=0x5f runnable [0xfffffd7fe9f54000]
   java.lang.Thread.State: RUNNABLE
    at oracle.jdbc.driver.T2CStatement.t2cParseExecuteDescribe(Native Method)
    at oracle.jdbc.driver.T2CPreparedStatement.executeForDescribe(T2CPreparedStatement.java:518)
    at oracle.jdbc.driver.T2CPreparedStatement.executeForRows(T2CPreparedStatement.java:764)
    at ora
All threads in the same state.
Questions:
what could be the reason for this state?
how to recover under this case ?

It's probably waiting for network data from the database server. Java threads waiting (blocked) on I/O are described by the JVM as being in the state RUNNABLE even though from the program's point of view they're blocked.
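This behaviour can be reproduced without a database. The sketch below is an illustration only, with hypothetical class and method names: it opens a loopback socket pair, lets a thread block in read() on a peer that never writes, and then samples the state the JVM reports for that thread.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo: what Thread.getState() says about a thread stuck in socket I/O.
final class BlockedReadStateDemo {

    static Thread.State stateWhileBlockedOnRead()
            throws IOException, InterruptedException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {

            Thread reader = new Thread(() -> {
                try {
                    // Blocks here: the accepted side never sends anything.
                    client.getInputStream().read();
                } catch (IOException ignored) {
                    // Thrown when the socket is closed at the end of the demo.
                }
            });
            reader.setDaemon(true);
            reader.start();

            // Give the reader time to enter the blocking read.
            Thread.sleep(500);
            return reader.getState();
        }
    }
}
```

On a HotSpot JVM the returned state is RUNNABLE even though the thread is making no progress, which is the same picture as the thread dump in the question.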

As others have already mentioned, threads inside native methods are always reported as RUNNABLE, as the JVM doesn't know or care what they are doing.
The Oracle drivers on the client side have no socket timeout by default. This means that if you have network issues, the client's low-level socket may get stuck there forever, resulting in a maxed-out connection pool. You could also check the network traffic towards the Oracle server to see whether it transmits any data at all.
When using the thin client, you can set oracle.jdbc.ReadTimeout, but I don't know how to do that for the thick (oci) client you use, I'm not familiar with it.
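For the thin driver, a minimal sketch of passing that property at connection time could look like this. The class name, the helper method and the 60 second value are hypothetical; the property value is in milliseconds, and this does not apply to the OCI driver used in the question.

```java
import java.util.Properties;

// Hypothetical helper that prepares connection properties for the Oracle thin driver.
final class OracleThinTimeouts {

    static Properties connectionProperties(String user, String password,
                                           int readTimeoutMillis) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // Socket read timeout: a read that gets no data for this long fails
        // with an exception instead of blocking forever.
        props.setProperty("oracle.jdbc.ReadTimeout",
                          Integer.toString(readTimeoutMillis));
        return props;
    }
}
```

The result would then be passed as the second argument of DriverManager.getConnection(url, props), with url being a thin-driver URL for your own database.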
What to do? Research how you can specify a read timeout for the thick ojdbc driver, and watch for exceptions related to connection timeouts, which will clearly signal network issues. If you can change the source, you can wrap the calls and retry the session when you catch timeout-related SQLExceptions.
To quickly address the issue, terminate the connection on the Oracle server manually.
It is also worth checking for session contention: maybe a query is blocking these sessions. If you find one, you'll see which database object causes the problem.

Does your code handle transactions manually? If so, maybe some of the code didn't commit() after changing data. Or maybe someone ran a data modification query directly through PL/SQL or something similar and didn't commit, which leaves all the read operations hanging.
When you experienced that "hang" and the DB had recovered from it, did you check whether some of the data had been rolled back? I am asking because you said "It was recovered post application restart." This happens when the JDBC driver changed data but didn't commit and a timeout occurred: the DB operation is then rolled back (this can differ based on the configuration, though).

Native methods remain always in RUNNABLE state (ok, unless you change the state from the native method, itself, but this doesn't count).
The method can be blocked on IO, any other event waiting or just long cpu intense task... or endless loop.
You can make your own pick.
how to recover under this case ?
drop the connection from oracle.

Is the system or the JVM hanging?
If it is configurable and possible, reduce the number of threads / parallel connections.
The threads simply waste CPU cycles while waiting for IO.
Yes, your CPU is unfortunately kept busy by the threads that are awaiting a response from the DB.
