I'm running a Spark job on Dataproc which reads lots of files from a bucket and consolidates them into one big file. I'm using google-api-services-storage 1.29.0 by shading it. Until now it worked fine, consolidating ~20-30K files. Today I tried it with about 5 times as many files and suddenly I'm getting a deadlock (at least I think I am, because it seems that all my executors are waiting indefinitely).
This is the thread dump:
org.conscrypt.NativeCrypto.SSL_read(Native Method)
org.conscrypt.NativeSsl.read(NativeSsl.java:416)
org.conscrypt.ConscryptFileDescriptorSocket$SSLInputStream.read(ConscryptFileDescriptorSocket.java:547) => holding Monitor(java.lang.Object#1638155334})
java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
java.io.BufferedInputStream.read(BufferedInputStream.java:345) => holding Monitor(java.io.BufferedInputStream#1513035694})
sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587) => holding Monitor(sun.net.www.protocol.https.DelegateHttpsURLConnection#995846771})
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492) => holding Monitor(sun.net.www.protocol.https.DelegateHttpsURLConnection#995846771})
java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:347)
com.shaded.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
com.shaded.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:105)
com.shaded.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
com.shaded.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
com.shaded.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
com.shaded.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeMedia(AbstractGoogleClientRequest.java:380)
com.shaded.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:6189)
com.shaded.google.cloud.storage.spi.v1.HttpStorageRpc.load(HttpStorageRpc.java:584)
com.shaded.google.cloud.storage.StorageImpl$16.call(StorageImpl.java:464)
com.shaded.google.cloud.storage.StorageImpl$16.call(StorageImpl.java:461)
com.shaded.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:89)
com.shaded.google.cloud.RetryHelper.run(RetryHelper.java:74)
com.shaded.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:51)
com.shaded.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:461)
com.shaded.google.cloud.storage.Blob.getContent(Blob.java:455)
my.package.with.my.StorageAPI.readFetchedLocation(StorageAPI.java:71)
...
Eventually I have to kill the job because nothing happens.
Any idea what is causing it? I tried using both a ThreadLocal<Storage> and a single Storage instance in my code; it doesn't seem to make a difference.
The job was actually NOT deadlocked; it's just that the Spark UI, for some reason, didn't show any task progress until the stage finished. I thought nothing was happening, but if I take thread dumps repeatedly I can see that it is doing stuff.
As tix suggested in a comment, it's probably advisable to implement exponential backoff when using the Storage library directly, and to retry whenever I get a StorageException that isRetryable().
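A minimal sketch of what such a retry loop could look like (the attempt count and delays are illustrative assumptions, not what my job actually uses):

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.StorageException;

class BackoffReader {
    // Hypothetical helper: read a blob's content, retrying retryable
    // errors with exponential backoff.
    static byte[] readWithBackoff(Blob blob) throws InterruptedException {
        int maxAttempts = 5;   // assumption: give up after 5 attempts
        long delayMs = 1000L;  // assumption: start with a 1-second delay
        for (int attempt = 1; ; attempt++) {
            try {
                return blob.getContent();
            } catch (StorageException e) {
                if (!e.isRetryable() || attempt == maxAttempts) {
                    throw e;   // non-retryable error, or out of attempts
                }
                Thread.sleep(delayMs);
                delayMs *= 2;  // exponential backoff
            }
        }
    }
}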
Related
I've been running an Elasticsearch cluster in my production environment for months now.
The cluster contains 2 nodes, both Windows Server 2019 servers.
Sometimes a random node of this cluster suddenly gets stuck until I restart the Elasticsearch service, which can't be done by simply restarting the Windows service; I have to kill the process to be able to restart it afterwards.
When I look at thread contention via the Elasticsearch hot threads API, I get this:
0.0% (0s out of 500ms) cpu usage by thread 'threadDeathWatcher-2-1'
10/10 snapshots sharing following 4 elements
java.lang.Thread.sleep(Native Method)
io.netty.util.ThreadDeathWatcher$Watcher.run(ThreadDeathWatcher.java:152)
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
java.lang.Thread.run(Thread.java:748)
0.0% (0s out of 500ms) cpu usage by thread 'DestroyJavaVM'
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][refresh][T#1]'
10/10 snapshots sharing following 27 elements
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:251)
org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:910)
org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:632)
org.elasticsearch.action.admin.indices.refresh.TransportShardRefreshAction.shardOperationOnReplica(TransportShardRefreshAction.java:65)
org.elasticsearch.action.admin.indices.refresh.TransportShardRefreshAction.shardOperationOnReplica(TransportShardRefreshAction.java:38)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:494)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:467)
org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:147)
org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationLock(IndexShard.java:1673)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:566)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:451)
org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:441)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][refresh][T#2]'
10/10 snapshots sharing following 39 elements
sun.nio.fs.WindowsNativeDispatcher.DeleteFile0(Native Method)
sun.nio.fs.WindowsNativeDispatcher.DeleteFile(WindowsNativeDispatcher.java:114)
sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:249)
sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
java.nio.file.Files.delete(Files.java:1126)
org.apache.lucene.store.FSDirectory.privateDeleteFile(FSDirectory.java:373)
org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:335)
org.apache.lucene.store.FilterDirectory.deleteFile(FilterDirectory.java:62)
org.apache.lucene.store.FilterDirectory.deleteFile(FilterDirectory.java:62)
org.elasticsearch.index.store.Store$StoreDirectory.deleteFile(Store.java:700)
org.elasticsearch.index.store.Store$StoreDirectory.deleteFile(Store.java:705)
org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38)
org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:723)
org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:717)
org.apache.lucene.index.IndexFileDeleter.deleteNewFiles(IndexFileDeleter.java:693)
org.apache.lucene.index.IndexWriter.deleteNewFiles(IndexWriter.java:4965)
org.apache.lucene.index.DocumentsWriter$DeleteNewFilesEvent.process(DocumentsWriter.java:771)
org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5043)
org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5034)
org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:477)
org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:266)
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:256)
org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:104)
org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:140)
org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:156)
org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:910)
org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:632)
org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:690)
org.elasticsearch.index.IndexService.access$400(IndexService.java:92)
org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:832)
org.elasticsearch.index.IndexService$BaseAsyncTask.run(IndexService.java:743)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][flush][T#4334]'
10/10 snapshots sharing following 16 elements
org.apache.lucene.index.IndexWriter.setLiveCommitData(IndexWriter.java:3116)
org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:1562)
org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1063)
org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:780)
org.elasticsearch.indices.flush.SyncedFlushService.performPreSyncedFlush(SyncedFlushService.java:414)
org.elasticsearch.indices.flush.SyncedFlushService.access$1000(SyncedFlushService.java:70)
org.elasticsearch.indices.flush.SyncedFlushService$PreSyncedFlushTransportHandler.messageReceived(SyncedFlushService.java:696)
org.elasticsearch.indices.flush.SyncedFlushService$PreSyncedFlushTransportHandler.messageReceived(SyncedFlushService.java:692)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
It seems that a file delete is locking (deadlocking?) Elasticsearch threads. I'm not deleting any index in production, so I guess it's an internal Elasticsearch/Lucene process: when the replica node synchronizes with the master node, it should delete Lucene segments that no longer exist, or something like that.
I tried speaking with the Elastic development team, but in their opinion being stuck on a file delete is an environment issue rather than an Elastic issue, which is understandable, actually.
I stopped the antivirus and backup processes on these servers, but I'm still getting these locks at least once a month.
How can Java's internal "DeleteFile" hang without returning any error? It just hangs forever, and the server didn't seem to be under pressure at the time.
If anyone has encountered this kind of issue, or has an idea to help me investigate, that would be awesome.
Thanks!
Looks like others are experiencing this:
https://discuss.elastic.co/t/massive-queue-in-refresh-thread-pool-on-a-single-node-causing-timeouts/280732/4
Have you looked at the Windows Event Viewer application logs to see if any Windows process gives any insight?
Looks like it is trying to remove old index files.
org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:632)
I want to read a CSV file locally using the Flink API, with the following code:
String csvPath = "data/weather.csv";
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
List<Tuple2<String, Double>> csv = env.readCsvFile(csvPath)
        .types(String.class, Double.class)
        .collect();
I tried files of different sizes (from 800 MB to 6 GB). Sometimes the operation completes successfully and sometimes it does not, failing with the following timeout exception:
Exception in thread "main" java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.ready(package.scala:169)
at org.apache.flink.runtime.minicluster.FlinkMiniCluster.shutdown(FlinkMiniCluster.scala:439)
at org.apache.flink.runtime.minicluster.FlinkMiniCluster.stop(FlinkMiniCluster.scala:408)
at org.apache.flink.client.LocalExecutor.stop(LocalExecutor.java:127)
at org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:195)
at org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:91)
at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:923)
at org.apache.flink.api.java.DataSet.collect(DataSet.java:410)
at org.apache.flink.simpleCSV.run(simpleCSV.java:83)
How can I fix this problem? Can I increase this timeout programmatically, or should I put a config file somewhere? Is there a specific heap size that I should set based on the file size?
collect() transfers the data from the cluster to the local client. This only works for very small data sets (< 10 MB).
If you have larger data sets, you need to process them on the cluster and emit the results through an output format, e.g., write them to a file, as in the sketch below.
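A minimal sketch (the output path is only an illustration) that writes the parsed tuples to a file instead of collecting them on the client:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.readCsvFile("data/weather.csv")
   .types(String.class, Double.class)
   .writeAsCsv("file:///tmp/weather-out");  // illustrative output path
env.execute("read weather csv");            // sinks only run when execute() is called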
If you are debugging this program, you can set a breakpoint at the constructor of org.apache.flink.api.java.LocalEnvironment (the constructor with config) and run the following command to change the timeout to 200 seconds (Alt+F8 in IntelliJ IDEA):
config.setString("akka.ask.timeout", "200 s")
To find the LocalEnvironment class in IntelliJ IDEA, press Ctrl+N, check "Include non-project classes" in the pop-up window, then type "LocalEnvironment" in the edit box.
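If you would rather not rely on the debugger, a minimal sketch (assuming you construct the local environment yourself rather than letting the default getExecutionEnvironment() path create it) is to pass a Configuration carrying the larger timeout:

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

Configuration config = new Configuration();
config.setString("akka.ask.timeout", "200 s");  // same key as above
ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(config);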
Simple job: kafka->flatmap->reduce->map.
The job runs OK with the default value for taskmanager.heap.mb (512 MB). According to the docs, this value should be as large as possible. Since the machine in question has 96 GB of RAM, I set it to 75000 (an arbitrary value).
Starting the job gives this error:
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$5.apply$mcV$sp(JobManager.scala:563)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$5.apply(JobManager.scala:509)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$5.apply(JobManager.scala:509)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (Source: Custom Source (1/1)) # (unassigned) - [SCHEDULED] > with groupID < 95b239d1777b2baf728645df9a1c4232 > in sharing group < SlotSharingGroup [772c9ff1cf0b6cb3a361e3352f75fcee, d4f856f13654f424d7c49d0f00f6ecca, 81bb8c4310faefe32f97ebd6baa4c04f, 95b239d1777b2baf728645df9a1c4232] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:255)
at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:686)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
... 8 more
Restoring the default value (512) for this parameter makes the job run OK again. At 5000 it works; at 10000 it doesn't.
What did I miss?
Edit: This is more hit-and-miss than I thought. Setting the value to 50000 and resubmitting gives success. In every test, the cluster is stopped and restarted.
What you are probably experiencing is submitting a job before the workers have registered at the master.
A 5GB JVM heap is initialized fast, and the TaskManager can register almost immediately. For a 70GB heap, the JVM takes a while to initialize and boot. Consequently, the worker registers later, and the job cannot be executed when you submit it, due to a lack of workers.
That is also the reason why it works once you re-submit the job.
JVMs are initialized faster, if you start the cluster in "streaming" mode (standalone via start-cluster-streaming.sh), because then at least Flink's internal memory is initialized lazily.
I have a Quartz job that executes a stored procedure in my MySQL database once every 5 minutes, and for some reason, 1 out of 3 executions fails and gives this weird exception. I have searched and searched for what this exception means, but I could not find a solution. Here is the full stack trace:
java.sql.SQLException: Could not retrieve transation read-only status server
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1078)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:989)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:975)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:920)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:951)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:941)
at com.mysql.jdbc.ConnectionImpl.isReadOnly(ConnectionImpl.java:3939)
at com.mysql.jdbc.ConnectionImpl.isReadOnly(ConnectionImpl.java:3910)
at com.mysql.jdbc.PreparedStatement.checkReadOnlySafeStatement(PreparedStatement.java:1258)
at com.mysql.jdbc.CallableStatement.checkReadOnlySafeStatement(CallableStatement.java:2656)
at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1278)
at com.mysql.jdbc.CallableStatement.execute(CallableStatement.java:920)
at com.mchange.v2.c3p0.impl.NewProxyCallableStatement.execute(NewProxyCallableStatement.java:3044)
at org.deadmandungeons.website.tasks.RankUpdateTask.execute(RankUpdateTask.java:30)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 1,198,219 milliseconds ago. The last packet sent successfully to the server was 950,420 milliseconds ago.
at sun.reflect.GeneratedConstructorAccessor43.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1121)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3673)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3562)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4113)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2570)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2731)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2812)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2761)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1612)
at com.mysql.jdbc.ConnectionImpl.isReadOnly(ConnectionImpl.java:3933)
... 9 more
Caused by: java.net.SocketException: Connection timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:150)
at java.net.SocketInputStream.read(SocketInputStream.java:121)
at com.mysql.jdbc.util.ReadAheadInputStream.fill(ReadAheadInputStream.java:114)
at com.mysql.jdbc.util.ReadAheadInputStream.readFromUnderlyingStreamIfNecessary(ReadAheadInputStream.java:161)
at com.mysql.jdbc.util.ReadAheadInputStream.read(ReadAheadInputStream.java:189)
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3116)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3573)
... 17 more
So I figured it is timing out because it thinks the MySQL server is in read-only status?
This only happens for this quartz job, and not any other time when I communicate with the database. This execution is of course happening in another thread, but I don't think that would have anything to do with it.
Why would it think the server was in read-only mode?
Also, I don't think "transation" is a word, so there's that...
Sorry for posting on an old thread.
As the stack trace says:
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
This implies that the link between JDBC and the DB is broken. Per your observation, 1 out of 3 job invocations fails.
You have these jobs scheduled every 5 minutes, and per the trace the last packet successfully sent to the server was ~15 minutes earlier.
Hence I suspect either:
Your procedure is not returning (it is waiting on something), or
The JDBC connection has been invalidated by a firewall/proxy.
It would be interesting to see how the connections are managed; per the logs, I see you are using c3p0.
You can try setting unreturnedConnectionTimeout and debugUnreturnedConnectionStackTraces. This will give you more insight into connection leaks or DB calls that take a long time.
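With a ComboPooledDataSource that could look roughly like this (the 5-minute timeout is just an illustrative value; tune it to how long your procedure is allowed to run):

import com.mchange.v2.c3p0.ComboPooledDataSource;

ComboPooledDataSource ds = new ComboPooledDataSource();
// If a connection stays checked out longer than this many seconds, c3p0
// destroys it and logs the stack trace of the code that checked it out.
ds.setUnreturnedConnectionTimeout(300);        // assumption: 5 minutes
ds.setDebugUnreturnedConnectionStackTraces(true);

If the firewall/proxy theory holds, c3p0's connection-testing options (for example testConnectionOnCheckout together with preferredTestQuery) can also help, since stale connections then get replaced before they are handed to the Quartz job.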
Research leads nowhere, as you guys said, but the error shows what seems to be a database being populated by two applications at the same time.
Do you have admin privileges on this MySQL server? If you do, you should try running
FLUSH TABLES WITH READ LOCK;
SET GLOBAL READ_ONLY=ON;
as a test to reproduce the error. Just to warn you, this command makes your database unwritable, so you will not be able to add data to it until you revert this configuration, obviously with
SET GLOBAL READ_ONLY=0;
UNLOCK TABLES;
If the result of this test is positive (the same error is reproduced), you should try isolating the applications that store data in your database, to find out which one is conflicting with Quartz.
I'm sorry for being vague, but I hope it gives you some help...
I am experiencing some strange behaviour in my Java server application whereby database operations that usually take a few milliseconds are sporadically taking much longer (30s - 170s) to complete. This isn't isolated to a specific query as I've seen the delays occurring for both SQL update and select statements. Also, all of my select statements use the NOLOCK option so I've ruled out possible lock contention.
The last time I saw a delay I managed to capture the following stack trace from JConsole; the update in question typically takes 5ms to complete but this stack trace was accessible for at least 10 - 20 seconds. The trace suggests to me that the statement has been executed but there is some delay in retrieving the result although I could be wrong? Obviously as this was an update statement the only result I'd expect would be the row count (i.e. not a large result set of data).
I saw a "transport level error" in SQL Server Management Studio at around the time of the delay.
One suggestion I've had is that these problems are due to SQL Server resources being exhausted. Has anyone seen anything similar? Can anyone shed any light on this problem?
Thanks in advance.
Stack Trace:
Name: MessageRouterImplThread-2
State: RUNNABLE
Total blocked: 0 Total waited: 224
Stack trace:
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.read(SocketInputStream.java:129)
com.microsoft.util.UtilSocketDataProvider.getArrayOfBytes(Unknown Source)
com.microsoft.util.UtilBufferedDataProvider.cacheNextBlock(Unknown Source)
com.microsoft.util.UtilBufferedDataProvider.getArrayOfBytes(Unknown Source)
com.microsoft.jdbc.sqlserver.SQLServerDepacketizingDataProvider.signalStartOfPacket(Unknown Source)
com.microsoft.util.UtilDepacketizingDataProvider.getByte(Unknown Source)
com.microsoft.util.UtilByteOrderedDataReader.readInt8(Unknown Source)
com.microsoft.jdbc.sqlserver.tds.TDSRequest.getTokenType(Unknown Source)
com.microsoft.jdbc.sqlserver.tds.TDSRequest.processReply(Unknown Source)
com.microsoft.jdbc.sqlserver.SQLServerImplStatement.getNextResultType(Unknown Source)
com.microsoft.jdbc.base.BaseStatement.commonTransitionToState(Unknown Source)
com.microsoft.jdbc.base.BaseStatement.postImplExecute(Unknown Source)
com.microsoft.jdbc.base.BasePreparedStatement.postImplExecute(Unknown Source)
com.microsoft.jdbc.base.BaseStatement.commonExecute(Unknown Source)
com.microsoft.jdbc.base.BaseStatement.executeUpdateInternal(Unknown Source)
com.microsoft.jdbc.base.BasePreparedStatement.executeUpdate(Unknown Source)
- locked com.microsoft.jdbc.sqlserver.SQLServerConnection#c4b83f
org.apache.commons.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:101)
org.springframework.jdbc.core.JdbcTemplate$2.doInPreparedStatement(JdbcTemplate.java:798)
org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:591)
org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:792)
org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:850)
org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:858)
org.springframework.jdbc.core.simple.SimpleJdbcTemplate.update(SimpleJdbcTemplate.java:237)
"...whereby database operations that
usually take a few milliseconds are
sporadically taking much longer (30s -
170s) to complete."
What you are describing sounds like an incorrectly cached query plan, due to out-of-date statistics (and/or indexes that need rebuilding), or incorrect parameter sniffing. The timeout could be occurring because the server is taking longer than the default connection timeout.
I would talk to your DBA and first get statistics updated, and if that doesn't work get the indexes of the tables involved in the query rebuilt.
Run this on your database (with the usual caveat about not running it in production without talking to your admin/DBA, and run at your own risk, etc.):
EXEC sp_updatestats
EXEC sp_refreshview
EXEC sp_msForEachTable 'EXEC sp_recompile ''?'''
Alternatively, you mention time of day being a factor. Could it be that a backup or scheduled job is occurring at that time?
Update: You could kick off a profiler trace: MS SQL Server 2008 - How Can I Log and Find the Most Expensive Queries? But don't restrict it to your DB. Such a trace, as long as it is started from SSMS as per that post, is relatively low impact (3-5% ish).
The "transport level error" seems to indicate connectivity problems. Is the database on a separate machine?