Below is the Hive query I am running:
INSERT INTO TABLE temp.table_output
SELECT /*+ STREAMTABLE(tableB) */ c.column1 as client, a.column2 as testData,
CASE WHEN ca.updated_date IS NULL OR ca.updated_date = 'null' THEN null ELSE CONCAT(ca.updated_date, '+0000') END as update
FROM temp.tableA as a
INNER JOIN default.tableB as ca ON a.column5=ca.column2
INNER JOIN default.tableC as c ON ca.column3=c.column1 WHERE a.name='test';
TableB has 2.4 billion rows (140 GB); TableA and TableC each have around 200 million records.
The cluster consists of 3 Cassandra data nodes and 3 analytics nodes (Hive on top of Cassandra), with 130 GB of RAM on each node.
TableA, TableB, and TableC are Hive internal tables.
The Hive cluster heap size is 12 GB.
Can someone tell me why, when I run this Hive query, I hit a heap issue and it fails to complete the task? It is the only job running on the Hive server.
The task fails with the error below:
Caused by: java.io.IOException: Read failed from file: cfs://172.31.x.x/tmp/hive-root/hive_2015-03-17_00-27-25_132_17376615815827139-1/-mr-10002/000049_0
at com.datastax.bdp.hadoop.cfs.CassandraInputStream.read(CassandraInputStream.java:178)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:59)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
... 16 more
Caused by: java.io.IOException: org.apache.thrift.TApplicationException: Internal error processing get_remote_cfs_sblock
at com.datastax.bdp.hadoop.cfs.CassandraFileSystemThriftStore.retrieveSubBlock(CassandraFileSystemThriftStore.java:537)
at com.datastax.bdp.hadoop.cfs.CassandraSubBlockInputStream.subBlockSeekTo(CassandraSubBlockInputStream.java:145)
at com.datastax.bdp.hadoop.cfs.CassandraSubBlockInputStream.read(CassandraSubBlockInputStream.java:95)
at com.datastax.bdp.hadoop.cfs.CassandraInputStream.read(CassandraInputStream.java:159)
... 25 more
Caused by: org.apache.thrift.TApplicationException: Internal error processing get_remote_cfs_sblock
at org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
at org.apache.cassandra.thrift.Dse$Client.recv_get_remote_cfs_sblock(Dse.java:271)
at org.apache.cassandra.thrift.Dse$Client.get_remote_cfs_sblock(Dse.java:254)
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.datastax.bdp.util.CassandraProxyClient.invokeDseClient(CassandraProxyClient.java:655)
at com.datastax.bdp.util.CassandraProxyClient.invoke(CassandraProxyClient.java:631)
at com.sun.proxy.$Proxy5.get_remote_cfs_sblock(Unknown Source)
at com.datastax.bdp.hadoop.cfs.CassandraFileSystemThriftStore.retrieveSubBlock(CassandraFileSystemThriftStore.java:515)
... 28 more
Hive.log
2015-03-17 23:10:39,576 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_r_000023 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,579 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_r_000052 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,582 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_m_000207 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,585 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_r_000087 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,588 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_m_000223 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,591 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_m_000045 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,594 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_m_000235 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,597 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_m_002140 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,761 ERROR exec.Task (SessionState.java:printError(419)) -
Task with the most failures(4):
-----
Task ID:
task_201503171816_0036_m_000036
URL:
http://sjvtncasl064.mcafee.int:50030/taskdetails.jsp?jobid=job_201503171816_0036&tipid=task_201503171816_0036_m_000036
-----
Diagnostic Messages for this Task:
Error: Java heap space
2015-03-17 23:10:39,777 ERROR ql.Driver (SessionState.java:printError(419)) - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
Most of the time, the Hadoop errors on the tracker side are not very descriptive - something like "there was a problem retrieving data from one of the nodes". To find out what really happened, you need to get the system.log and the Hive and Hadoop task logs from each node, especially the one that did not return data back in time, and look at what the problem was there at the time of the error. In OpsCenter you can click on the Hive job in progress and watch what is going on on each node, then see which error caused your job to be interrupted.
Here are some links that I found to be very useful. Some of these links are for older versions of DSE, but they still give a good start on how to optimize Hadoop operations and memory management.
http://www.datastax.com/dev/blog/tuning-dse-hadoop-map-reduce
http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaHivTune.html
https://support.datastax.com/entries/23459322-Tuning-memory-for-Hadoop-tasks
https://support.datastax.com/entries/23472546-Specifying-the-number-of-concurrent-tasks-per-node
You may also want to read this article; sometimes, timeouts are due to major garbage collections.
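As a rough starting point, here is a sketch of the kind of per-session settings those guides discuss - the values are illustrative only and need to be sized against your data and the 130 GB of RAM per node (the number of concurrent task slots per node is a cluster-side setting in mapred-site.xml rather than a per-session one):
SET mapred.child.java.opts=-Xmx2048m;  -- give each map/reduce task JVM more heap than the default
SET io.sort.mb=256;                    -- keep the map-side sort buffer comfortably inside that heap
SET hive.auto.convert.join=false;      -- don't let Hive try to hold a join side in task memory as a map join
SET mapred.max.split.size=134217728;   -- smaller splits mean more, lighter map tasks over the 140 GB table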
HTH
Related
On one of our platforms, the HDFS NameNode shuts down with the following error message every 1 to 3 days:
FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<ip1>:<port>,<ip2>:<port>, etc], stream=QuorumOutputStream starting at txid 29873171))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:109)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:525)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:385)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:55)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:521)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:710)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.run(FSEditLogAsync.java:188)
at java.lang.Thread.run(Thread.java:748)
Before this FATAL log, we see the following kind of log line, in which we can detect a degradation of response time:
WARN client.QuorumJournalManager (QuorumCall.java:waitFor(185)) - Waited 18014 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [<ip1>:<port>,<ip2>:<port>]
Have you already encountered this problem, and do you have any advice on how to fix it?
We have already:
checked that our VMs are time synchronized
detected that when the problem occurs, a burst of data is in progress on the network, though we have not yet found its root cause
checked our network devices; apart from a port that flaps quickly between UP and DOWN states (which we are going to fix), the network seems correct
Thanks in advance
We discovered a duplication issue caused by an external dependency being called numerous times - more times than the actual size of the RDD. The result of the transformation is a consolidated RDD whose size equals the input RDD size, which I suspect is why the problem is hard to discover and why there aren't many solutions documented online. We tried a couple of approaches, described below.
Approach 1: Using the mapPartitionsWithIndex function with persistPartition = true; the result was the same - duplication of data.
Approach 2: Another Stack Overflow solution recommended calling rdd.cache() before the mapPartitions transformation, since the cause was lazy evaluation and cache() can trigger an immediate evaluation of the transformation. This seemed to resolve the issue, but caching led to other memory-related exceptions.
rdd = rdd.mapPartitionsWithIndex(partitionFunction, true);
Results from Spark worker nodes in a 4-node cluster, with an RDD of size 1000:
2019-08-11 13:48:47,487 [Executor task launch worker-0] INFO - Found 215 records in Partition 3
2019-08-11 13:48:47,634 [Executor task launch worker-0] INFO - Found 203 records in Partition 0
2019-08-11 13:49:44,472 [Executor task launch worker-0] INFO - Found 177 records in Partition 2
2019-08-11 13:49:46,252 [Executor task launch worker-0] INFO - Found 201 records in Partition 4
2019-08-11 13:49:44,537 [Executor task launch worker-0] INFO - Found 203 records in Partition 0
2019-08-11 13:48:47,170 [Executor task launch worker-0] INFO - Found 204 records in Partition 1
2019-08-11 13:48:47,410 [Executor task launch worker-0] INFO - Found 177 records in Partition 2
2019-08-11 13:49:16,875 [Executor task launch worker-0] INFO - Found 201 records in Partition 4
Total records across partitions: 1581
Any suggestions from the community on solving this issue? TIA
I recently installed DbVisualizer to query Hive tables. I installed it on my Mac and downloaded/installed the JDBC jar files for Hive from this website: https://s3.amazonaws.com/public-repo-1.hortonworks.com/HDP/hive-jdbc4/1.0.42.1054/Simba_HiveJDBC41_1.0.42.1054.zip
When I connected to our DB and tested the queries, a simple select would work:
select *
from table_name
limit 10
But when I added 'order by' or 'group by':
select *
from table_name
order by rollingtime
limit 10
I got the following error, and I have no clue why. Has anyone had a similar error and do you know how to fix it?
09:56:17 START Executing for: 'NewDev' [Hive], Database: Hive, Schema: sdc
09:56:17 FAILED [SELECT - 0 rows, 0.504 secs] [Code: 500051, SQL State: HY000] [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 2, SQL state: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1516123265840_0008_8_00, diagnostics=[Task failed, taskId=task_1516123265840_0008_8_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : java.lang.NoClassDefFoundError: Could not initialize class org.apache.tez.runtime.library.api.TezRuntimeConfiguration
at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.start(OrderedPartitionedKVOutput.java:107)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:186)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:188)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:172)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:110)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
, errorMessage=Cannot recover from this error:java.lang.NoClassDefFoundError: Could not initialize class org.apache.tez.runtime.library.api.TezRuntimeConfiguration
at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.start(OrderedPartitionedKVOutput.java:107)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:186)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:188)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:172)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:110)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:1, Vertex vertex_1516123265840_0008_8_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE]Vertex killed, vertexName=Reducer 2, vertexId=vertex_1516123265840_0008_8_01, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_1516123265840_0008_8_01 [Reducer 2] killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1, Query: select *
from nomura_qa_mblock_capacity_stage
order by rollingtime
limit 10.
select *
from nomura_qa_mblock_capacity_stage
order by rollingtime
limit 10;
09:56:17 END Execution 1 statement(s) executed, 0 row(s) affected, exec/fetch time: 0.504/0.000 secs [0 successful, 1 errors]
This is sometimes caused by an inability to write to the configured or hardcoded .staging directory, usually because of an incorrect path or invalid permissions.
The MapReduce exec engine is more verbose than the Tez engine in helping to identify the culprit; you can switch to it by running this statement in your Hive shell:
SET hive.execution.engine=mr
You may then be able to see the following error:
Permission denied: user=dbuser, access=WRITE,
inode="/user/dbuser/.staging":hdfs:hdfs:drwxr-xr-x
In this case, the "dbuser" staging directory points to a non-existent path; it should be /home/dbuser/.staging.
At runtime, before executing any queries that perform actions which need staging (sort, order, group, distribute, etc.), and along with setting the exec engine to mr as shown above, set the staging path to a valid parent directory, such as your user's home directory, by running the following statement:
SET yarn.app.mapreduce.am.staging-dir=/home/dbuser/.staging
Depending on version and environment, if that directive doesn't work, you can try
SET hive.exec.stagingdir=/home/dbuser/.staging
Of course, change "dbuser" to your actual home directory (or any other directory to which you have read/write access). Assuming the user running the query has write access, the .staging directory will be created automatically.
More info at http://doc.mapr.com/display/MapR/Default+mapred+Parameters
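Putting it together, a hedged example of the full sequence (the staging path is illustrative - substitute a directory your own user can write to):
SET hive.execution.engine=mr;
SET yarn.app.mapreduce.am.staging-dir=/home/dbuser/.staging;
-- or, on versions/environments where that directive is not honoured:
-- SET hive.exec.stagingdir=/home/dbuser/.staging;
SELECT *
FROM nomura_qa_mblock_capacity_stage
ORDER BY rollingtime
LIMIT 10;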
We have set up a 3-node Cassandra cluster on DigitalOcean and have written a Java program which uses the Java CQL driver to connect to Cassandra. The queries run fine for a while, but after some time we get the following exception:
Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /128.199.98.201:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [/128.199.98.201] Operation timed out))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:231)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:77)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1402)
at com.datastax.driver.core.Cluster.init(Cluster.java:164)
at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:343)
at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:316)
at com.datastax.driver.core.Cluster.connect(Cluster.java:254)
at com.attinad.cantiz.iot.platform.cassandrasample.PagingExample.connect(PagingExample.java:24)
at com.attinad.cantiz.iot.platform.cassandrasample.App.main(App.java:31)
The various timeout values in cassandra.yaml are given below:
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 2000
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
Any idea whether the issue is timeout-related or a coding problem? Any help is appreciated.
I have a Java webapp using an iBATIS row handler to load a very large dataset (1 million rows in an InnoDB table). The process is run as a nightly cron job by the Quartz scheduler. However, after it has been processing for 6 minutes, it dies with the following stack trace:
WARN [DefaultQuartzScheduler_Worker-8] MethodInvokingJobDetailFactoryBean$MethodInvokingJob.executeInternal(168) | Could not invoke method 'doBatch' on target object [org.myCron#4adb34]
org.springframework.jdbc.UncategorizedSQLException: SqlMapClient operation: encountered SQLException [
--- The error occurred in org/myCron/mySqlMap.xml.
--- The error occurred while applying a result map.
--- Check the mySqlMap.outputMapping.
--- The error happened while setting a property on the result object.
--- Cause: com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:
** BEGIN NESTED EXCEPTION **
java.io.EOFException
STACKTRACE:
java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1903)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2402)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2860)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6106)
at org.apache.commons.dbcp.DelegatingResultSet.next(DelegatingResultSet.java:168)
at sun.reflect.GeneratedMethodAccessor71.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:592)
at com.ibatis.common.jdbc.logging.ResultSetLogProxy.invoke(ResultSetLogProxy.java:47)
at $Proxy10.next(Unknown Source)
at com.ibatis.sqlmap.engine.execution.SqlExecutor.handleResults(SqlExecutor.java:380)
at com.ibatis.sqlmap.engine.execution.SqlExecutor.handleMultipleResults(SqlExecutor.java:301)
at com.ibatis.sqlmap.engine.execution.SqlExecutor.executeQuery(SqlExecutor.java:190)
at com.ibatis.sqlmap.engine.mapping.statement.GeneralStatement.sqlExecuteQuery(GeneralStatement.java:205)
at com.ibatis.sqlmap.engine.mapping.statement.GeneralStatement.executeQueryWithCallback(GeneralStatement.java:173)
at com.ibatis.sqlmap.engine.mapping.statement.GeneralStatement.executeQueryWithRowHandler(GeneralStatement.java:133)
at com.ibatis.sqlmap.engine.impl.SqlMapExecutorDelegate.queryWithRowHandler(SqlMapExecutorDelegate.java:649)
at com.ibatis.sqlmap.engine.impl.SqlMapSessionImpl.queryWithRowHandler(SqlMapSessionImpl.java:156)
at com.ibatis.sqlmap.engine.impl.SqlMapClientImpl.queryWithRowHandler(SqlMapClientImpl.java:133)
at org.springframework.orm.ibatis.SqlMapClientTemplate$5.doInSqlMapClient(SqlMapClientTemplate.java:267)
at org.springframework.orm.ibatis.SqlMapClientTemplate.execute(SqlMapClientTemplate.java:165)
at org.springframework.orm.ibatis.SqlMapClientTemplate.queryWithRowHandler(SqlMapClientTemplate.java:265)
at org.myCron.doBatch(MyCron.java:57)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:592)
at org.springframework.util.MethodInvoker.invoke(MethodInvoker.java:248)
at org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean$MethodInvokingJob.executeInternal(MethodInvokingJobDetailFactoryBean.java:165)
at org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:66)
at org.quartz.core.JobRunShell.run(JobRunShell.java:191)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:516)
** END NESTED EXCEPTION **
The stack trace is very vague. The only hint that I see is 'the error happened while setting a property on the result object'. There are only two properties on the result object: a String and an Integer. Both of them permit null values, but my select statements indicate that neither of them has any null values. They both have a proper getter/setter (which makes sense, since the process runs successfully for a while before dying). Every time the cron runs, it dies at a random point (so it isn't stuck on a particular row).
Note - the method 'doBatch' does exist, since that is the method that starts the cron process. If it couldn't find doBatch, it couldn't have successfully processed the first thousand rows.
I've also tried running the job outside of Quartz, and it fails there as well. We tried increasing our MySQL net_read_timeout, net_write_timeout, and delayed_insert_timeout, but none of these settings helped. I also tried setting my log4j level to DEBUG and did not get any helpful info.
Any other ideas about what I could try?
Sounds like MySQL closed the connection for some reason. Check the MySQL log to see if anything shows up. Turn on various logging options for MySQL if necessary.
Also, start printing debug data (including timestamps) from your app - just print everything, then see what the last action was - perhaps you have some rarely triggered condition in your code that has a bug.
I.e. every single time you talk to MySQL, log it before AND after.
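On the MySQL side, a sketch of the kind of logging and checks that can help (statement availability depends on your MySQL version, and the general log grows very quickly on a busy server):
SET GLOBAL general_log = 'ON';         -- log every statement the server receives (MySQL 5.1+; older servers use the --log startup option)
SET GLOBAL log_output = 'FILE';        -- send the general log to a file rather than a table
SHOW VARIABLES LIKE 'net_%timeout';    -- confirm the net_read_timeout / net_write_timeout values actually in effect
SHOW GLOBAL STATUS LIKE 'Aborted_c%';  -- Aborted_clients / Aborted_connects count connections MySQL dropped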