apoc.periodic.iterate fails with exception: java.util.concurrent.RejectedExecutionException

I am trying to run the annotation function of GraphAware within Neo4j (see documentation here). I have a set of 5000 nodes (KnowledgeArticles) with textual data in the content property. To annotate them I run the following query in Neo4j Desktop:
CALL apoc.periodic.iterate(
"MATCH (n:KnowledgeArticle) RETURN n",
"CALL ga.nlp.annotate({text: n.content, id: id(n)})
YIELD result MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)", {batchSize:1, iterateList:true})
After annotating approximately 200 to 300 KnowledgeArticles, the database shuts down with the following error:
Neo.ClientError.Procedure.ProcedureCallFailed: Failed to invoke procedure `apoc.periodic.iterate`: Caused by:
java.util.concurrent.RejectedExecutionException: Task
java.util.concurrent.FutureTask@373b81ee rejected from
java.util.concurrent.ThreadPoolExecutor@285a2901[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 288]
I have experimented with different values for batchSize and with setting iterateList to false, but none of this helped.
I have also tried limiting the above iterate call to 150 nodes. This works fine the first time I call it, but when I run it a second time it gives the same error again, stating that the number of completed tasks is around 200 to 300. The thread pool in the background thus seems to 'remember' the total number of tasks it has run since the database was first started.
Could you help me resolve this issue? I don't necessarily want to run the above query from Neo4j Desktop; eventually I want to run it with py2neo from Python using graph.run([iterate-query]). If there is any way of solving this from Python, that would be even better.
Thank you!
P.S. The debug log shows the following output (the last few iterations of the annotation, up until the shutdown):
2019-05-21 12:46:10.359+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] Start storing annotatedText 251906
2019-05-21 12:46:13.784+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] end storing annotatedText 251906. It took: 3425
2019-05-21 12:46:13.786+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:13.788+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:13.800+0000 INFO [c.g.n.u.ProcessorUtils] Taking default pipeline from configuration : myPipeline
2019-05-21 12:46:13.868+0000 INFO [c.g.n.p.s.StanfordTextProcessor] Time for pipeline annotation (myPipeline): 67. Text length: 954
2019-05-21 12:46:13.869+0000 INFO [c.g.n.NLPManager] Time to annotate 68
2019-05-21 12:46:13.869+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:13.869+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] Start storing annotatedText 251907
2019-05-21 12:46:15.848+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] end storing annotatedText 251907. It took: 1978
2019-05-21 12:46:15.848+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:15.862+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:15.915+0000 INFO [c.g.n.u.ProcessorUtils] Taking default pipeline from configuration : myPipeline
2019-05-21 12:46:16.294+0000 INFO [c.g.n.p.s.StanfordTextProcessor] Time for pipeline annotation (myPipeline): 378. Text length: 2641
2019-05-21 12:46:16.295+0000 INFO [c.g.n.NLPManager] Time to annotate 379
2019-05-21 12:46:16.296+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:16.296+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] Start storing annotatedText 251908
2019-05-21 12:46:16.421+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Database graph.db is unavailable.
2019-05-21 12:46:17.018+0000 INFO [c.g.s.f.b.GraphAwareServerBootstrapper] stopped
2019-05-21 12:46:17.020+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started
2019-05-21 12:46:17.149+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutting down 'graph.db' database.
2019-05-21 12:46:17.150+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started
2019-05-21 12:46:17.164+0000 INFO [o.n.b.i.BackupServer] BackupServer communication server shutting down and unbinding from /127.0.0.1:6362
2019-05-21 12:46:17.226+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by database shutdown @ txId: 7720 checkpoint started...
2019-05-21 12:46:17.247+0000 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 7720 to [/Users/{my.user.name}/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-e2babea7-0332-4c2c-bf1d-076d4feed49a/installation-3.5.4/data/databases/graph.db/neostore.counts.db.a], from [/Users/{my.user.name}/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-e2babea7-0332-4c2c-bf1d-076d4feed49a/installation-3.5.4/data/databases/graph.db/neostore.counts.db.b].
2019-05-21 12:46:17.644+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by database shutdown @ txId: 7720 checkpoint completed in 418ms
2019-05-21 12:46:17.647+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 3
2019-05-21 12:46:17.698+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics START ---
2019-05-21 12:46:17.700+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics END ---
2019-05-21 12:46:17.706+0000 INFO [c.g.r.BaseGraphAwareRuntime] Shutting down GraphAware Runtime...
2019-05-21 12:46:17.709+0000 INFO [c.g.r.m.BaseModuleManager] Shutting down module UIDM
2019-05-21 12:46:17.709+0000 INFO [c.g.r.m.BaseModuleManager] Shutting down module NLP
2019-05-21 12:46:17.712+0000 INFO [c.g.r.s.RotatingTaskScheduler] Terminating task scheduler...
2019-05-21 12:46:17.712+0000 INFO [c.g.r.s.RotatingTaskScheduler] Task scheduler terminated successfully.
2019-05-21 12:46:17.714+0000 INFO [c.g.r.BaseGraphAwareRuntime] GraphAware Runtime shut down.

Related

Kafka on the server failed the next day (EndOfStreamException: Unable to read additional data from client)

The Kafka service I deployed on the server passed testing a few days ago and was running normally; no other operations were carried out afterwards.
Then a service call failed today. I checked, and it had been running normally these past days; just this afternoon the service hung. Looking at the log, it seems the Kafka service restarted automatically and then ran into a conflict with its ID in ZooKeeper.
After restarting the service, everything is normal again. Google doesn't have a good solution. What could the reason be?
The relevant log is below; please help take a look, thank you.
[2021-11-06 14:43:29,139] WARN Unexpected exception (org.apache.zookeeper.server.NIOServerCnxn)
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /159.89.4.236:59168, session = 0x0
at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2021-11-06 15:00:18,301] INFO Completed load of Log(dir=/tmp/kafka-logs/__consumer_offsets-46, topicId=dPs0Kq4ERHKDaR440Nt8qA, topic=__consumer_offsets, partition=46, highWatermark=0, lastStableOffset=0, logStartOffset=0, logEndOffset=0) with 1 segments in 20ms (53/53 loaded in /tmp/kafka-logs) (kafka.log.LogManager)
[2021-11-06 15:00:18,303] INFO Loaded 53 logs in 2350ms. (kafka.log.LogManager)
[2021-11-06 15:00:18,303] INFO Starting log cleanup with a period of 300000 ms. (kafka.log.LogManager)
[2021-11-06 15:00:18,304] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
[2021-11-06 15:00:18,838] INFO [BrokerToControllerChannelManager broker=0 name=forwarding]: Starting (kafka.server.BrokerToControllerRequestThread)
[2021-11-06 15:00:19,148] INFO Updated connection-accept-rate max connection creation rate to 2147483647 (kafka.network.ConnectionQuotas)
[2021-11-06 15:00:19,165] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.Acceptor)
[2021-11-06 15:00:19,237] INFO [SocketServer listenerType=ZK_BROKER, nodeId=0] Created data-plane acceptor and processors for endpoint : ListenerName(PLAINTEXT) (kafka.network.SocketServer)
[2021-11-06 15:00:19,282] INFO [BrokerToControllerChannelManager broker=0 name=alterIsr]: Starting (kafka.server.BrokerToControllerRequestThread)
[2021-11-06 15:00:19,400] INFO [ExpirationReaper-0-DeleteRecords]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2021-11-06 15:00:19,400] INFO [ExpirationReaper-0-Fetch]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2021-11-06 15:00:19,400] INFO [ExpirationReaper-0-Produce]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2021-11-06 15:00:19,409] INFO [ExpirationReaper-0-ElectLeader]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2021-11-06 15:00:19,451] INFO [LogDirFailureHandler]: Starting (kafka.server.ReplicaManager$LogDirFailureHandler)
[2021-11-06 15:00:19,530] INFO Creating /brokers/ids/0 (is it secure? false) (kafka.zk.KafkaZkClient)
[2021-11-06 15:00:19,566] ERROR Error while creating ephemeral at /brokers/ids/0, node already exists and owner '72057654275276801' does not match current session '72057654275276802' (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2021-11-06 15:00:19,572] ERROR [KafkaServer id=0] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1904)
at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1842)
at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1809)
at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:96)
at kafka.server.KafkaServer.startup(KafkaServer.scala:319)
at kafka.Kafka$.main(Kafka.scala:109)
at kafka.Kafka.main(Kafka.scala)
[2021-11-06 15:00:19,596] INFO [KafkaServer id=0] shutting down (kafka.server.KafkaServer)
[2021-11-06 15:00:19,597] INFO [SocketServer listenerType=ZK_BROKER, nodeId=0] Stopping socket server request processors (kafka.network.SocketServer)

Apache Camel Route is getting started automatically

The route loadfile starts automatically when I start the main class.
On exception, when the process should finish, it starts loadfile again and again.
It should be started from the timer, which should then call the loadfile route, but loadfile is starting independently as well as from the timer.
CamelContext context = new DefaultCamelContext(sr);
try {
    context.addRoutes(new RouteBuilder() {
        @Override
        public void configure() throws Exception {
            onException(Exception.class)
                .log(LoggingLevel.INFO, "Extype:${exception.message}")
                .stop();
            from("timer://alertstrigtimer?period=60s&repeatCount=1")
                .startupOrder(1)
                .log(LoggingLevel.INFO, "*******************************Job-Alert-System: Started: alertstrigtimer******************************")
                .to("direct:loadFile").stop();
            from("direct:loadFile").routeId("loadfile")
                .log(LoggingLevel.INFO, "*******************************Job-Alert-System: Started: direct:loadFile******************************")
                .from(getTriggerFileURI(getWorkFilePath(), getWorkFileName())).choice()
                .
                .
        }
    });
    context.start();
    Thread.sleep(40000);
The following is the log:
[main] INFO org.apache.camel.impl.DefaultCamelContext - Apache Camel 2.21.1 (CamelContext: camel-1) is starting
[main] INFO org.apache.camel.management.ManagedManagementStrategy - JMX is enabled
[main] INFO org.apache.camel.impl.converter.DefaultTypeConverter - Type converters loaded (core: 194, classpath: 14)
[main] INFO org.apache.camel.impl.DefaultCamelContext - StreamCaching is not in use. If using streams then its recommended to enable stream caching. See more details at http://camel.apache.org/stream-caching.html
[main] INFO org.apache.camel.impl.DefaultCamelContext - Route: route1 started and consuming from: timer://alertstrigtimer?period=60s&repeatCount=1
[main] INFO org.apache.camel.impl.DefaultCamelContext - Skipping starting of route loadfile as its configured with autoStartup=false
[main] INFO org.apache.camel.impl.DefaultCamelContext - Route: loadDataAndAlerts started and consuming from: direct://loadDataAndAlerts
[main] INFO org.apache.camel.impl.DefaultCamelContext - Total 4 routes, of which 2 are started
[main] INFO org.apache.camel.impl.DefaultCamelContext - Apache Camel 2.21.1 (CamelContext: camel-1) started in 0.761 seconds
[Camel (camel-1) thread #1 - timer://alertstrigtimer] INFO route1 - *******************************Job-Alert-System: Started: alertstrigtimer******************************
[Camel (camel-1) thread #2 - timer://alertstrigtimer] INFO loadfile - *******************************Job-Alert-System: Started: direct:loadFile******************************
[Camel (camel-1) thread #1 - file://null] INFO loadfile - *******************************Job-Alert-System: Started: direct:loadFile******************************
The problem could be caused by this line .from(getTriggerFileURI(getWorkFilePath(), getWorkFileName())) in the loadfile route. A route with multiple from endpoints is known as Multiple Input, and this pattern was removed in Camel 3.x.
From RedHat,
from("URI1").from("URI2").from("URI3").to("DestinationUri");
..., exchanges from each of the input endpoints,
URI1, URI2, and URI3, are processed independently of each other and in
separate threads. In fact, you can think of the preceding route as
being equivalent to the following three separate routes:
from("URI1").to("DestinationUri");
from("URI2").to("DestinationUri");
from("URI3").to("DestinationUri");
Rather than using multiple from endpoints (an extra independent input), try the content enricher pattern (pollEnrich for the file component); a sketch follows below.
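A minimal sketch of what that could look like in the Camel 2.x Java DSL; the file URI and the poll timeout are illustrative placeholders, and the real endpoint would come from getTriggerFileURI(getWorkFilePath(), getWorkFileName()):

import org.apache.camel.LoggingLevel;
import org.apache.camel.builder.RouteBuilder;

public class LoadFileRoutes extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        // The timer route stays the only scheduled consumer.
        from("timer://alertstrigtimer?period=60s&repeatCount=1")
            .routeId("alertstrigtimer")
            .log(LoggingLevel.INFO, "Job-Alert-System: timer fired")
            .to("direct:loadFile");

        // loadfile now has a single input (direct:loadFile); the file endpoint is
        // polled on demand instead of acting as a second, independent consumer.
        from("direct:loadFile")
            .routeId("loadfile")
            // URI and 10-second timeout are placeholders, not taken from the original route.
            .pollEnrich("file://work/dir?fileName=work.txt", 10000)
            .log(LoggingLevel.INFO, "Loaded ${header.CamelFileName}; continue with the original choice() logic here");
    }
}

With this layout the loadfile route has exactly one input, so it should no longer start consuming from the file endpoint on its own.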

Poor performance on Grails 3.2.9 application startup

Recently I've been doing an update of an application from Grails 2.5.5 to Grails 3.2.9.
The application is serving ~3K RPM.
The issue I currently have is poor performance right after application startup. Our normal release process works in the following way (assuming a 2-node setup):
1. Two nodes with the service are running.
2. Turn off the first node.
3. Release the new version on the first node.
4. The service on the first node registers itself in Eureka and starts consuming requests.
5. (repeat the same on the second node)
Problems start to appear at the 4th step. The application responds quite slowly and timings are really inconsistent: some responses are in the expected time range, but some are far off normal.
Sample logs:
2017-09-01 08:03:38,594 INFO [request][http-nio-12345-exec-72][][]- END controller=98ms
2017-09-01 08:03:38,911 INFO [request][http-nio-12345-exec-101][][]- END controller=134ms
2017-09-01 08:03:38,948 INFO [request][http-nio-12345-exec-56][][]- END controller=211ms
2017-09-01 08:03:39,156 INFO [request][http-nio-12345-exec-82][][]- END controller=95ms
2017-09-01 08:03:39,124 INFO [request][http-nio-12345-exec-111][][]- END controller=98ms
2017-09-01 08:03:39,184 INFO [request][http-nio-12345-exec-110][][]- END controller=4099ms
2017-09-01 08:03:39,399 INFO [request][http-nio-12345-exec-46][][]- END controller=24ms
2017-09-01 08:03:39,428 INFO [request][http-nio-12345-exec-43][][]- END controller=191ms
2017-09-01 08:03:39,744 INFO [request][http-nio-12345-exec-83][][]- END controller=117ms
2017-09-01 08:03:40,335 INFO [request][http-nio-12345-exec-56][][]- END controller=483ms
2017-09-01 08:03:45,595 INFO [request][http-nio-12345-exec-110][][]- END controller=5623ms
2017-09-01 08:03:45,618 INFO [request][http-nio-12345-exec-83][][]- END controller=5274ms
2017-09-01 08:03:45,629 INFO [request][http-nio-12345-exec-144][][]- END controller=2007ms
2017-09-01 08:03:45,671 INFO [request][http-nio-12345-exec-119][][]- END controller=4591ms
As you can see, a few requests completed in under 100 ms while some took more than 5 seconds.
My assumption is that this happens due to the slow warm-up of the Grails 3 application and lazy class loading.
Things I've already done:
grails.gorm.autowire = false
grails.gorm.reactor.events = false
Delay registration of the service in Eureka for 30 seconds (to wait until the application is fully loaded)
The next thing which comes to my mind is to compile the project with the @CompileStatic annotation.

MapReduce input from HTable on AWS timeout

I'm having a bit of trouble figuring out how to execute a simple MapReduce job with input sourced from an HTable using emr-5.4.0.
When I ran it on EMR, it failed because of a timeout (emr-5.3.0 also failed).
I have done a bunch of Google searching to find out how to proceed, but couldn't find anything useful.
My process:
I created an EMR cluster using HBase. The versions are:
Amazon 2.7.3, Ganglia 3.7.2, HBase 1.3.0, Hive 2.1.1, Hue 3.11.0,
Phoenix 4.9.0
Following the sample from the manual (http://hbase.apache.org/book.html#mapreduce.example), I wrote my job like this:
public class TableMapTest3 {

    // TableMapper
    public static class MyMapper extends TableMapper<Text, Text> {
        protected void map(ImmutableBytesWritable key, Result inputValue, Context context)
                throws IOException, InterruptedException {
            String keyS = new String(key.get(), "UTF-8");
            String value = new String(inputValue.getValue(Bytes.toBytes("contents"), Bytes.toBytes("name")), "UTF-8");
            System.out.println("TokenizerMapper :" + value);
            context.write(new Text(keyS), new Text(value));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        System.out.println("url:" + conf.get("fs.defaultFS"));
        System.out.println("hbase.zookeeper.quorum:" + conf.get("hbase.zookeeper.quorum"));
        Connection conn = ConnectionFactory.createConnection(conf);
        Admin admin = conn.getAdmin();
        String tableName = "TableMapTest";
        TableName tablename = TableName.valueOf(tableName);
        Table hTable = null;
        // check table exists
        if (admin.tableExists(tablename)) {
            System.out.println(tablename + " table existed...");
            hTable = conn.getTable(tablename);
            ResultScanner resultScanner = hTable.getScanner(new Scan());
            for (Result result : resultScanner) {
                Delete delete = new Delete(result.getRow());
                hTable.delete(delete);
            }
        } else {
            HTableDescriptor tableDesc = new HTableDescriptor(tablename);
            tableDesc.addFamily(new HColumnDescriptor("contents"));
            admin.createTable(tableDesc);
            System.out.println(tablename + " table created...");
            hTable = conn.getTable(tablename);
        }
        // insert data
        for (int i = 0; i < 20; i++) {
            Put put = new Put(Bytes.toBytes(String.valueOf(i)));
            put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("name"), Bytes.toBytes("value" + i));
            hTable.put(put);
        }
        hTable.close();
        // Hadoop
        Job job = Job.getInstance(conf, TableMapTest3.class.getSimpleName());
        job.setJarByClass(TableMapTest3.class);
        job.setOutputFormatClass(NullOutputFormat.class);
        Scan scan = new Scan();
        TableMapReduceUtil.initTableMapperJob(tableName, scan, MyMapper.class, Text.class, Text.class, job);
        System.out.println("TableMapTest result:" + job.waitForCompletion(true));
    }
}
I packaged my source into a jar and uploaded it to the cluster, then SSH'd onto the master and ran my job:
hadoop jar zz-0.0.1.jar com.ziki.zz.TableMapTest3
I got the following messages:
url:hdfs://ip-xxx.ap-northeast-1.compute.internal:8020
hbase.zookeeper.quorum:localhost
TableMapTest table created...
17/05/05 01:31:23 INFO impl.TimelineClientImpl: Timeline service address: http://ip-xxx.ap-northeast-1.compute.internal:8188/ws/v1/timeline/
17/05/05 01:31:23 INFO client.RMProxy: Connecting to ResourceManager at ip-xxx.ap-northeast-1.compute.internal/172.31.4.228:8032
17/05/05 01:31:24 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/05/05 01:31:31 INFO mapreduce.JobSubmitter: number of splits:1
17/05/05 01:31:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1493947058255_0001
17/05/05 01:31:33 INFO impl.YarnClientImpl: Submitted application application_1493947058255_0001
17/05/05 01:31:34 INFO mapreduce.Job: The url to track the job: http://ip-xxx.ap-northeast-1.compute.internal:20888/proxy/application_1493947058255_0001/
17/05/05 01:31:34 INFO mapreduce.Job: Running job: job_1493947058255_0001
17/05/05 01:31:57 INFO mapreduce.Job: Job job_1493947058255_0001 running in uber mode : false
17/05/05 01:31:57 INFO mapreduce.Job: map 0% reduce 0%
After a while, I get the error:
17/05/05 01:42:26 INFO mapreduce.Job: Task Id : attempt_1493947058255_0001_m_000000_0, Status : FAILED
AttemptID:attempt_1493947058255_0001_m_000000_0 Timed out after 600 secs
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
17/05/05 01:52:56 INFO mapreduce.Job: Task Id : attempt_1493947058255_0001_m_000000_1, Status : FAILED
AttemptID:attempt_1493947058255_0001_m_000000_1 Timed out after 600 secs
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
And some syslogs:
2017-05-05 01:31:59,664 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1493947058255_0001_m_000000 Task Transitioned from SCHEDULED to RUNNING
2017-05-05 01:32:08,168 INFO [Socket Reader #1 for port 33348] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1493947058255_0001 (auth:SIMPLE)
2017-05-05 01:32:08,227 INFO [IPC Server handler 0 on 33348] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID : jvm_1493947058255_0001_m_000002 asked for a task
2017-05-05 01:32:08,231 INFO [IPC Server handler 0 on 33348] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: jvm_1493947058255_0001_m_000002 given task: attempt_1493947058255_0001_m_000000_0
2017-05-05 01:42:25,382 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1493947058255_0001_m_000000_0: AttemptID:attempt_1493947058255_0001_m_000000_0 Timed out after 600 secs
2017-05-05 01:42:25,389 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1493947058255_0001_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
2017-05-05 01:42:25,392 INFO [ContainerLauncher #1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1493947058255_0001_01_000002 taskAttempt attempt_1493947058255_0001_m_000000_0
2017-05-05 01:42:25,392 INFO [ContainerLauncher #1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1493947058255_0001_m_000000_0
2017-05-05 01:42:25,394 INFO [ContainerLauncher #1] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : ip-xxx.ap-northeast-1.compute.internal:8041
2017-05-05 01:42:25,457 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1493947058255_0001_m_000000_0 TaskAttempt Transitioned from FAIL_CONTAINER_CLEANUP to FAIL_TASK_CLEANUP
2017-05-05 01:42:25,458 INFO [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: TASK_ABORT
2017-05-05 01:42:25,460 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1493947058255_0001_m_000000_0 TaskAttempt Transitioned from FAIL_TASK_CLEANUP to FAILED
2017-05-05 01:42:25,495 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved ip-xxx.ap-northeast-1.compute.internal to /default-rack
2017-05-05 01:42:25,500 INFO [Thread-83] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 1 failures on node ip-xxx.ap-northeast-1.compute.internal
2017-05-05 01:42:25,502 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1493947058255_0001_m_000000_1 TaskAttempt Transitioned from NEW to UNASSIGNED
2017-05-05 01:42:25,503 INFO [Thread-83] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Added attempt_1493947058255_0001_m_000000_1 to list of failed maps
2017-05-05 01:42:25,557 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:3 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:1 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:1 RackLocal:0
2017-05-05 01:42:25,582 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1493947058255_0001: ask=1 release= 0 newContainers=0 finishedContainers=1 resourcelimit=<memory:1024, vCores:1> knownNMs=2
2017-05-05 01:42:25,582 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1493947058255_0001_01_000002
2017-05-05 01:42:25,583 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1493947058255_0001_m_000000_0: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
I just use the default settings and run a simple job. Why do these errors happen?
If I'm missing anything, let me know!
Anyway, thanks for your help - appreciate it!
I found the answer here:
You can't use HBaseConfiguration (because it defaults to a localhost quorum). What you'll have to do is use the configuration that Amazon sets up for you (located in /etc/hbase/conf/hbase-site.xml).
The connection code looks like this:
Configuration conf = new Configuration();
String hbaseSite = "/etc/hbase/conf/hbase-site.xml";
conf.addResource(new File(hbaseSite).toURI().toURL());
HBaseAdmin.checkHBaseAvailable(conf);
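Applied to the job above, the whole setup might look roughly like this. This is only a sketch: the class name is hypothetical, the hbase-site.xml path is the EMR default mentioned in the answer, and the mapper is reused from TableMapTest3.

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class TableMapTestEmr {
    public static void main(String[] args) throws Exception {
        // Load the cluster configuration EMR writes out, so the mappers get the
        // real ZooKeeper quorum instead of "localhost".
        Configuration conf = new Configuration();
        conf.addResource(new File("/etc/hbase/conf/hbase-site.xml").toURI().toURL());
        HBaseAdmin.checkHBaseAvailable(conf); // fail fast if the quorum is still wrong

        // Job setup as in the original code, but driven by the EMR configuration.
        Job job = Job.getInstance(conf, TableMapTestEmr.class.getSimpleName());
        job.setJarByClass(TableMapTestEmr.class);
        job.setOutputFormatClass(NullOutputFormat.class);
        TableMapReduceUtil.initTableMapperJob("TableMapTest", new Scan(),
                TableMapTest3.MyMapper.class, Text.class, Text.class, job);
        System.out.println("TableMapTest result:" + job.waitForCompletion(true));
    }
}

Running it with hadoop jar as before, the map tasks should then connect to the cluster's ZooKeeper quorum rather than timing out against localhost.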

Quartz Job executed multiple times simultaneously by each cluster machine, rather than one time by one machine for the entire cluster

Goal:
* Have Job1 run once for a three-node cluster every 10 minutes, and Job2 run once for the same cluster every 5 minutes. Each job generates an email; so at 10:55am I should receive only one Job2 email from the cluster, and at 11:00am I should receive one Job1 email and one Job2 email from the cluster, at 11:05am I should receive only one Job2 email from the cluster, and so on...
Problem:
* Job1 is being run multiple times every 10 minutes on each node in the cluster, and the same for Job2 (except every 5 minutes). This leads to many, many more than one or two emails.
Configuration:
* Three-node linux cluster
* Each machine NTP configured and time-sync'd
* Oracle DB
* Quartz v2.2.0 (cluster mode)
* Jobs configured via CronTrigger
* Each node has an instance of the same standalone Java application running on it, and the Java application instantiates an instance of the quartz scheduler in cluster-mode.
* quartz.properties files are identical on each machine.
I have investigated all the obvious potential causes, but nothing explains it or presents a fix. I have even tried inserting an artificial 10-second sleep instruction in the job, to ensure that it doesn't finish in under a second. Please find relevant artifacts below (quartz.properties and log output). Any help would be greatly appreciated!
Artifact #1:
============================================================================
============================================================================
Q U A R T Z --- P R O P E R T I E S
==================
#============================================================================
# Configure Main Scheduler Properties
#============================================================================
org.quartz.scheduler.instanceName: MyQrtzScheduler
org.quartz.scheduler.instanceId: AUTO
org.quartz.scheduler.skipUpdateCheck: true
#============================================================================
# Configure ThreadPool
#============================================================================
org.quartz.threadPool.class: org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount: 1
org.quartz.threadPool.threadPriority: 5
#============================================================================
# Configure JobStore
#============================================================================
org.quartz.jobStore.misfireThreshold: 2592000000
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.oracle.OracleDelegate
org.quartz.jobStore.useProperties=false
org.quartz.jobStore.dataSource=myDS
org.quartz.jobStore.tablePrefix=QRTZ_
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=60000
#============================================================================
# Other Example Delegates
#============================================================================
#org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.DB2v6Delegate
#org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.DB2v7Delegate
#org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.DriverDelegate
#org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.HSQLDBDelegate
#org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.MSSQLDelegate
#org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.PointbaseDelegate
#org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
#org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
#org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.WebLogicDelegate
#org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.oracle.OracleDelegate
#org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.oracle.WebLogicOracleDelegate
#============================================================================
# Configure Datasources
#============================================================================
org.quartz.dataSource.myDS.driver: oracle.jdbc.driver.OracleDriver
org.quartz.dataSource.myDS.URL: jdbc:oracle:thin:@myServer:myPort:blah
org.quartz.dataSource.myDS.user: myDBUser
org.quartz.dataSource.myDS.password: myDBPassword
org.quartz.dataSource.myDS.maxConnections: 2
org.quartz.dataSource.myDS.validationQuery: select 0
#============================================================================
# Configure Plugins
#============================================================================
org.quartz.plugin.shutdownHook.class: org.quartz.plugins.management.ShutdownHookPlugin
org.quartz.plugin.shutdownHook.cleanShutdown: true
org.quartz.plugin.triggerHistory.class=org.quartz.plugins.history.LoggingTriggerHistoryPlugin
org.quartz.plugin.jobHistory.class=org.quartz.plugins.history.LoggingJobHistoryPlugin
Artifact #2:
============================================================================
============================================================================
L O G --- O U T P U T
==================
2015-01-29 12:56:16,602 [main] INFO com.mycompany.myapp.jobs.QuartzHelper - Initializing Quartz scheduler...
2015-01-29 12:56:16,829 [main] INFO org.quartz.impl.StdSchedulerFactory - Using default implementation for ThreadExecutor
2015-01-29 12:56:16,855 [main] INFO org.quartz.core.SchedulerSignalerImpl - Initialized Scheduler Signaller of type: class org.quartz.core.SchedulerSignalerImpl
2015-01-29 12:56:16,855 [main] INFO org.quartz.core.QuartzScheduler - Quartz Scheduler v.2.2.0 created.
2015-01-29 12:56:16,857 [main] INFO org.quartz.plugins.management.ShutdownHookPlugin - Registering Quartz shutdown hook.
2015-01-29 12:56:16,859 [main] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - Using db table-based data access locking (synchronization).
2015-01-29 12:56:16,864 [main] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - JobStoreTX initialized.
2015-01-29 12:56:16,865 [main] INFO org.quartz.core.QuartzScheduler - Scheduler meta-data: Quartz Scheduler (v2.2.0) 'MyQrtzScheduler' with instanceId 'node1_1422554176832'
Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.
NOT STARTED.
Currently in standby mode.
Number of jobs executed: 0
Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 1 threads.
Using job-store 'org.quartz.impl.jdbcjobstore.JobStoreTX' - which supports persistence. and is clustered.
2015-01-29 12:56:16,865 [main] INFO org.quartz.impl.StdSchedulerFactory - Quartz scheduler 'MyQrtzScheduler' initialized from specified file: '/my/install/directory/quartz.properties'
2015-01-29 12:56:16,866 [main] INFO org.quartz.impl.StdSchedulerFactory - Quartz scheduler version: 2.2.0
2015-01-29 12:56:16,866 [main] INFO com.mycompany.myapp.jobs.QuartzHelper - Quartz scheduler initialized successfully.
2015-01-29 12:59:53,450 [MyQrtzScheduler_QuartzSchedulerThread] DEBUG org.quartz.core.QuartzSchedulerThread - batch acquisition of 1 triggers
2015-01-29 13:00:00,007 [MyQrtzScheduler_QuartzSchedulerThread] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore - Lock 'TRIGGER_ACCESS' is desired by: MyQrtzScheduler_QuartzSchedulerThread
2015-01-29 13:00:00,008 [MyQrtzScheduler_QuartzSchedulerThread] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore - Lock 'TRIGGER_ACCESS' is being obtained: MyQrtzScheduler_QuartzSchedulerThread
2015-01-29 13:00:00,809 [MyQrtzScheduler_QuartzSchedulerThread] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore - Lock 'TRIGGER_ACCESS' given to: MyQrtzScheduler_QuartzSchedulerThread
2015-01-29 13:00:00,836 [MyQrtzScheduler_QuartzSchedulerThread] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore - Lock 'TRIGGER_ACCESS' returned by: MyQrtzScheduler_QuartzSchedulerThread
2015-01-29 13:00:00,839 [MyQrtzScheduler_QuartzSchedulerThread] DEBUG org.quartz.simpl.PropertySettingJobFactory - Producing instance of Job 'node2_1422546730757.Job1', class=com.mycompany.myapp.job.Job1
2015-01-29 13:00:00,851 [MyQrtzScheduler_Worker-1] INFO org.quartz.plugins.history.LoggingTriggerHistoryPlugin - Trigger node2_1422546730757.Job1Trigger fired job node2_1422546730757.Job1 at: 13:00:00 01/29/2015
2015-01-29 13:00:00,852 [MyQrtzScheduler_Worker-1] INFO org.quartz.plugins.history.LoggingJobHistoryPlugin - Job node2_1422546730757.Job1 fired (by trigger node2_1422546730757.Job1Trigger) at: 13:00:00 01/29/2015
2015-01-29 13:00:00,852 [MyQrtzScheduler_Worker-1] DEBUG org.quartz.core.JobRunShell - Calling execute on job node2_1422546730757.Job1
2015-01-29 13:00:00,853 [MyQrtzScheduler_Worker-1] INFO com.mycompany.myapp.job.Job1 - ***Executing Inbound File SLA Job...
2015-01-29 13:00:02,054 [MyQrtzScheduler_Worker-1] INFO com.mycompany.myapp.job.Job1 - ***Inbound File SLA Job: No SLA breaches found...
2015-01-29 13:00:02,150 [MyQrtzScheduler_Worker-1] INFO com.mycompany.myapp.job.Job1 - Job1 completed successfully in [1297ms]; sleeping [63703ms] to meet the required minimum runtime for quartz-jobs
2015-01-29 13:00:24,881 [QuartzScheduler_MyQrtzScheduler-node1_1422554176832_ClusterManager] DEBUG org.quartz.impl.jdbcjobstore.JobStoreTX - ClusterManager: Check-in complete.
2015-01-29 13:01:05,862 [MyQrtzScheduler_Worker-1] INFO com.mycompany.myapp.job.Job1 - Job1 sleep-delay completed.
2015-01-29 13:01:05,864 [MyQrtzScheduler_Worker-1] INFO org.quartz.plugins.history.LoggingJobHistoryPlugin - Job node2_1422546730757.Job1 execution complete at 13:01:05 01/29/2015 and reports: SUCCESS
2015-01-29 13:01:05,865 [MyQrtzScheduler_Worker-1] INFO org.quartz.plugins.history.LoggingTriggerHistoryPlugin - Trigger node2_1422546730757.Job1Trigger completed firing job node2_1422546730757.Job1 at 13:01:05 01/29/2015 with resulting trigger instruction code: DO NOTHING
2015-01-29 13:01:05,868 [MyQrtzScheduler_Worker-1] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore - Lock 'TRIGGER_ACCESS' is desired by: MyQrtzScheduler_Worker-1
2015-01-29 13:01:05,869 [MyQrtzScheduler_Worker-1] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore - Lock 'TRIGGER_ACCESS' is being obtained: MyQrtzScheduler_Worker-1
2015-01-29 13:01:05,872 [MyQrtzScheduler_Worker-1] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore - Lock 'TRIGGER_ACCESS' given to: MyQrtzScheduler_Worker-1
2015-01-29 13:01:05,880 [MyQrtzScheduler_Worker-1] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore - Lock 'TRIGGER_ACCESS' returned by: MyQrtzScheduler_Worker-1
2015-01-29 13:01:05,915 [MyQrtzScheduler_QuartzSchedulerThread] DEBUG org.quartz.core.QuartzSchedulerThread - batch acquisition of 1 triggers
2015-01-29 13:01:05,917 [MyQrtzScheduler_QuartzSchedulerThread] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore - Lock 'TRIGGER_ACCESS' is desired by: MyQrtzScheduler_QuartzSchedulerThread
2015-01-29 13:01:05,918 [MyQrtzScheduler_QuartzSchedulerThread] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore - Lock 'TRIGGER_ACCESS' is being obtained: MyQrtzScheduler_QuartzSchedulerThread
2015-01-29 13:01:05,921 [MyQrtzScheduler_QuartzSchedulerThread] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore - Lock 'TRIGGER_ACCESS' given to: MyQrtzScheduler_QuartzSchedulerThread
2015-01-29 13:01:05,954 [MyQrtzScheduler_QuartzSchedulerThread] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore - Lock 'TRIGGER_ACCESS' returned by: MyQrtzScheduler_QuartzSchedulerThread
2015-01-29 13:01:05,955 [MyQrtzScheduler_QuartzSchedulerThread] DEBUG org.quartz.simpl.PropertySettingJobFactory - Producing instance of Job 'node1_1422543657050.Job2', class=com.mycompany.myapp.jobs.Job2
2015-01-29 13:01:05,961 [MyQrtzScheduler_Worker-1] INFO org.quartz.plugins.history.LoggingTriggerHistoryPlugin - Trigger node1_1422543657050.Job2Trigger fired job node1_1422543657050.Job2 at: 13:01:05 01/29/2015
2015-01-29 13:01:05,962 [MyQrtzScheduler_Worker-1] INFO org.quartz.plugins.history.LoggingJobHistoryPlugin - Job node1_1422543657050.Job2 fired (by trigger node1_1422543657050.Job2Trigger) at: 13:01:05 01/29/2015
2015-01-29 13:01:05,963 [MyQrtzScheduler_Worker-1] DEBUG org.quartz.core.JobRunShell - Calling execute on job node1_1422543657050.Job2
2015-01-29 13:01:05,963 [MyQrtzScheduler_Worker-1] WARN com.mycompany.myapp.jobs.Job2 - No outbound files found; Outbound File SLA Job cannot check for SLA breaches.
2015-01-29 13:01:05,965 [MyQrtzScheduler_Worker-1] INFO org.quartz.plugins.history.LoggingJobHistoryPlugin - Job node1_1422543657050.Job2 execution complete at 13:01:05 01/29/2015 and reports: null
2015-01-29 13:01:05,966 [MyQrtzScheduler_Worker-1] INFO org.quartz.plugins.history.LoggingTriggerHistoryPlugin - Trigger node1_1422543657050.Job2Trigger completed firing job node1_1422543657050.Job2 at 13:01:05 01/29/2015 with resulting trigger instruction code: DO NOTHING
The following answer was given by the OP.
The problem was that I was defining Quartz jobs with identities that have a unique group id (the scheduler id) instead of a group id common to all hosts in the cluster. Since the scheduler id is unique to the host, each host in the cluster would check whether that job already existed using the fully qualified job name groupId.jobName and, sure enough, find that it didn't, so it would create a new instance of Job1 and Job2 during startup. Quartz jobs/triggers are never expired or cleared without an explicit request in Java or a manual SQL statement in Oracle. So over time the instances would build up, and instead of Quartz running a single instance of Job1 and Job2, it would run all the instances of each job that had been created over time (hence the multiple executions and multiple email alerts).
The solution is to replace schedulerId with a static string such as "MyQuartzJobs" when defining a job's identity.
Basically, I changed the following Java code:
JobDetail job =
    newJob(Job1.class).withIdentity(JOB1_JOB_NAME, uniqueSchedulerId)
        .withDescription(JOB1_DESC + " created [" + new Date() + "]")
        .storeDurably(false)
        .requestRecovery(false)
        .build();
to something like the following:
JobDetail job =
    newJob(Job1.class).withIdentity(JOB1_JOB_NAME, "MyQuartzJobs")
        .withDescription(JOB1_DESC + " created [" + new Date() + "]")
        .storeDurably(false)
        .requestRecovery(false)
        .build();
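For completeness, a hedged sketch of how the job and its trigger might be registered with the shared group name, so that every node finds the existing entry in the clustered job store instead of creating its own. The job name, cron expression, and checkExists guard are illustrative and not taken from the original code:

import static org.quartz.CronScheduleBuilder.cronSchedule;
import static org.quartz.JobBuilder.newJob;
import static org.quartz.TriggerBuilder.newTrigger;

import org.quartz.JobDetail;
import org.quartz.JobKey;
import org.quartz.Scheduler;
import org.quartz.Trigger;

import com.mycompany.myapp.job.Job1;

public class JobRegistrar {
    private static final String GROUP = "MyQuartzJobs"; // shared by all nodes

    // Register Job1 once for the whole cluster; later nodes see it already exists.
    public static void registerJob1(Scheduler scheduler) throws Exception {
        JobKey key = new JobKey("Job1", GROUP); // illustrative job name
        if (scheduler.checkExists(key)) {
            return; // another node already registered it in the shared job store
        }
        JobDetail job = newJob(Job1.class)
                .withIdentity(key)
                .storeDurably(false)
                .requestRecovery(false)
                .build();
        Trigger trigger = newTrigger()
                .withIdentity("Job1Trigger", GROUP)
                .withSchedule(cronSchedule("0 0/10 * * * ?")) // every 10 minutes
                .build();
        scheduler.scheduleJob(job, trigger);
    }
}

With a shared group name the JobKey is identical on every node, so only the first node to start actually inserts the job and trigger rows; the others skip registration and simply compete for the trigger at fire time, which is the normal clustered-Quartz behavior.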
