I am running a local YARN cluster with 8 vCores and 8 GB of total memory.
The workflow is as such:
YarnClient submits an app request that starts the AppMaster in a container.
The AppMaster starts, creates an amRMClient and an nmClient, registers itself with the RM, and then creates 4 container requests for worker threads via amRMClient.addContainerRequest.
Even though there are enough resources available, containers are not allocated (the callback's onContainersAllocated method is never called). I tried inspecting the NodeManager's and ResourceManager's logs and I don't see any lines related to the container requests. I followed the Apache docs closely and can't understand what I'm doing wrong.
For reference here is the AppMaster code:
@Override
public void run() {
Map<String, String> envs = System.getenv();
String containerIdString = envs.get(ApplicationConstants.Environment.CONTAINER_ID.toString());
if (containerIdString == null) {
// container id should always be set in the env by the framework
throw new IllegalArgumentException("ContainerId not set in the environment");
}
ContainerId containerId = ConverterUtils.toContainerId(containerIdString);
ApplicationAttemptId appAttemptID = containerId.getApplicationAttemptId();
LOG.info("Starting AppMaster Client...");
YarnAMRMCallbackHandler amHandler = new YarnAMRMCallbackHandler(allocatedYarnContainers);
// TODO: get heartbeat interval from config instead of the hard-coded 1000
amClient = AMRMClientAsync.createAMRMClientAsync(1000, this);
amClient.init(config);
amClient.start();
LOG.info("Starting AppMaster Client OK");
//YarnNMCallbackHandler nmHandler = new YarnNMCallbackHandler();
containerManager = NMClient.createNMClient();
containerManager.init(config);
containerManager.start();
// Get port, url information. TODO: get tracking url
String appMasterHostname = NetUtils.getHostname();
String appMasterTrackingUrl = "/progress";
// Register self with ResourceManager. This will start heart-beating to the RM
RegisterApplicationMasterResponse response = null;
LOG.info("Register AppMaster on: " + appMasterHostname + "...");
try {
response = amClient.registerApplicationMaster(appMasterHostname, 0, appMasterTrackingUrl);
} catch (YarnException | IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
return;
}
LOG.info("Register AppMaster OK");
// Dump out information about cluster capability as seen by the resource manager
int maxMem = response.getMaximumResourceCapability().getMemory();
LOG.info("Max mem capabililty of resources in this cluster " + maxMem);
int maxVCores = response.getMaximumResourceCapability().getVirtualCores();
LOG.info("Max vcores capabililty of resources in this cluster " + maxVCores);
containerMemory = Integer.parseInt(config.get(YarnConfig.YARN_CONTAINER_MEMORY_MB));
containerCores = Integer.parseInt(config.get(YarnConfig.YARN_CONTAINER_CPU_CORES));
// A resource ask cannot exceed the max.
if (containerMemory > maxMem) {
LOG.info("Container memory specified above max threshold of cluster."
+ " Using max value." + ", specified=" + containerMemory + ", max="
+ maxMem);
containerMemory = maxMem;
}
if (containerCores > maxVCores) {
LOG.info("Container virtual cores specified above max threshold of cluster."
+ " Using max value." + ", specified=" + containerCores + ", max=" + maxVCores);
containerCores = maxVCores;
}
List<Container> previousAMRunningContainers = response.getContainersFromPreviousAttempts();
LOG.info("Received " + previousAMRunningContainers.size()
+ " previous AM's running containers on AM registration.");
for (int i = 0; i < 4; ++i) {
ContainerRequest containerAsk = setupContainerAskForRM();
amClient.addContainerRequest(containerAsk); // NOTHING HAPPENS HERE...
LOG.info("Available resources: " + amClient.getAvailableResources().toString());
}
while(completedYarnContainers != 4) {
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
LOG.info("Done with allocation!");
}
@Override
public void onContainersAllocated(List<Container> containers) {
LOG.info("Got response from RM for container ask, allocatedCnt=" + containers.size());
for (Container container : containers) {
LOG.info("Allocated yarn container with id: {}" + container.getId());
allocatedYarnContainers.push(container);
// TODO: Launch the container in a thread
}
}
@Override
public void onError(Throwable error) {
LOG.error(error.getMessage());
}
@Override
public float getProgress() {
return (float) completedYarnContainers / allocatedYarnContainers.size();
}
Here is output from jps:
14594 NameNode
15269 DataNode
17975 Jps
14666 ResourceManager
14702 NodeManager
And here is AppMaster log for initialization and 4 container requests:
23:47:09 YarnAppMaster - Starting AppMaster Client OK
23:47:09 YarnAppMaster - Register AppMaster on: andrei-mbp.local/192.168.1.4...
23:47:09 YarnAppMaster - Register AppMaster OK
23:47:09 YarnAppMaster - Max mem capability of resources in this cluster 2048
23:47:09 YarnAppMaster - Max vcores capability of resources in this cluster 2
23:47:09 YarnAppMaster - Received 0 previous AM's running containers on AM registration.
23:47:11 YarnAppMaster - Requested container ask: Capability[<memory:512, vCores:1>]Priority[0]
23:47:11 YarnAppMaster - Available resources: <memory:7680, vCores:0>
23:47:11 YarnAppMaster - Requested container ask: Capability[<memory:512, vCores:1>]Priority[0]
23:47:11 YarnAppMaster - Available resources: <memory:7680, vCores:0>
23:47:11 YarnAppMaster - Requested container ask: Capability[<memory:512, vCores:1>]Priority[0]
23:47:11 YarnAppMaster - Available resources: <memory:7680, vCores:0>
23:47:11 YarnAppMaster - Requested container ask: Capability[<memory:512, vCores:1>]Priority[0]
23:47:11 YarnAppMaster - Available resources: <memory:7680, vCores:0>
23:47:11 YarnAppMaster - Progress indicator should not be negative
Thanks in advance.
I suspect the problem comes exactly from the negative progress:
23:47:11 YarnAppMaster - Progress indicator should not be negative
Note that, since you are using the AMRMClientAsync, requests are not made immediately when you call addContainerRequest. There is actually a heartbeat function that runs periodically, and it is in this function that allocate is called and the pending requests are made. The progress value used by this function initially starts at 0 but is updated with the value returned by your handler once a response from the acquire is obtained.
The first acquire is supposedly done right after the register, so the getProgress function is called then to update the existing progress. As it is, your progress will be updated to NaN because, at this point, allocatedYarnContainers is empty and completedYarnContainers is also 0, so the returned progress is the result of 0/0, which is undefined. When the next allocate checks your progress value, it fails because NaN returns false in all comparisons, and so no further allocate call ever communicates with the ResourceManager: it quits right at that first step with an exception.
Try changing your progress function to the following:
@Override
public float getProgress() {
return (float) allocatedYarnContainers.size() / 4.0f;
}
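If you would rather keep reporting the real completion ratio once containers are running, a guarded variant avoids the 0/0 case entirely. This is only a sketch using the fields from the question:
@Override
public float getProgress() {
    int allocated = allocatedYarnContainers.size();
    if (allocated == 0) {
        return 0.0f; // before the first allocation, report zero instead of 0/0 = NaN
    }
    return (float) completedYarnContainers / allocated;
}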
(note: copied to StackOverflow for posterity from here)
Thanks to Alexandre Fonseca for pointing out that getProgress() returns NaN due to a division by zero when it's called before the first allocation, which makes the ResourceManager quit immediately with an exception.
Read more about it here.
The DropwizardMetricServices#submit() method I'm using doesn't submit the gauge metric a second time.
My use case is to remove the gauge metric from JMX after reading it, and my application may later send the same metric again (with a different value).
The first time, the gauge metric is submitted successfully (and my application removes it once it has read the metric). But the same metric is not submitted the second time.
So I'm a bit confused: what would be the reason for DropwizardMetricServices#submit() not working the second time?
Below is the code:
Submit metric:
private void submitNonSparseMetric(final String metricName, final long value) {
validateMetricName(metricName);
metricService.submit(metricName, value); // metricService is the DropwizardMetricServices
log(metricName, value);
LOGGER.debug("Submitted the metric {} to JMX", metricName);
}
Code that reads and removes the metric:
protected void collectMetrics() {
// Create the connection
Long currTime = System.currentTimeMillis()/1000; // Graphite needs epoch time in seconds
Socket connection = createConnection();
if (connection == null){
return;
}
// Get the output stream
DataOutputStream outputStream = getDataOutputStream(connection);
if (outputStream == null){
closeConnection();
return;
}
// Get metrics from JMX
Map<String, Gauge> g = metricRegistry.getGauges(); // metricRegistry is com.codahale.metrics.MetricRegistry
for(Entry<String, Gauge> e : g.entrySet()){
String key = e.getKey();
if(p2cMetric(key)){
String metricName = convertToMetricStandard(key);
String metricValue = String.valueOf(e.getValue().getValue());
String metricToSend = String.format("%s %s %s\n", metricName, metricValue, currTime);
try {
writeToStream(outputStream, metricToSend);
// Remove the metric from JMX after successfully sending metric to graphite
removeMetricFromJMX(key);
} catch (IOException e1) {
LOGGER.error("Unable to send metric to Graphite - {}", e1.getMessage());
}
}
}
closeOutputStream();
closeConnection();
}
I think I found the issue.
As per the DropwizardMetricServices doc (https://docs.spring.io/spring-boot/docs/current/api/org/springframework/boot/actuate/metrics/dropwizard/DropwizardMetricServices.html#submit-java.lang.String-double-),
the submit() method "Set[s] the specified gauge value".
So I think it's recommended to use the DropwizardMetricServices#submit() method only to set the values of existing gauge metrics in JMX, and not to add new metrics to JMX.
So, once I replaced DropwizardMetricServices#submit() with the MetricRegistry#register() (com.codahale.metrics.MetricRegistry) method to submit all my metrics, it worked as expected and my metrics are re-added to JMX (after they were removed by my application).
But I'm just wondering what makes DropwizardMetricServices#submit() add only new metrics to JMX and not a metric that has already been removed (from JMX). Does DropwizardMetricServices cache (in memory) all the metrics submitted to JMX, and is that what makes DropwizardMetricServices#submit() not resubmit the metric?
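For reference, here is roughly what the register()-based approach can look like. This is only a sketch: the submitGauge/readAndRemove helpers and the metric name are made up for illustration, but remove(), register() and getGauges() are the standard com.codahale.metrics.MetricRegistry API; register() throws IllegalArgumentException if the name already exists, which is why any stale entry is removed first:
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

public class GaugeResubmitSketch {

    private final MetricRegistry metricRegistry = new MetricRegistry();

    // (Re-)register a gauge holding a fixed value; safe to call repeatedly.
    void submitGauge(String metricName, long value) {
        metricRegistry.remove(metricName); // no-op if the metric is absent
        metricRegistry.register(metricName, (Gauge<Long>) () -> value);
    }

    // Read the gauge and drop it, mimicking the collector in the question.
    void readAndRemove(String metricName) {
        Gauge<?> gauge = metricRegistry.getGauges().get(metricName);
        if (gauge != null) {
            System.out.println(metricName + " = " + gauge.getValue());
            metricRegistry.remove(metricName);
        }
    }

    public static void main(String[] args) {
        GaugeResubmitSketch sketch = new GaugeResubmitSketch();
        sketch.submitGauge("sample.metric", 42L);  // first submit
        sketch.readAndRemove("sample.metric");     // collected and removed
        sketch.submitGauge("sample.metric", 43L);  // second submit is visible again
        sketch.readAndRemove("sample.metric");
    }
}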
I'm having a weird behavior with the isReachable method of InetAddress class.
Method prototype is :
public boolean isReachable(int timeout)
When using a timeout > 1500 (ms), the method waits the exact time
given as argument (if the target IP is not reachable of course...).
When using timeout < 1500, the method waits 1000ms maximum...
The code is quite simple :
InetAddress addr = null;
String ip = "10.48.2.169";
try {
addr = InetAddress.getByName(ip);
} catch (UnknownHostException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Timestamp s = new Timestamp(System.currentTimeMillis());
System.out.println(s + "\t Starting tests :");
pingTest(addr, 100);
pingTest(addr, 500);
pingTest(addr, 1000);
pingTest(addr, 1500);
pingTest(addr, 2000);
pingTest(addr, 2500);
Where pingTest is defined by :
public static void pingTest(InetAddress addr, int timeout) {
boolean result = false;
try {
result = addr.isReachable(timeout);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Timestamp s = new Timestamp(System.currentTimeMillis());
System.out.println(s + "\t (" + timeout + ") " + addr.toString() + " " + result);
}
Then the output is :
2017-09-07 16:45:41.573 Starting tests :
2017-09-07 16:45:42.542 (100) /10.48.2.169 false
2017-09-07 16:45:43.542 (500) /10.48.2.169 false
2017-09-07 16:45:44.541 (1000) /10.48.2.169 false
2017-09-07 16:45:46.041 (1500) /10.48.2.169 false
2017-09-07 16:45:48.041 (2000) /10.48.2.169 false
2017-09-07 16:45:50.541 (2500) /10.48.2.169 false
So the question is: is there a minimum timeout for the InetAddress isReachable method? (Something like 1500 ms judging by my tests, but I doubt it, that would be a huge minimum timeout...)
Or maybe I just made a huge mistake that I'm still missing...
Tell me if this isn't clear enough.
Thanks for your help and thoughts.
First you should notice that the behavior of InetAddress.isReachable is not the same on every platform supported by Java. I will assume you are working on Windows.
When undocumented behavior happens you should always look at the sources if they are available. The java.net implementation for Windows is here for the OpenJDK (it should be quite similar for the Oracle JVM, but I am not sure of this).
What we saw in the isReachable method implementation is:
they don't rely on ping because they find the Windows ICMP protocol implementation too unreliable
they pass the timeout value to the NET_Wait function
So the isReachable method doesn't perform a ping, and we need to check what NET_Wait does with the timeout to understand why a timeout of less than one second isn't possible.
The NET_Wait function is defined here: src/windows/native/java/net/net_util_md.c
It consists of an infinite loop which breaks when one of these events occurs during the select function call:
NET_WAIT_CONNECT on the socket file descriptor (socket is connected to the remote host)
The timeout ends
The select function is documented in a man page you may consult here. This man page tells us that the timeout can "be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount".
This is why there is no guarantee on the minimal timeout value. Also, I think the documentation doesn't state any minimal timeout value because the implementation differs across the OSs supported by the JVM.
Hope this helps you understand why.
However, to achieve the timeout you want, you may test the reachability in a separate task: wait until the task returns the result, and if you wait longer than your timeout, cancel the task or ignore its result. A sketch of this approach follows.
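For illustration, a minimal sketch of that separate-task approach (the IP address is the one from the question; the single-thread executor and the helper name are just placeholders):
import java.net.InetAddress;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ReachabilitySketch {

    private static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor();

    // Enforce our own upper bound: give up after timeoutMillis even if
    // isReachable() internally decides to block for longer.
    static boolean isReachableWithin(InetAddress addr, int timeoutMillis) {
        Future<Boolean> future = EXECUTOR.submit(() -> addr.isReachable(timeoutMillis));
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // stop waiting and treat the host as unreachable
            return false;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        InetAddress addr = InetAddress.getByName("10.48.2.169");
        System.out.println("(500) reachable: " + isReachableWithin(addr, 500));
        EXECUTOR.shutdownNow();
    }
}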
I'm wondering if anyone has experienced the same problem.
We have a Vert.x application and in the end its purpose is to insert 600 million rows into a Cassandra cluster. We are testing the speed of Vert.x in combination with Cassandra by doing tests with smaller amounts.
If we run the fat jar (built with the Shade plugin) without the -cluster option, we are able to insert 10 million records in about a minute. When we add the -cluster option (eventually we will run the Vert.x application in a cluster) it takes about 5 minutes for 10 million records to be inserted.
Does anyone know why?
We know that the Hazelcast config creates some overhead, but we never thought it would be 5 times slower. This implies we would need 5 EC2 instances in a cluster to get the same result as 1 EC2 instance without the cluster option.
As mentioned, everything runs on EC2 instances:
2 Cassandra servers on t2.small
1 Vert.x server on t2.2xlarge
You are actually running into corner cases of the Vert.x Hazelcast Cluster manager.
First of all, you are using a worker verticle to send your messages (30000001). Under the hood Hazelcast is blocking, and version 3.3.3 does not take into account that you send a message from a worker. Recently we added this fix https://github.com/vert-x3/issues/issues/75 (not present in 3.4.0.Beta1 but present in 3.4.0-SNAPSHOTS) that will improve this case.
Second, when you send all your messages at the same time, it runs into another corner case that prevents the Hazelcast cluster manager from using a cache of the cluster topology. This topology cache is usually updated after the first message has been sent, and sending all the messages in one shot prevents the usage of the cache (short explanation: HazelcastAsyncMultiMap#getInProgressCount will be > 0 and prevents the cache from being used), hence paying the penalty of an expensive lookup (hence the cache).
If I use Bertjan's reproducer with 3.4.0-SNAPSHOT + Hazelcast and the following change (send one message to the destination, wait for the reply, and only upon the reply send all the remaining messages), then I get a lot of improvement.
Without clustering : 5852 ms
With clustering with HZ 3.3.3: 16745 ms
With clustering with HZ 3.4.0-SNAPSHOT + initial message : 8609 ms
I also believe you should not use a worker verticle to send that many messages; instead, send them from an event-loop verticle in batches (a sketch follows below). Perhaps you should explain your use case and we can think about the best way to solve it.
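For illustration, a rough sketch of a batched, event-loop-based sender. The batch size is arbitrary; TestCluster1 and the "clustertest1" address come from the question's code (shown further below):
import io.vertx.core.AbstractVerticle;
import io.vertx.core.json.Json;
import java.time.LocalDateTime;

public class BatchedProviderVerticle extends AbstractVerticle {

    private static final int TOTAL = 30_000_000;
    private static final int BATCH_SIZE = 10_000;

    @Override
    public void start() {
        sendBatch(1);
    }

    private void sendBatch(int from) {
        int to = Math.min(from + BATCH_SIZE, TOTAL + 1);
        for (int i = from; i < to; i++) {
            vertx.eventBus().send("clustertest1",
                    Json.encode(new TestCluster1(i, "abc", LocalDateTime.now())));
        }
        if (to <= TOTAL) {
            // Yield back to the event loop before sending the next batch so
            // other handlers (and the cluster manager) get a chance to run.
            vertx.runOnContext(v -> sendBatch(to));
        }
    }
}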
When you enable clustering (of any kind) in an application, you are making your application more resilient to failures, but you're also adding a performance penalty.
For example your current flow (without clustering) is something like:
client ->
vert.x app ->
in-memory same-process eventbus (negligible) ->
handler -> cassandra
<- vert.x app
<- client
Once you enable clustering:
client ->
vert.x app ->
serialize request ->
network request cluster member ->
deserialize request ->
handler -> cassandra
<- serialize response
<- network reply
<- deserialize response
<- vert.x app
<- client
As you can see there are many encode/decode operations required, plus several network calls, and this all gets added to your total request time.
In order to achieve the best performance you need to take advantage of locality: the closer you are to your data store, the faster it usually is.
Just to add the code of the project. I guess that would help.
Sender verticle:
public class ProviderVerticle extends AbstractVerticle {
@Override
public void start() throws Exception {
IntStream.range(1, 30000001).parallel().forEach(i -> {
vertx.eventBus().send("clustertest1", Json.encode(new TestCluster1(i, "abc", LocalDateTime.now())));
});
}
@Override
public void stop() throws Exception {
super.stop();
}
}
And the inserter verticle:
public class ReceiverVerticle extends AbstractVerticle {
private int messagesReceived = 1;
private Session cassandraSession;
@Override
public void start() throws Exception {
PoolingOptions poolingOptions = new PoolingOptions()
.setCoreConnectionsPerHost(HostDistance.LOCAL, 2)
.setMaxConnectionsPerHost(HostDistance.LOCAL, 3)
.setCoreConnectionsPerHost(HostDistance.REMOTE, 1)
.setMaxConnectionsPerHost(HostDistance.REMOTE, 3)
.setMaxRequestsPerConnection(HostDistance.LOCAL, 20)
.setMaxQueueSize(32768)
.setMaxRequestsPerConnection(HostDistance.REMOTE, 20);
Cluster cluster = Cluster.builder()
.withPoolingOptions(poolingOptions)
.addContactPoints(ClusterSetup.SEEDS)
.build();
System.out.println("Connecting session");
cassandraSession = cluster.connect("kiespees");
System.out.println("Session connected:\n\tcluster [" + cassandraSession.getCluster().getClusterName() + "]");
System.out.println("Connected hosts: ");
cassandraSession.getState().getConnectedHosts().forEach(host -> System.out.println(host.getAddress()));
PreparedStatement prepared = cassandraSession.prepare(
"insert into clustertest1 (id, value, created) " +
"values (:id, :value, :created)");
PreparedStatement preparedTimer = cassandraSession.prepare(
"insert into timer (name, created_on, amount) " +
"values (:name, :createdOn, :amount)");
BoundStatement timerStart = preparedTimer.bind()
.setString("name", "clusterteststart")
.setInt("amount", 0)
.setTimestamp("createdOn", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(timerStart);
EventBus bus = vertx.eventBus();
System.out.println("Bus info: " + bus.toString());
MessageConsumer<String> cons = bus.consumer("clustertest1");
System.out.println("Consumer info: " + cons.address());
System.out.println("Waiting for messages");
cons.handler(message -> {
TestCluster1 tc = Json.decodeValue(message.body(), TestCluster1.class);
if (messagesReceived % 100000 == 0)
System.out.println("Message received: " + messagesReceived);
BoundStatement boundRecord = prepared.bind()
.setInt("id", tc.getId())
.setString("value", tc.getValue())
.setTimestamp("created", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(boundRecord);
if (messagesReceived % 100000 == 0) {
BoundStatement timerStop = preparedTimer.bind()
.setString("name", "clusterteststop")
.setInt("amount", messagesReceived)
.setTimestamp("createdOn", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(timerStop);
}
messagesReceived++;
//message.reply("OK");
});
}
@Override
public void stop() throws Exception {
super.stop();
cassandraSession.close();
}
}
I'm experiencing java.net.ConnectException in random ways.
My servlet runs in Tomcat 6.0 (JDK 1.6).
The servlet periodically fetches data from 4-5 third-party web servers.
The servlet uses a ScheduledExecutorService to fetch the data.
Run locally, all is fine and dandy. Run on my prod server, I see semi-random failures to fetch data from 1 of the third parties (Canadian weather data).
These are the URLs that are failing (plain RSS feeds):
http://weather.gc.ca/rss/city/pe-1_e.xml
http://weather.gc.ca/rss/city/pe-2_e.xml
http://weather.gc.ca/rss/city/pe-3_e.xml
http://weather.gc.ca/rss/city/pe-4_e.xml
http://weather.gc.ca/rss/city/pe-5_e.xml
http://weather.gc.ca/rss/city/pe-6_e.xml
http://meteo.gc.ca/rss/city/pe-1_f.xml
http://meteo.gc.ca/rss/city/pe-2_f.xml
http://meteo.gc.ca/rss/city/pe-3_f.xml
http://meteo.gc.ca/rss/city/pe-4_f.xml
http://meteo.gc.ca/rss/city/pe-5_f.xml
http://meteo.gc.ca/rss/city/pe-6_f.xml
Strange: each cycle, when I periodically fetch this data, the success/fail is all over the map: some succeed, some fail, but it never seems to be the same twice. So, I'm not completely blocked, just randomly blocked.
I slowed down my fetches, by introducing a 61s pause between each one. That had no effect.
The guts of the code that does the actual fetch:
private static final int TIMEOUT = 60*1000; //msecs
private static final String END_OF_INPUT = "\\Z"; //matches end of input, so scanner.next() reads the whole stream (assumed definition; not shown in the original snippet)
public String fetch(String aURL, String aEncoding /*UTF-8*/) {
String result = "";
long start = System.currentTimeMillis();
Scanner scanner = null;
URLConnection connection = null;
try {
URL url = new URL(aURL);
connection = url.openConnection(); //this doesn't talk to the network yet
connection.setConnectTimeout(TIMEOUT);
connection.setReadTimeout(TIMEOUT);
connection.connect(); //actually connects; this shouldn't be needed here
scanner = new Scanner(connection.getInputStream(), aEncoding);
scanner.useDelimiter(END_OF_INPUT);
result = scanner.next();
}
catch (IOException ex) {
long end = System.currentTimeMillis();
long time = end - start;
fLogger.severe(
"Problem connecting to " + aURL + " Encoding:" + aEncoding +
". Exception: " + ex.getMessage() + " " + ex.toString() + " Cause:" + ex.getCause() +
" Connection Timeout: " + connection.getConnectTimeout() + "msecs. Read timeout:" +
connection.getReadTimeout() + "msecs."
+ " Time taken to fail: " + time + " msecs."
);
}
finally {
if (scanner != null) scanner.close();
}
return result;
}
Example log entry showing a failure:
SEVERE: Problem connecting to http://weather.gc.ca/rss/city/pe-5_e.xml Encoding:UTF-8.
Exception: Connection timed out java.net.ConnectException: Connection timed out
Cause:null
Connection Timeout: 60000msecs.
Read timeout:60000msecs.
Time taken to fail: 15028 msecs.
Note that the time to fail is always 15s + a tiny amount.
Also note that it fails to reach the configured 60s timeout for the connection.
The host-server admins (Environment Canada) state that they don't have any kind of a blacklist for the IP address of misbehaving clients.
Also important: the code had been running for several months without this happening.
Someone suggested that instead I should use curl, a bash script, and cron. I implemented that, and it works fine.
I'm not able to solve this problem using Java.
We have a problem. Our customers are complaining that they are getting duplicate emails in their in-box. Some days up to 5 or 6 instances of the exact same email at the exact same time. We don't understand why. The code has been re-written at least once but the problem persists.
I'll try to explain this... but it's a bit complicated :O(
Every night (early morning) we want to send our users a daily report containing usage stats. So we have a cron job:
<cron>
<url>/redacted/report/url</url>
<description>Send out daily reports to active subscribers</description>
<schedule>every 2 hours</schedule>
</cron>
The cron job hits the servlet's GET method:
protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
AccountFilter filter = AccountFilter.forWebSafeName(req.getParameter("filter"));
createTasks(filter, null);
}
Which calls the createTasks method with a null cursor:
private void createTasks(AccountFilter accountFilter, String cursor) {
try {
PagedResults<Account> pagedAccounts = accountRepository.getAccounts(accountFilter.getFilter(), 50, cursor);
createTaskBatch(pagedAccounts);
// If there are still more results in cursor, then send cursor back to this servlet's doPost method so we don't hit the request time limit
if (pagedAccounts.getCursor() != null) {
getQueue(QUEUE_NAME).add(withUrl(WORKER_URL).param(CURSOR_KEY, pagedAccounts.getCursor()).param(FILTER_KEY, accountFilter.getWebSafeName()));
}
} catch(Exception ex) {
logger.log(Level.WARNING, "Problem creating daily report task batch for filter " + accountFilter.getWebSafeName(), ex);
}
}
which grabs 50 accounts and iterates over them creating new queued jobs for the emails that should be sent at this time. There is code to explicitly check the last report sent timestamp and update the timestamp BEFORE creating the new queued task. This should err on the side of not sending the report rather than sending duplicates:
private void createTaskBatch(PagedResults<Account> pagedAccounts) {
// GAE datastore query might return duplicate results?!
List<Account> list = pagedAccounts.getResults();
Set<Account> noDuplicates = new HashSet<>(list);
int dups = list.size() - noDuplicates.size();
if ( dups > 0 ){
logger.warning ("Accounts paged results contained " + dups + " duplicates!");
}
for (Account account : noDuplicates) {
try {
if (lastReportSentOver12HoursAgo(account)) {
List<Parent> parents = parentRepository.getVerifiedParentsForAccount(account.getId());
if (eitherParentSubscribed(parents)) {
List<AccountUser> users = accountUserRepository.listUsers(account.getId());
List<Device> devices = getUserDevices(account, users);
if (!devices.isEmpty()) {
DateTimeZone tz = getMostCommonTimezone(devices);
if ( null == tz ){
logger.warning("No timezone found for account: " + account.getId() );
}
else{
// Send early in the morning as the report contains the previous day's stats
if (now(tz).getHourOfDay() < 7) {
// mark sent now because queue might not be processed for a while
// and the next cursor set might contain some of the same accounts
accountRepository.markReportSent(account.getId(), now());
getQueue(QUEUE_NAME).add(withUrl(DailyReportServlet.WORKER_URL).param(DailyReportServlet.ACCOUNT_ID, account.getId()).param(DailyReportServlet.COMMON_TIMEZONE, tz.getID()));
}
}
}
}
}
} catch(Exception ex) {
logger.log(Level.WARNING, "Problem creating daily report task for " + account.getId(), ex);
}
}
}
The servlet POST method takes care of handling the follow-up pages of results via the cursor mechanism:
public void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
AccountFilter accountFilter = AccountFilter.forWebSafeName(req.getParameter(FILTER_KEY));
logger.log(Level.INFO, "doPost hit from task queue with filter " + accountFilter.getWebSafeName());
String cursor = req.getParameter(CURSOR_KEY);
createTasks(accountFilter, cursor);
}
There is another servlet that handles each report task and it just creates the email contents and calls send on the com.sendgrid.SendGrid class.
The eventual consistency in Datastore seems a likely candidate but that should be resolved within a few seconds and I don't see how that would account for both the number of customers complaining and the number of duplicates that some customers see.
Help! Any ideas? Are we being dumb somewhere?
UPDATED
For clarity... the email send task queue ends up in this method which does catch exceptions and reports them back to us. We don't see an exception for the duplicate cases:
private void sendReport(Account account, DateTimeZone tz) throws IOException, EntityNotFoundException {
try {
boolean sent = false;
Map<String, Object> root = buildEmailData(account, tz);
for (Parent parent : parentRepository.getVerifiedParentsForAccount(account.getId())) {
if (parent.getEmailPreferences().isSubscribedReports()) {
emailBuilder.send(account, parent, root, "report", EmailSender.NOTIFICATION);
sent = true;
}
}
if ( sent ){
accountRepository.markReportSent(account.getId(), now());
}
} catch (Exception ex) {
String message = "Problem building report email for account " + account.getId();
logger.log(Level.WARNING, message, ex);
new TeamNotificationEvent( message + " : exception: " + ex.getMessage()).fire();
throw new IOException(message, ex);
}
}
UPDATE 2 AFTER ADDING EXTRA DEBUG LOGGING
I see two POSTs come in at the same time to the same task queue with the same cursor:
09:35:08.397 2015-04-30 200 0 B 3.78s /ws/notification/daily-report-task-creator
0.1.0.2 - - [30/Apr/2015:01:35:08 -0700] "POST /ws/notification/daily-report-task-creator HTTP/1.1" 200 0 "http://screentimelabs.appspot.com/ws/notification/daily-report-task-creator" "AppEngine-Google; (+http://code.google.com/appengine)" "screentimelabs.appspot.com" ms=3782 cpu_ms=662 queue_name=dailyReports task_name=8168414365365326983 instance=00c61b117c33a909790f0d1882657e04f40b2c7e app_engine_release=1.9.20
09:35:04.618 com.screentime.service.taskqueue.reports.DailyReportTaskCreatorServlet createTasks: createTasks called for filter: ACTIVE with cursor: E-ABAIICO2oQc35zY3JlZW50aW1lbGFic3InCxIHQWNjb3VudCIaamFybW8ua2Fya2thaW5lbkBnbWFpbC5jb20MiAIAFA
09:35:08.432 2015-04-30 200 0 B 8.84s /ws/notification/daily-report-task-creator
0.1.0.2 - - [30/Apr/2015:01:35:08 -0700] "POST /ws/notification/daily-report-task-creator HTTP/1.1" 200 0 "http://screentimelabs.appspot.com/ws/notification/daily-report-task-creator" "AppEngine-Google; (+http://code.google.com/appengine)" "screentimelabs.appspot.com" ms=8837 cpu_ms=1348 queue_name=dailyReports task_name=50170612326424582061 instance=00c61b117c2bffe8de313e96fea8aeb813f4b20f app_engine_release=1.9.20 trace_id=7e5c0348382e66cf4e2c6ba400529fb7
09:34:59.608 com.screentime.service.taskqueue.reports.DailyReportTaskCreatorServlet createTasks: createTasks called for filter: ACTIVE with cursor: E-ABAIICO2oQc35zY3JlZW50aW1lbGFic3InCxIHQWNjb3VudCIaamFybW8ua2Fya2thaW5lbkBnbWFpbC5jb20MiAIAFA
Searching for 1 particular account id I see these requests:
09:35:08.397 2015-04-30 200 0 B 3.78s /ws/notification/daily-report-task-creator
09:35:08.432 2015-04-30 200 0 B 8.84s /ws/notification/daily-report-task-creator
09:35:08.443 2015-04-30 200 0 B 6.73s /ws/notification/daily-report-task-creator
09:35:10.541 2015-04-30 200 0 B 4.03s /ws/notification/daily-report-task-creator
09:35:10.690 2015-04-30 200 0 B 11.09s /ws/notification/daily-report-task-creator
09:35:13.678 2015-04-30 200 0 B 862ms /ws/notification/daily-report-worker
09:35:13.829 2015-04-30 500 0 B 1.21s /ws/notification/daily-report-worker
09:35:14.677 2015-04-30 200 0 B 1.56s /ws/notification/daily-report-worker
09:35:14.961 2015-04-30 200 0 B 346ms /ws/notification/daily-report-worker
Some have repeated cursor values.
I will make a guess because I don't see the task queue worker code. It's likely that you are not handling errors correctly in the task queue. If a task finishes with an error, GAE will re-queue it; thus, if some emails were already sent, the task will still run again. You need a way to remember what you have already processed in the task queue so a retry won't reprocess those (see the sketch below).
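To illustrate the idea, here is a sketch of an idempotent worker built around the sendReport method from the question. The wasReportSentWithin(...) helper is hypothetical; it stands in for whatever "already sent" marker you can read back from the datastore:
private void sendReportIdempotently(Account account, DateTimeZone tz) {
    // Re-check the "already sent" marker inside the worker itself, not only
    // when the task is created, so a re-queued task becomes a no-op.
    if (accountRepository.wasReportSentWithin(account.getId(), Duration.standardHours(12))) { // hypothetical helper
        logger.info("Report already sent for account " + account.getId() + ", skipping retried task.");
        return; // respond 200 so the task queue does not retry again
    }
    try {
        sendReport(account, tz);
    } catch (Exception ex) {
        // Decide deliberately whether to re-throw: a non-2xx response makes App Engine
        // re-run the task, and if some emails already went out before the failure,
        // that re-run is exactly what produces the duplicates.
        logger.log(Level.WARNING, "Report failed for account " + account.getId(), ex);
        new TeamNotificationEvent("Report failed: " + ex.getMessage()).fire();
    }
}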