Avoid history table explosion for long-running cyclic processes? - java

I have a process which has a conditional loop iterating each 1 minute. The process itself can run for weeks, but I expect 99% of it's the history to be a repeating entries reflecting invocation of executions which are the part of the said cycle. Example:
|15:34:30.167|15:34:30.238|TimerCatchEvent|
|15:34:30.258|15:34:30.323|CheckConditionServiceTask|
|15:34:30.371|15:34:30.410|ExclusiveGateway1|
|15:34:30.457|15:34:30.501|ReturningtoTimerEventServiceTask|
|15:35:30.167|15:35:30.238|TimerCatchEvent|
|15:35:30.258|15:35:30.323|CheckConditionServiceTask|
|15:35:30.371|15:35:30.410|ExclusiveGateway1|
|15:35:30.457|15:35:30.501|ReturningtoTimerEventServiceTask|
|15:36:30.167|15:36:30.238|TimerCatchEvent|
|15:36:30.258|15:36:30.323|CheckConditionServiceTask|
|15:36:30.371|15:36:30.410|ExclusiveGateway1|
|15:36:30.457|15:36:30.501|ReturningtoTimerEventServiceTask|
Is there any way to somehow collapse these repeating history entries on the camunda level? Or maybe someone had come up with other solutions to this problem?
P.S. This is a follow-up for the question on cross-process synchronization: Cross-process synchronization in Camunda? - I have implemented what I need using post-fact timer-based "are all ready to sync?"-check.

In Camunda the History is Event Driven. It is possible to
implement a custom HistoryLevel controlling the amount of events produced and the data they contain,
implement a custom History Backend which allows you to logg the events in a different way than the default handler does.
Maybe that is useful to you?

You could also delete history entries from the database using some clean-up-script. It is safe to delete information from the history - it will not affect the runtime behavior.
Cheers
Bernd

Related

Distributed, synchronized batch processing

In our current Java project, we need to batch process a huge set of records. Once, this processing is done, it must start again and process all records again. This processing must be parallelized as well as distributed among multiple nodes.
The records itself are stored in a database. Using some id range (e.g. 1-10000) for identifying a batch would be sufficient.
From a high level perspective, I see the following steps:
A sub task processes one batch of records.
A master task checks if any sub task is still running. If not, create one sub task for each batch of records.
We use MongoDB quite heavily and thought of persisting sub tasks in it. Then, each node can pick up sub tasks that are not done yet, does the processing and marks the record as done. Once there are no undone subtasks, the master task creates all the sub tasks again. This would probably work, but we are looking for a solution in which we don't need to do the heavy synchronization work ourselves.
Could this be a possible use-case for akka?
Can akka-persistence be used to synchronize the processing among different nodes?
Are there any other Java/JVM frameworks suited for this job?
Your question is way too broad for SO's format. Plase read this guide in the future before asking, and don't ask your group members to vote your question up just to inflate what is obviously an ill-posed question ( ͡° ͜ʖ ͡°).
Anyways:
1) Yes, you can implement your requirements in Akka. In particular, since you mentioned multiple nodes, you are looking at the akka-cluster module (for inter-node communication), and you might also need akka-cluster-sharding (in case you want to keep all data in memory beside during processing).
2) No, I would strongly not reccomend that. While you could technically force your problem into using akka-persistence for synchronizing the tasks, the goal of akka-persistence is simply to make an actor's state persistent. Akka itself in its basic form is enough for handling all your synchronization issues. Simply have a master actor create a worker for every subtask and monitor its completion.
3) Yes. Note that the answer to this question is always yes no matter which job.

Concurrency : Handling multiple submits in a web application

This is a recent interview question to my friend:
How would you handle a situation where users enter some data in the screen and let's say 5 of them clicked on the Submit button *the SAME time ?*
(By same time,the interviewer insisted that they are same to the level of nanoseconds)
My answer was just to make the method that handles the request synchronized and only one request can acquire the lock on the method at a given time.
But it looks like the interviewer kept insisting there was a "better way" to handle it .
One other approach to handle locking at the database level, but I don't think it is "better".
Are there any other approaches. This seems to be a fairly common problem.
If you have only one network card, you can only have one request coming down it at once. ;)
The answer he is probably looking for is something like
Make the servlet stateless so they can be executed concurrently.
Use components which allow thread safe concurrent access like Atomic* or Concurrent*
Use locks only where you obsolutely have to.
What I prefer to do is to make the service so fast it can respond before the next resquest can come in. ;) Though I don't have the overhead of Java EE or databases to worry about.
Does it matter that they click at the same time e.g. are they both updating the same record on a database?
A synchronized method will not cut it, especially if it's a webapp distributed amongst multiple JVMs. Also the synchronized method may block, but then the other threads would just fire after the first completes and you'd have lost writes.
So locking at database level seems to be the option here i.e. if the record has been updated, report an error back to the users whose updates were serviced after the first.
You do not have to worry about this as web server launches each request in isolated thread and manages it.
But if you have some shared resource like some file for logging then you need to achieve concurrency and put thread lock on it in request and inter requests

How to know when updates to the Google AppEngine HRD datastore are complete?

I have a long running job that updates 1000's of entity groups. I want to kick off a 2nd job afterwards that will have to assume all of those items have been updated. Since there are so many entity groups, I can't do it in a transaction, so i've just scheduled the 2nd job to run 15 minutes after the 1st completes using task queues.
Is there a better way?
Is it even safe to assume that 15 minutes gives a promise that the datastore is in sync with my previous calls?
I am using high replication.
In the google IO videos about HRD, they give a list of ways to deal with eventual consistency. One of them was to "accept it". Some updates (like twitter posts) don't need to be consistent with the next read. But they also said something like "hey, we're only talking miliseconds to a couple of seconds before they are consistent". Is that time frame documented anywhere else? Is it safe assuming that waiting 1 minute after a write before reading again will mean all my preivous writes are there in the read?
The mention of that is at the 39:30 mark in this video http://www.youtube.com/watch?feature=player_embedded&v=xO015C3R6dw
I don't think there is any built in way to determine if the updates are done. I would recommend adding a lastUpdated field to your entities and updating it with your first job, then check for the timestamp on the entity you're updating with the 2nd before running... kind of a hack but it should work.
Interested to see if anybody has a better solution. Kinda hope they do ;-)
This is automatic as long as you are getting entities without changing the consistency to Eventual. The HRD puts data to a majority of relevant datastore servers before returning. If you are calling the asynchronous version of put, you'll need to call get on all the Future objects before you can be sure it's completed.
If however you are querying for the items in the first job, there's no way to be sure that the index has been updated.
So for example...
If you are updating a property on every entity (but not creating any entities), then retrieving all entities of that kind. You can do a keys-only query followed by a batch get (which is approximately as fast/cheap as doing a normal query) and be sure that you have all updates applied.
On the other hand, if you're adding new entities or updating a property in the first process that the second process queries, there's no way to be sure.
I did find this statement:
With eventual consistency, more than 99.9% of your writes are available for queries within a few seconds.
at the bottom of this page:
http://code.google.com/appengine/docs/java/datastore/hr/overview.html
So, for my application, a 0.1% chance of it not being there on the next read is probably OK. However, I do plan to redesign my schema to make use of ancestor queries.

Java logging across multiple threads

We have a system that uses threading so that it can concurrently handle different bits of functionality in parallel. We would like to find a way to tie all log entries for a particular "transaction" together. Normally, one might use 'threadName' to gather these together, but clearly that fails in a multithreaded situation.
Short of passing a 'transaction key' down through every method call, I can't see a way to tie these together. And passing a key into every single method is just ugly.
Also, we're kind of tied to Java logging, as our system is built on a modified version of it. So, I would be interested in other platforms for examples of what we might try, but switching platforms is highly unlikely.
Does anyone have any suggestions?
Thanks,
Peter
EDIT: Unfortunately, I don't have control over the creation of the threads as that's all handled by a workflow package. Otherwise, the idea of caching the ID once for each thread (on ThreadLocal maybe?) then setting that on the new threads as they are created is a good idea. I may try that anyway.
You could consider creating a globally-accessible Map that maps a Thread's name to its current transaction ID. Upon beginning a new task, generate a GUID for that transaction and have the Thread register itself in the Map. Do the same for any Threads it spawns to perform the same task. Then, when you need to log something, you can simply lookup the transaction ID from the global Map, based on the current Thread's name. (A bit kludgy, but should work)
This is a perfect example for AspectJ crosscuts. If you know the methods that are being called you can put interceptors on them and bind dynamically.
This article will give you several options http://www.ibm.com/developerworks/java/library/j-logging/
However you mentioned that your transaction spans more than one thread, take a look at how log4j cope with binding additional information to current thread with MDC and NDC classes. It uses ThreadLocal as you were advised before, but interesting thing is how log4j injects data into log messages.
//In the code:
MDC.put("RemoteAddress", req.getRemoteAddr());
//In the configuration file, add the following:
%X{RemoteAddress}
Details:
http://onjava.com/pub/a/onjava/2002/08/07/log4j.html?page=3
http://wiki.apache.org/logging-log4j/NDCvsMDC
How about naming your threads to include the transaction ID? Quick and Dirty, admittedly, but it should work (until you need the thread name for something else or you start reusing threads in a thread pool).
If you are logging, then you must have some kind of logger object. You should have a spearate instance in each thread.
add a method to it called setID(String id).
When it is initialized in your thread, set a unique ID using the method.
prepend the set iD to each log entry.
A couple people have suggested answers that have the newly spawned thread somehow knowing what the transaction ID is. Unless I'm missing something, in order to get this ID into the newly spawned thread, I would have to pass it all the way down the line into the method that spawns the thread, which I'd rather not do.
I don't think you need to pass it down, but rather the code responsible for handing work to these threads needs to have the transactionID to pass. Wouldn't the work-assigner have this already?

Patterns/Principles for thread-safe queues and "master/worker" program in Java

I have a problem which I believe is the classic master/worker pattern, and I'm seeking advice on implementation. Here's what I currently am thinking about the problem:
There's a global "queue" of some sort, and it is a central place where "the work to be done" is kept. Presumably this queue will be managed by a kind of "master" object. Threads will be spawned to go find work to do, and when they find work to do, they'll tell the master thing (whatever that is) to "add this to the queue of work to be done".
The master, perhaps on an interval, will spawn other threads that actually perform the work to be done. Once a thread completes its work, I'd like it to notify the master that the work is finished. Then, the master can remove this work from the queue.
I've done a fair amount of thread programming in Java in the past, but it's all been prior to JDK 1.5 and consequently I am not familiar with the appropriate new APIs for handling this case. I understand that JDK7 will have fork-join, and that that might be a solution for me, but I am not able to use an early-access product in this project.
The problems, as I see them, are:
1) how to have the "threads doing the work" communicate back to the master telling them that their work is complete and that the master can now remove the work from the queue
2) how to efficiently have the master guarantee that work is only ever scheduled once. For example, let's say this queue has a million items, and it wants to tell a worker to "go do these 100 things". What's the most efficient way of guaranteeing that when it schedules work to the next worker, it gets "the next 100 things" and not "the 100 things I've already scheduled"?
3) choosing an appropriate data structure for the queue. My thinking here is that the "threads finding work to do" could potentially find the same work to do more than once, and they'd send a message to the master saying "here's work", and the master would realize that the work has already been scheduled and consequently should ignore the message. I want to ensure that I choose the right data structure such that this computation is as cheap as possible.
Traditionally, I would have done this in a database, in sort of a finite-state-machine manner, working "tasks" through from start to complete. However, in this problem, I don't want to use a database because of the high volume and volatility of the queue. In addition, I'd like to keep this as light-weight as possible. I don't want to use any app server if that can be avoided.
It is quite likely that this problem I'm describing is a common problem with a well-known name and accepted set of solutions, but I, with my lowly non-CS degree, do not know what this is called (i.e. please be gentle).
Thanks for any and all pointers.
As far as I understand your requirements, you need ExecutorService. ExecutorService have
submit(Callable task)
method which return value is Future. Future is a blocking way to communicate back from worker to master. You could easily expand this mechanism to work is asynchronous manner. And yes, ExecutorService also maintaining work queue like ThreadPoolExecutor. So you don't need to bother about scheduling, in most cases. java.util.concurrent package already have efficient implementations of thread safe queue (ConcurrentLinked queue - nonblocking, and LinkedBlockedQueue - blocking).
Check out java.util.concurrent in the Java library.
Depending on your application it might be as simple as cobbling together some blocking queue and a ThreadPoolExecutor.
Also, the book Java Concurrency in Practice by Brian Goetz might be helpful.
First, why do you want to hold the items after a worker started doing them? Normally, you would have a queue of work and a worker takes items out of this queue. This would also solve the "how can I prevent workers from getting the same item"-problem.
To your questions:
1) how to have the "threads doing the
work" communicate back to the master
telling them that their work is
complete and that the master can now
remove the work from the queue
The master could listen to the workers using the listener/observer pattern
2) how to efficiently have the master
guarantee that work is only ever
scheduled once. For example, let's say
this queue has a million items, and it
wants to tell a worker to "go do these
100 things". What's the most efficient
way of guaranteeing that when it
schedules work to the next worker, it
gets "the next 100 things" and not
"the 100 things I've already
scheduled"?
See above. I would let the workers pull the items out of the queue.
3) choosing an appropriate data
structure for the queue. My thinking
here is that the "threads finding work
to do" could potentially find the same
work to do more than once, and they'd
send a message to the master saying
"here's work", and the master would
realize that the work has already been
scheduled and consequently should
ignore the message. I want to ensure
that I choose the right data structure
such that this computation is as cheap
as possible.
There are Implementations of a blocking queue since Java 5
Don't forget Jini and Javaspaces. What you're describing sounds very like the classic producer/consumer pattern that space-based architectures excel at.
A producer will write the jobs into the space. 1 or more consumers will take out jobs (under a transaction) and work on that in parallel, and then write the results back. Since it's under a transaction, if a problem occurs the job is made available again for another consumer .
You can scale this trivially by adding more consumers. This works especially well when the consumers are separate VMs and you scale across the network.
If you are open to the idea of Spring, then check out their Spring Integration project. It gives you all the queue/thread-pool boilerplate out of the box and leaves you to focus on the business logic. Configuration is kept to a minimum using #annotations.
btw, the Goetz is very good.
This doesn't sound like a master-worker problem, but a specialized client above a threadpool. Given that you have a lot of scavenging threads and not a lot of processing units, it may be worthwhile simply doing a scavaging pass and then a computing pass. By storing the work items in a Set, the uniqueness constraint will remove duplicates. The second pass can submit all of the work to an ExecutorService to perform the process in parallel.
A master-worker model generally assumes that the data provider has all of the work and supplies it to the master to manage. The master controls the work execution and deals with distributed computation, time-outs, failures, retries, etc. A fork-join abstraction is a recursive rather than iterative data provider. A map-reduce abstraction is a multi-step master-worker that is useful in certain scenarios.
A good example of master-worker is for trivially parallel problems, such as finding prime numbers. Another is a data load where each entry is independant (validate, transform, stage). The need to process a known working set, handle failures, etc. is what makes a master-worker model different than a thread-pool. This is why a master must be in control and pushes the work units out, whereas a threadpool allows workers to pull work from a shared queue.

Categories