Status: solved
I had to make a pastebin because I needed to point out line numbers.
Note: I am not using ExecutorService or thread pools, deliberately, just to understand what is wrong with starting and using threads this way. If I use one thread, the app works perfectly!
related links:
http://www.postgresql.org/docs/9.1/static/transaction-iso.html
http://www.postgresql.org/docs/current/static/explicit-locking.html
main app, http://pastebin.com/i9rVyari
logs, http://pastebin.com/2c4pU1K8 , http://pastebin.com/2S3301gD
I am starting many threads (10) in a for loop, each instantiating a Runnable class, but it seems I am getting the same result from the DB. (I am getting a string from the DB, then changing it, but each thread gets the same string, despite every thread having changed it.) I am using JDBC for PostgreSQL; what might be the usual issues?
Line 252 and line 223: the link is marked as processed (true) in the DB. Other threads of the Crawler class also do this, so when line 252 gets a link, it should be one with processed = false, but I see all threads take the same link.
When one of the threads has crawled the link, it sets processed = true. The others then should not crawl (get) it, since it is marked processed = true.
getNonProcessedLinkFromDB() returns a non-processed link.
public String getNonProcessedLink(){ — line 645
public boolean markLinkAsProcesed(String link){ — line 705
getNonProcessedLinkFromDB() looks for processed = false links and hands out one of them (LIMIT 1).
Each thread has a starting interval gap of 20 secs;
within one thread, 1 or 2 seconds (estimated processing time for crawling).
Line 98 keeps threads from grabbing the same URL.
If you look at the result: one thread set it to true, yet others still access it, way later.
All threads are separate; even if one races, the DB sets the link to true the moment the first thread processes it.
This is a case of the question not being asked concisely. There is lots of code in there and no one can tell what is going on. You need to break it down so that you can understand where it is going wrong, then show us that bit.
Some points of potential conflict:
You are opening a database connection for almost every operation. The normal flow of an application is to open a few connections, do some processing, then close them.
Are you handling database commits? I don't remember what the default setting is for a Postgres database; you'll have to look into it.
There are 3 states a single URL can be in: unprocessed, being processed, processed. I don't think you are handling the 'being processed' state at all. Because being processed takes time and may fail, you have to account for those situations.
I did not read the logs because they are useless to me.
-edit for comment-
Databases generally have transactions. Modifications you make in one transaction are not seen in other transactions until they are committed. Transactions can be rolled back. You'll need to look into fetching the row you just updated and checking whether the value has really changed. Do this in another transaction or on another connection.
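For instance, with plain JDBC the commit boundary is explicit once auto-commit is off. A minimal sketch, assuming a hypothetical links(url, processed) table and credentials embedded in the JDBC URL, not the asker's real schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch: mark a link in one transaction, then verify the change from a
// second connection. Until commit() runs, the reader sees the old value.
public static void markAndVerify(String jdbcUrl, String link) throws SQLException {
    try (Connection writer = DriverManager.getConnection(jdbcUrl)) {
        writer.setAutoCommit(false);
        try (PreparedStatement ps = writer.prepareStatement(
                "UPDATE links SET processed = true WHERE url = ?")) {
            ps.setString(1, link);
            ps.executeUpdate();
        }
        writer.commit(); // only now do other connections see processed = true
    }
    try (Connection reader = DriverManager.getConnection(jdbcUrl);
         PreparedStatement ps = reader.prepareStatement(
                "SELECT processed FROM links WHERE url = ?")) {
        ps.setString(1, link);
        try (ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                System.out.println("other connection sees: " + rs.getBoolean(1));
            }
        }
    }
}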
The gap of 20 seconds only applies when the process is started. Imagine a situation where Thread1 processes URL1 and Thread2 processes URL2. They both finish at about the same time. They both look for the next unprocessed URL (say URL3). They would both start processing this URL because they don't know another thread has started it. You need one process handing out the URL; possibly a queue is what you'd want to look at.
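A sketch of that hand-out idea with a BlockingQueue; the class is invented for illustration, and whatever fills it must hand each link out at most once (e.g. by marking it as queued in the DB):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: exactly one component puts links on the queue; worker threads
// only ever take() from it, so no two workers can be handed the same URL.
public class UrlDispenser {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Called by the single dispenser with a batch of unprocessed links.
    public void fill(Iterable<String> nonProcessedLinks) {
        for (String url : nonProcessedLinks) {
            queue.offer(url);
        }
    }

    // Called by worker threads; blocks until a URL is available.
    public String next() throws InterruptedException {
        return queue.take();
    }
}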
Logging might be improved if you knew which threads were working on which URLs. You also need a smaller sample size so that you can get your head around what is going on.
The comments and responses from the helpers in this post were also correct.
At the start of the crawl() method body:

synchronized (Crawler.class) {
    url = getNonProcessedLinkFromDB();
    new BasicDAO().markLinkAsProcesed(url);
}
and at the bottom of the crawl() method body (when it has done processing):
crawl(nonProcessedLinkFromDB);
actually solved the issue.
The problem was the gap between marking a link processed = true and fetching a new one, which let other threads get the same link while the current one was working on it.
The synchronized block helped further.
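An alternative that pushes the atomicity into PostgreSQL itself (so it also holds across several JVMs) is to claim a link in a single statement. A sketch, again assuming a hypothetical links(url, processed) table rather than the asker's real schema:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: atomically mark one unprocessed link and fetch it in the same
// statement. If two connections race, the loser's UPDATE re-checks
// "processed = false" after the winner commits, matches zero rows and
// returns null, so the caller simply tries again.
public static String claimNextLink(Connection conn) throws SQLException {
    String sql =
        "UPDATE links SET processed = true " +
        "WHERE processed = false AND url = " +
        "      (SELECT url FROM links WHERE processed = false LIMIT 1) " +
        "RETURNING url";
    try (Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery(sql)) {
        return rs.next() ? rs.getString(1) : null;
    }
}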
Thanks to the helper "Fuber" on the IRC channels #java (Quakenet) and ##javaee (Freenode),
and ALL who supported me!
I know this subject has been discussed here before, and we have used past conversations to attempt to resolve the DbMaxReadersExceededException that we are still experiencing. We are using version 2.5.1 of ObjectBox. We are also heavily using RxJava threads while manipulating our BoxStore DB. At any moment in time, potentially a handful of RxJava threads are running, accessing the DB. Threads are constantly spawning, executing and terminating.
This is a very "non-standard" use of Android. Our app runs on a non-cell-phone device that sits on a wall and is expected to be available 24x7. 95% of the RxJava threads that access the BoxStore DB are short-lived, get in / get out threads that retrieve information and present it to the device user. We do have a few longer-lived background RxJava threads that talk to an external DB over the internet to keep the local DB up to date, but these threads also spawn, execute and terminate, running in the background at regular intervals. These background threads are not associated with a Fragment or an Activity; therefore the common way of cleaning up, using a CompositeDisposable, is not an option.
We are seeing that readers accumulate, despite many attempts to resolve the situation. We have also noticed that threads that have run to termination, yet are marked as isAlive and appear to be part of the RxJava thread pool, accumulate as well. We have observed this using Thread.getAllStackTraces() and printing out this information regularly. That is a separate issue I am not trying to resolve with this post (I am concentrating on the DbMaxReadersExceededException issue), though they may be related.
Based on analysis of when a new reader appears, the readers accumulate as the result of .find() calls on a Query that is built. That is not surprising, but sometimes a .find() causes a new reader and sometimes it does not. I do not understand this behavior, and I am not sure whether that is a telling sign or not. But it does result in the accumulation of active readers every time the RxJava thread that accessed a given Box is invoked.
Any help / assistance offered will be greatly appreciated. Please ask about anything that I may have accidentally left out.
Things that we have tried, based upon other posts that I have read, include:
Collect Disposables from RxJava background threads and dispose
We have tried collecting the Disposable generated by the .subscribe() from these background threads, and added a timer to .dispose() of them sometime (5 seconds) after the thread that was using this object terminates (run to completion).
Utilized BoxStore.diagnose()
We have written code to utilize BoxStore.diagnose() to be able to periodically watch the reader accumulation.
Tried BoxStore.closeThreadResources()
We have added BoxStore.closeThreadResources() calls when an RxJava thread terminates, to clean up any BoxStore resources that may be active (see the sketch after this list).
Tried Box.closeThreadResources()
We have tried adding Box.closeThreadResources() calls closer to when the Box is accessed in order to access and then clean up ASAP.
Tried breaking down the .method() sequence and added .close() calls to intermediate objects
We have tried breaking down the .method() call sequence that terminates with the .find() call and then .close() or .closeThreadResources() the intermediate objects along the way.
Tried combinations of the above
We have tried a combination of all of the above.
Wrote method to be able to monitor RxJava threads using Thread.getAllStackTraces() - RxJava threads seem to accumulate
We have written a method that helps us monitor RxJava threads using Thread.getAllStackTraces().
We have tried to manually invoke the Garbage Collector
We added code, after the .dispose(), mentioned above, to cause a manual Garbage Collection (System.gc()).
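Regarding the BoxStore.closeThreadResources() attempt above, one way to tie that call to the exact worker thread that ran the query is RxJava's doFinally. A sketch assuming RxJava 2, where Item stands in for one of your entity classes:

import io.objectbox.Box;
import io.objectbox.BoxStore;
import io.reactivex.Single;
import io.reactivex.schedulers.Schedulers;
import java.util.List;

// Sketch: run the query on an io() worker and release that worker's
// reader when the stream terminates. doFinally runs on the thread that
// produced the terminal event, so keep it BEFORE any observeOn() call
// if the cleanup should land on the thread that executed find().
public Single<List<Item>> loadItems(BoxStore boxStore, Box<Item> box) {
    return Single
            .fromCallable(() -> box.query().build().find())
            .doFinally(boxStore::closeThreadResources)
            .subscribeOn(Schedulers.io());
}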
As far as I know, we have tried every suggestion that I have seen posted on this and other forums regarding this issue. We are at a loss as to what to do or try next. I did see something about a package called RxObjectBox, but I have not pursued it any further.
Should we:
Look at restructuring our RxJava thread access?
Do we need to look closer at RxObjectBox?
Is there a known problem with ObjectBox 2.5.1 that we should be using a later version?
What haven't we tried that we should?
I would like to put a question to the community and get as much feedback as possible about a strategy I have been considering to resolve some performance issues in my project.
The context:
We have an important process that performs four steps:
1. An entity status change and its persistence.
2. If step 1 ends OK, the entity is exported into a CSV file.
3. If step 2 ends OK, the entity is exported into another CSV, this one with far more info.
4. If step 3 ends OK, the last CSV is sent by mail.
Steps 1 and 2 are linked and they are critical.
Steps 3 and 4 are not critical; it doesn't even matter whether they end successfully.
Performance of steps 1-2 is fine, but steps 3-4 in some scenarios are just insanely slow, mostly because of step 3.
If we execute all the steps as a sequence, step 3 sometimes causes a timeout. The client does not get any response about steps 1 and 2 (the important ones) and the user doesn't know what's going on.
This case made me think of JMS queues in order to delegate the last two steps to another app/process and decouple the notification from the business logic. The second export and the mailing would be processed when possible, and probably in parallel. I could also split it into two queues: exports and mail notification.
Our webapp runs in a WebLogic 11 cluster, so I could use its JMS implementation.
What do you think about the strategy? Is the WebLogic JMS implementation any good? Should I check another implementation: ActiveMQ, RabbitMQ, ...?
I have also been thinking about a ticketing-system implementation with spring-tasks.
At this point I have to mention spring-batch. Its usage here is limited: we already have many jobs focused on important data-consolidation processes, and the window of time for allocating more jobs is limited. There is also the impact of trying to process all items massively at once.
Maybe we could, if we found a way to use spring-batch's multithreading, but we haven't yet found a way to fit our requirements into such a strategy.
Thank you in advance and excuse my English. I promise to keep working hard on it :-).
One problem to consider is data integrity. If step n fails, does step n-1 need to be reversed? Are there any ordering dependencies that you need to be aware of? And are you writing to the same or different CSVs? If the same, you might have contention issues.
Now, back to the original problem. I would consider Java executors, using 4 fixed-size pools, and moving the task through the pools as successes occur:
Submit step 1 to pool 1, getting a Future back, which will be used to check for completion.
When step 1 completes, you submit step 2 to pool 2.
When step 2 completes, you can return a result to the caller. The call up to this point has been waiting (likely with a timeout so it doesn't hang around forever), but now the critical tasks are done.
After returning to the client, submit step 3 to pool 3.
When step 3 completes, submit step 4 to pool 4.
The pools themselves, while fixed size, could be larger for pools 1/2 to get maximum throughput (and to get back to your client as quickly as possible), and pools 3/4 could be smaller but still large enough to get the work done.
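A minimal sketch of that flow with CompletableFuture; the pool sizes, step methods and String payloads are stand-ins, not the poster's real code:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Pipeline {
    // Illustrative sizes: bigger for the critical path, smaller for the tail.
    private final ExecutorService pool1 = Executors.newFixedThreadPool(8);
    private final ExecutorService pool2 = Executors.newFixedThreadPool(8);
    private final ExecutorService pool3 = Executors.newFixedThreadPool(2);
    private final ExecutorService pool4 = Executors.newFixedThreadPool(2);

    public String process(String entity) throws Exception {
        // Steps 1 and 2: the caller waits (with a timeout) for the critical path.
        String basicCsv = CompletableFuture
                .supplyAsync(() -> persistStatusChange(entity), pool1) // step 1
                .thenApplyAsync(this::exportBasicCsv, pool2)           // step 2
                .get(30, TimeUnit.SECONDS);

        // Steps 3 and 4: fire and forget after the caller has its answer;
        // failures here are logged, never propagated to the client.
        CompletableFuture
                .supplyAsync(() -> exportDetailedCsv(entity), pool3)   // step 3
                .thenAcceptAsync(this::mailCsv, pool4)                 // step 4
                .exceptionally(t -> { t.printStackTrace(); return null; });

        return basicCsv;
    }

    // Stubs standing in for the real steps.
    private String persistStatusChange(String e) { return e; }
    private String exportBasicCsv(String e)      { return e + ".csv"; }
    private String exportDetailedCsv(String e)   { return e + "-full.csv"; }
    private void mailCsv(String csv)             { /* send mail */ }
}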
You could do something similar with JMS, but the issues are similar: you need multiple listeners or multiple threads per listener so that you can process at an appropriate speed. You could do steps 1/2 synchronously without a pool, but then you don't get some of the thread management that executors give you. You would still need to "schedule" steps 3/4 by putting them on the JMS queue and still have listeners to process them.
The ability to recover from the server going down is key here, but Executors/ExecutorService have no persistence, so then I'd definitely be looking at JMS (and then I'd be queuing absolutely everything up, even the first 2 steps), but depending on your use case that might be overkill.
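For the JMS variant, the producer side is small. A sketch against the JMS 1.1 API (which WebLogic 11 ships); the JNDI names and the sendExportRequest() method are made up, and in WebLogic the ConnectionFactory and Queue would be configured in the console:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.InitialContext;

// Sketch: after steps 1/2 have returned to the client, enqueue a message
// carrying enough information for a listener to run steps 3/4 later.
public void sendExportRequest(String entityId) throws Exception {
    InitialContext ctx = new InitialContext();
    ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/ConnectionFactory");
    Queue queue = (Queue) ctx.lookup("jms/ExportQueue");

    Connection conn = cf.createConnection();
    try {
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer = session.createProducer(queue);
        producer.send(session.createTextMessage(entityId));
    } finally {
        conn.close(); // closing the connection closes its sessions/producers
    }
}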
Yes, an event-driven approach where a message bus handles the integration sounds good. It is asynchronous, so you will not have timeouts. Of course you will need to use a Topic. WLS has some memory issues when you have too many messages in the server; maybe a different server would work better for separation of concerns and resources.
I need my application to submit data to a web service and wait 90 sec for the response. But if there is no response within 60 secs, I need to redirect the user to a different page and continue to wait for the response for another 30 secs; if it comes, process it.
I know I need to use threads for this but am not sure how to integrate them in this case so that the threads can exchange data between themselves.
Any ideas? I'm using JSF for the UI.
The requirement is as follows: the web service will send its response within 90 secs (that's its maximum response time), but the user will be given a response within 60 sec (a dummy response, in case the real one does not come within 60 sec). So even after the user has been given a dummy response (after 60 sec), my application will continue to wait another 30 sec for the real response.
Don't know much about JSF, but it sounds like you want a timer, probably java.util.Timer. If the answer comes back before the timer goes off, shut down the timer. If the timer goes off, reset it for 30 seconds and redirect the user. The next time it goes off, give up waiting for the correct answer.
That much you seem to understand. But you've got at least two interacting threads here. How do they communicate?
Just use instance fields. All references to them should be made from synchronized methods or blocks. Do that and you should be fine. You'll have to figure out the details, but I would imagine an int timerPhase that indicates whether the timer has not started, is in the first 60 seconds, is in the next 30, or has timed out; also a boolean answerReceived, something with the answer in it, and perhaps a few others.
(Too much synchronization can slow your program down. I don't think you will have this problem, but if you do, split the synchronized blocks up, with each field synchronized separately unless they interact, or remove synchronization and use the volatile keyword. Read up on multithreading, and on the volatile keyword. Think real hard about how parallel threads can interact, and prepare for things to get real weird.)
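A sketch of that shared-state idea; the class, field and phase names are invented for illustration, not a prescribed design:

import java.util.Timer;
import java.util.TimerTask;

// Sketch: state shared between the request thread and the timer thread.
// All access goes through synchronized methods, as suggested above.
public class ResponseTracker {
    public static final int NOT_STARTED = 0, FIRST_60S = 1, EXTRA_30S = 2, TIMED_OUT = 3;

    private int timerPhase = NOT_STARTED;
    private boolean answerReceived = false;
    private String answer; // whatever the web service returns

    private final Timer timer = new Timer(true);

    public synchronized void start() {
        timerPhase = FIRST_60S;
        timer.schedule(new TimerTask() {
            @Override public void run() { onSixtySeconds(); }
        }, 60_000);
    }

    private synchronized void onSixtySeconds() {
        if (answerReceived) return;   // the real answer beat the timer
        timerPhase = EXTRA_30S;       // caller should now show the dummy page
        timer.schedule(new TimerTask() {
            @Override public void run() { onNinetySeconds(); }
        }, 30_000);
    }

    private synchronized void onNinetySeconds() {
        if (!answerReceived) timerPhase = TIMED_OUT; // give up waiting
    }

    public synchronized void deliverAnswer(String a) {
        answer = a;
        answerReceived = true;
        timer.cancel();
    }

    public synchronized int phase() { return timerPhase; }
    public synchronized String answer() { return answerReceived ? answer : null; }
}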
I have a long-running job that updates thousands of entity groups. I want to kick off a 2nd job afterwards that will have to assume all of those items have been updated. Since there are so many entity groups, I can't do it in a transaction, so I've just scheduled the 2nd job to run 15 minutes after the 1st completes, using task queues.
Is there a better way?
Is it even safe to assume that 15 minutes guarantees the datastore is in sync with my previous calls?
I am using high replication.
In the Google I/O videos about HRD, they give a list of ways to deal with eventual consistency. One of them was to "accept it". Some updates (like Twitter posts) don't need to be consistent with the next read. But they also said something like "hey, we're only talking milliseconds to a couple of seconds before they are consistent". Is that time frame documented anywhere else? Is it safe to assume that waiting 1 minute after a write before reading again means all my previous writes will be there in the read?
The mention of that is at the 39:30 mark in this video http://www.youtube.com/watch?feature=player_embedded&v=xO015C3R6dw
I don't think there is any built-in way to determine if the updates are done. I would recommend adding a lastUpdated field to your entities and updating it with your first job, then checking the timestamp on the entity you're updating with the 2nd job before running... kind of a hack, but it should work.
Interested to see if anybody has a better solution. Kinda hope they do ;-)
This is automatic as long as you are getting entities without changing the consistency to Eventual. The HRD puts data to a majority of the relevant datastore servers before returning. If you are calling the asynchronous version of put(), you'll need to call get() on all the Future objects before you can be sure it's completed.
If however you are querying for the items in the first job, there's no way to be sure that the index has been updated.
So for example...
If you are updating a property on every entity (but not creating any entities) and then retrieving all entities of that kind, you can do a keys-only query followed by a batch get (which is approximately as fast/cheap as doing a normal query) and be sure that you have all updates applied.
On the other hand, if you're adding new entities or updating a property in the first process that the second process queries, there's no way to be sure.
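A sketch of that keys-only-then-batch-get pattern with the low-level datastore API; "MyKind" stands in for the real entity kind:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: the keys-only query may lag behind recent writes, but the batch
// get by key is strongly consistent in the HRD, so the entities returned
// reflect all committed updates.
public static Map<Key, Entity> fetchAllCurrent() {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    List<Key> keys = new ArrayList<Key>();
    for (Entity e : ds.prepare(new Query("MyKind").setKeysOnly()).asIterable()) {
        keys.add(e.getKey());
    }
    return ds.get(keys);
}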
I did find this statement:
With eventual consistency, more than 99.9% of your writes are available for queries within a few seconds.
at the bottom of this page:
http://code.google.com/appengine/docs/java/datastore/hr/overview.html
So, for my application, a 0.1% chance of it not being there on the next read is probably OK. However, I do plan to redesign my schema to make use of ancestor queries.
I have an application that checks a resource on the internet for new mails. If there are new mails, it does some processing on them. This means that, depending on the amount of mail, it might take anywhere from a few seconds to hours of processing.
Now, the object/program that does the processing is already a singleton, so I have already made sure that there is really only 1 instance handling the checking and processing.
However, it only runs once now, and I'd like to have it running continuously, checking for new mails more or less every 10 minutes, to handle them in a timely manner.
I understand I can take care of this with Timer/TimerTask, or even better, I found a resource here: http://www.ibm.com/developerworks/java/library/j-schedule/index.html that uses Scheduler/SchedulerTask. But what I am afraid of is this: if I set it to run every 10 minutes and a previous session is still processing data, the new task will be queued, waiting to be executed once the previous one is done. So, for instance, if the first run goes on for 5 hours, then, because it was busy all that time, it will launch 5*6-1=29 runs immediately after each other, checking for mails and/or doing some processing without giving the server a break.
Does anyone know how I can solve this?
P.S. The way I have my application set up right now: I'm using a Java servlet on my Tomcat server, launched upon server start, which creates a singleton instance of my main program and then calls some method to do the fetching/processing. What I want is to repeat that fetching/processing every "x" amount of time (10 minutes or so), making sure that really only 1 instance is doing this and that after each run there really is a rest of 10 minutes or so.
Actually, Timer + TimerTask can deal with this pretty cleanly. If you schedule something with Timer.scheduleAtFixedRate(), you will notice that the docs say it will attempt to "make up" late events to maintain the long-term period of execution. However, this can be overcome by using TimerTask.scheduledExecutionTime(). The example in those docs lets you figure out whether the task is too tardy to run, so you can just return instead of doing anything. This will, in effect, "clear the queue" of TimerTasks.
Of note: Timer uses a single thread to execute tasks, so it won't spawn two copies of your task side by side.
On a side note, you don't have to process all 10k emails in the queue in a single run. I would suggest processing for a fixed amount of time, using TimerTask.scheduledExecutionTime() to figure out how long you have, then returning. That keeps your process more limber, cleans up the stack between runs, and, if you are doing aggregates, ensures that you don't have to rebuild too much data if, for example, the server is restarted in the middle of the task. But this recommendation is based on generalities, since I don't know what you're doing in the task :)
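A sketch combining both ideas; MAX_TARDINESS follows the java.util.Timer javadoc example, while the time budget, period and checkAndProcessMails() are placeholders:

import java.util.Timer;
import java.util.TimerTask;

// Sketch: skip runs that Timer is trying to "make up", and cap how long
// one run may work so the next scheduled run starts from a clean slate.
public class MailCheckTask extends TimerTask {
    private static final long MAX_TARDINESS = 60_000;      // 1 minute
    private static final long TIME_BUDGET   = 8 * 60_000;  // stop before the next run

    @Override
    public void run() {
        if (System.currentTimeMillis() - scheduledExecutionTime() >= MAX_TARDINESS) {
            return; // too tardy: drop this "make-up" execution entirely
        }
        long deadline = scheduledExecutionTime() + TIME_BUDGET;
        while (System.currentTimeMillis() < deadline && checkAndProcessMails()) {
            // keep processing batches until out of mail or out of time
        }
    }

    // Placeholder: process one batch, return true if more mail remains.
    private boolean checkAndProcessMails() { return false; }

    public static void main(String[] args) {
        new Timer().scheduleAtFixedRate(new MailCheckTask(), 0, 10 * 60_000);
    }
}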