Legacy purge job hangs due to multi cascade - java

We initially missed migrating a legacy scheduled "purge" job (Java based) to the cloud. Now that we have done so, the job always hangs, due to its original design: cascade deletes (or even regular ones) across 15 or so tables for each user identity.
The job runs fine for a few users, but because of the initial miss we ended up with thousands of users that need purging (with associated records in multiple tables). As a result, the first run goes on for hours and finally hangs.
A few approaches were tried (creating indexes, using a chunk size of 50, etc.), but none of them has worked so far.
Because this job works well for a few users (the likely scenario going forward), we are considering some kind of script/mechanism that deletes users in small batches (of, say, 5), iteratively, to be executed by a DBA. Once this is complete (all applicable users are purged), we would re-enable the legacy purge job with its original design, which should work for deleting a few users going forward. A rough sketch of the idea follows.
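For illustration, here is a minimal JDBC sketch of such a batch purge, assuming a hypothetical users table with a purge_pending flag and cascading foreign keys to the child tables; the table, column, and connection details are all made up, and the LIMIT syntax assumes MySQL/PostgreSQL:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    public class BatchPurge {
        private static final int BATCH_SIZE = 5;

        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://host/db", "user", "password")) { // placeholders
                conn.setAutoCommit(false);
                while (true) {
                    List<Long> ids = new ArrayList<>();
                    try (PreparedStatement select = conn.prepareStatement(
                            "SELECT id FROM users WHERE purge_pending = 1 LIMIT " + BATCH_SIZE);
                         ResultSet rs = select.executeQuery()) {
                        while (rs.next()) {
                            ids.add(rs.getLong(1));
                        }
                    }
                    if (ids.isEmpty()) {
                        break; // all applicable users purged
                    }
                    try (PreparedStatement delete = conn.prepareStatement(
                            "DELETE FROM users WHERE id = ?")) { // cascades cover child tables
                        for (Long id : ids) {
                            delete.setLong(1, id);
                            delete.addBatch();
                        }
                        delete.executeBatch();
                    }
                    conn.commit(); // one small transaction per batch keeps locks short
                }
            }
        }
    }

Committing after each small batch keeps every transaction short, so the run can be stopped and resumed at any point.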
Appreciate any suggestions/thoughts.

Related

How to schedule jobs / tasks

I do not really know what I am looking for, and I am not asking for the best solution. I would just like to hear possible ways of doing what I describe below.
I am building an application using Spring Boot.
I have a database with words in it. I use these words on a site to search for products, and I would like to run them over and over again with some time between each search.
So basically, the application asks the database for all words that have not been searched for within a specific time (let's say 5 minutes). The words I get in the query response I send out to a Kafka queue, where they are processed by workers. As soon as a word has been used by a worker, I update the database with the current time.
So I query the database about every minute (or more often) to find the words that have not been used for 5 minutes, and then run them again.
This gives me a lot of connections to the database, and I was wondering whether there is a better solution. The workers also save other data to other tables in the same database.
It is about 80-90 words that are turned over roughly every 5 minutes.
I had the thought of picking some of them and sending them to a scheduler, set to run once the remainder of the 5-minute window has passed.
If I schedule around 20 tasks at a time, will this affect memory a lot?
Currently I am using PostgreSQL, but maybe this is not the best DB for this kind of workload?
I need to be able to remove and/or add new words, so I do not know whether it is even possible to use an in-memory database for the words.
Sounds like the Spring Boot scheduler is what you need. Look into using the @Scheduled annotation. The following link should provide you with everything: Scheduled Annotation.
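As a minimal sketch of that approach (assuming Spring's JdbcTemplate and spring-kafka's KafkaTemplate are on the classpath; the table, column, and topic names are invented for illustration):

    import java.sql.Timestamp;
    import java.time.Instant;
    import java.util.List;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;

    @Component // requires @EnableScheduling on a configuration class
    public class WordPoller {
        private final JdbcTemplate jdbc;
        private final KafkaTemplate<String, String> kafka;

        public WordPoller(JdbcTemplate jdbc, KafkaTemplate<String, String> kafka) {
            this.jdbc = jdbc;
            this.kafka = kafka;
        }

        @Scheduled(fixedDelay = 60000) // 60 s after the previous run finishes
        public void pollStaleWords() {
            Timestamp cutoff = Timestamp.from(Instant.now().minusSeconds(300));
            List<String> words = jdbc.queryForList(
                    "SELECT word FROM words WHERE last_searched < ?",
                    String.class, cutoff);
            words.forEach(word -> kafka.send("word-searches", word));
            // Workers update last_searched once they have processed a word.
        }
    }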
I ended up using JobRunr, which let me schedule a different delay for each task I set up. So far it has been really smooth and has been working out fine for me.
The @Scheduled annotation in Spring Boot works really well too, but not for me, since I wanted more flexibility in setting up different delays for each task.
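For reference, a rough sketch of that per-task scheduling with JobRunr (the SearchService is hypothetical; BackgroundJob.schedule takes an Instant and a job lambda):

    import java.time.Duration;
    import java.time.Instant;
    import org.jobrunr.scheduling.BackgroundJob;

    public class WordScheduler {
        // Each word gets its own delay instead of one shared fixed rate.
        public void scheduleSearch(String word, Duration delay) {
            BackgroundJob.schedule(Instant.now().plus(delay),
                    () -> SearchService.search(word));
        }
    }

    class SearchService {
        static void search(String word) { /* hypothetical: perform the product search */ }
    }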

Concurrent calls to a custom plugin processed 1 at a time

I developed my own Neo4j plugin in order to speed up node insertion, mainly because I needed to insert nodes and relationships only if they didn't exist before, which can be too slow using the REST API.
If I call my plugin 100 times, inserting roughly 100 nodes and 100 relationships each time, each call takes approximately 350 ms. Each call inserts different nodes, in order to rule out locking as a cause.
However, if I parallelize my calls (2, 3, 4... at a time), the response time degrades in proportion to the degree of parallelism. It takes 750 ms to insert my 200 objects when I make 2 calls at a time, 1000 ms when I make 3, etc.
I'm calling my plugin from a .NET MVC controller, using HttpWebRequest. I set maxConnection to 10000, and I can see all the TCP connections being opened.
I investigated this issue a little, and something seems very wrong. I must have done something wrong, either in my Neo4j configuration or in my plugin code. Using VisualVM, I found out that the threads launched by Neo4j to handle my calls are working sequentially. See the linked picture.
http://i.imgur.com/vPWofTh.png
My configuration:
Windows 8, 2 cores
8 GB of RAM
Neo4j 2.0M03 installed as a service, with no configuration tuning
Hope someone will be able to help me. As it is, I will be unable to use Neo4j in production, where there will be tens of concurrent calls, which cannot be done sequentially.
Neo4j is transactional. Every commit triggers an IO operation on the filesystem, which needs to run in a synchronized block - this explains the picture you've attached. It is therefore best practice to run writes single-threaded. Any pre-processing beforehand can of course benefit from parallelization.
In general, for maximum performance, go with the stable version (1.9.2 as of today). Early milestone builds are not optimized yet, so you might get a misleading picture.
Another thing to consider is the transaction size used in your plugin. 10k to 50k operations in a single transaction should give you the best results. If your transactions are very small, the transactional overhead is significant; with huge transactions, you need lots of memory. A sketch of such batching follows.
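For illustration, a hedged sketch of that batching against the embedded Java API of that era (tx.success()/tx.finish()); the node data, property map, and batch size are assumptions, not your plugin's actual code:

    import java.util.List;
    import java.util.Map;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Transaction;

    public class BatchedInsert {
        private static final int BATCH_SIZE = 10000; // 10k-50k ops per transaction

        // Single writer: call from one thread only, per the advice above.
        public static void insert(GraphDatabaseService db,
                                  List<Map<String, Object>> rows) {
            for (int i = 0; i < rows.size(); i += BATCH_SIZE) {
                Transaction tx = db.beginTx();
                try {
                    for (Map<String, Object> row
                            : rows.subList(i, Math.min(i + BATCH_SIZE, rows.size()))) {
                        Node node = db.createNode();
                        for (Map.Entry<String, Object> e : row.entrySet()) {
                            node.setProperty(e.getKey(), e.getValue());
                        }
                    }
                    tx.success(); // mark the whole batch for commit
                } finally {
                    tx.finish(); // commit happens here in the 1.9-era API
                }
            }
        }
    }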
Write performance is heavily driven by the performance of the underlying IO subsystem. If possible, use fast SSD drives; even better, stripe them.

Quick and Dirty Solution to Load Balancing Batch Jobs

We're developing a web app and are coming to the end of development, and the client we're working with has suddenly sprung the fact on us that we will need to be able to handle load balancing.
We have batch jobs running which would need to run on both servers, but we don't want them to overlap. They are selecting rows from the database, processing the objects, and merging them back into the database. One of these jobs MUST run at the same time each day, though the others run every n minutes. We have about a week at most to get something working, and it'll become technical debt for us.
My question is: what quick and dirty hacks exist to get this working properly? We have a SQL Server 2008 instance and are running Java EE 6 on JBoss 5, which will be load balanced between two servers. We're using Spring 3 and JPA 2 backed by Hibernate, and the stock Spring scheduler to schedule and run the jobs. Help me, Obi-Wan Kenobi; you're my only hope!
On JBoss 5, the simplest solution is to use the Scheduler API - the implementation is built on top of Quartz, and generally you would use a clustered configuration like the one described here:
http://quartz-scheduler.org/documentation/quartz-2.x/configuration/ConfigJDBCJobStoreClustering
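The core of that setup is a JDBC-backed job store shared by both nodes, so only one node fires a given trigger. A hedged quartz.properties sketch (the SQL Server delegate and data source name are assumptions to match the stack in the question; verify the property names against your Quartz version):

    org.quartz.scheduler.instanceName = ClusteredScheduler
    org.quartz.scheduler.instanceId = AUTO

    # JDBC job store shared by both servers
    org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
    org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.MSSQLDelegate
    org.quartz.jobStore.tablePrefix = QRTZ_
    org.quartz.jobStore.dataSource = myDS

    # Clustering: nodes coordinate through the database
    org.quartz.jobStore.isClustered = true
    org.quartz.jobStore.clusterCheckinInterval = 20000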
Almost 10 years after this question was asked, I had the same need, and the "quickest and dirtiest" solution for me was a load balancer using a shared file system, without any master.
Each worker locks, then picks, jobs from the shared file system, independently of the other workers. To balance load, each worker sleeps X seconds between polling iterations, where X is proportional to the load on that worker (in my case, load is the count of processes the worker has started in the background). A heavily loaded worker thus sleeps longer, which gives the other workers a higher probability of picking up the next job. The worker loops run under supervisor (Linux).
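The original was a bash script, but the lock-then-pick loop can be sketched in Java for illustration (the directory, file naming, and load measure are invented; file-locking semantics vary across shared file systems, so treat this as a sketch only):

    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileLock;

    public class Worker {
        public static void main(String[] args) throws Exception {
            File jobDir = new File("/shared/jobs"); // the shared file system
            while (true) {
                File[] jobs = jobDir.listFiles((dir, name) -> name.endsWith(".job"));
                if (jobs != null) {
                    for (File job : jobs) {
                        try (RandomAccessFile raf = new RandomAccessFile(job, "rw");
                             FileLock lock = raf.getChannel().tryLock()) {
                            if (lock == null) {
                                continue; // another worker already holds this job
                            }
                            // ... run the job, then delete or archive the file ...
                        }
                    }
                }
                // Load-proportional back-off: busier workers poll less often.
                Thread.sleep(1000L * currentLoad());
            }
        }

        static long currentLoad() {
            return 1; // placeholder: e.g. count of background processes started
        }
    }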
My use case was executing sparklyr client-mode jobs on a Spark/Hadoop cluster without overloading the edge nodes. It was implemented as a bash script within a few hours, then scaled to 3 hosts, and has been stable for some months now - until there is time to invest in a better solution.

Concurrent periodic task running

I'm trying to find the best solution for running periodic tasks in parallel. Requirements:
Java (Spring w/o Hibernate).
Tasks are managed by a front-end application and stored in a MySQL DB (fields: id, frequency (in seconds), <other attributes/settings about task scenario>). -- Something like crontab, only with a frequency field in seconds instead of minutes/hours/days/months/days of week.
I'm thinking about:
A TaskImporter thread polling tasks from the DB (via TasksDAO.findToProcess()) and submitting them to a queue.
A java.util.concurrent.ThreadPoolExecutor running the tasks (from the queue) in parallel.
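A minimal sketch of that importer/executor split (the TasksDAO name comes from the question; the pool sizes and poll interval are arbitrary, and findToProcess() is assumed here to return runnable tasks):

    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    interface TasksDAO {
        List<Runnable> findToProcess(); // assumed signature for this sketch
    }

    public class TaskRunner {
        private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 16, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        private final ScheduledExecutorService importer =
                Executors.newSingleThreadScheduledExecutor();

        // TaskImporter: poll the DB periodically, hand each due task to the pool.
        public void start(TasksDAO dao) {
            importer.scheduleWithFixedDelay(
                    () -> dao.findToProcess().forEach(pool::submit),
                    0, 5, TimeUnit.SECONDS);
        }
    }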
The trickiest part of this architecture is TasksDAO.findToProcess():
How do I know which tasks are due to run right now?
I'm thinking about a next_run field on Task, which would be populated (UPDATE tasks SET next_run = TIMESTAMPADD(SECOND, frequency, NOW()) WHERE id = ?) straight after selection (SELECT * FROM tasks WHERE next_run IS NULL OR next_run <= NOW() FOR UPDATE). The problem: I would have to run lots of UPDATEs for the many SELECTed tasks (one UPDATE per task, or a bulk UPDATE), plus the concurrency problems below (see the sketch after this list).
Ability to run several concurrent processing applications (cloud) using/polling the same DB.
Each of the concurrent processing applications must run a given task only once. That means blocking the SELECTs of all other apps until app A finishes updating next_run for all of its selected tasks. The problem: locking the production table (used by the front-end app) would slow things down. A table mirror?
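For illustration, a hedged JDBC sketch of that claim step (SELECT ... FOR UPDATE plus a batched next_run update in one transaction, so the row locks block competing pollers); only the id/frequency/next_run columns from above are assumed, and it returns just the claimed ids for simplicity:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    public class TasksJdbc {
        /** Claims due tasks: selects them with row locks, pushes next_run forward, commits. */
        public List<Long> findToProcess(Connection conn) throws Exception {
            conn.setAutoCommit(false);
            List<Long> due = new ArrayList<>();
            try (PreparedStatement select = conn.prepareStatement(
                    "SELECT id FROM tasks WHERE next_run IS NULL OR next_run <= NOW() FOR UPDATE");
                 ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    due.add(rs.getLong(1));
                }
            }
            try (PreparedStatement update = conn.prepareStatement(
                    "UPDATE tasks SET next_run = TIMESTAMPADD(SECOND, frequency, NOW()) WHERE id = ?")) {
                for (Long id : due) {
                    update.setLong(1, id);
                    update.addBatch(); // one bulk round trip instead of an UPDATE per task
                }
                update.executeBatch();
            }
            conn.commit(); // releases the FOR UPDATE row locks
            return due;
        }
    }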
I love simple and clean solutions and believe there's a better way to implement this processing application. Do you see any? :)
Thanks in advance.
EDIT: Using Quartz as a scheduler/executor is not an option because of syncing latency. The front-end app is not in Java and so cannot interact with Quartz, except through a webservice-oriented solution, which is not an option either, because the front-end app has more data associated with the previously mentioned tasks and needs direct access to all data in the DB (read+write).
I would suggest using a scheduling API like Quartz rather than a home-grown implementation.
It provides a lot of API for implementing your logic, and a lot of convenience. You will also have better control over jobs.
http://www.quartz-scheduler.org/
http://www.quartz-scheduler.org/docs/tutorial/index.html

MySQL performance

I have this LAMP application with about 900k rows in MySQL and I am having some performance issues.
Background - Apart from the LAMP stack, there's also a Java process (multi-threaded) that runs in its own JVM. Together with LAMP and Java, it forms the complete solution. The Java process is responsible for inserts/updates and a few selects as well. These inserts/updates are usually in bulk/batch, anywhere between 5 and 150 rows. The PHP front-end code only does SELECTs.
Issue - the PHP SELECT queries become very slow when the Java process is running. When the Java process is stopped, SELECTs perform fine; the performance difference is huge. When the Java process is running, any action performed on the PHP front-end results in 80% or more CPU usage for the mysqld process.
Any help would be appreciated.
MySQL is running with default parameters & settings.
Software stack -
Apache - 2.2.x
MySQL -5.1.37-1ubuntu5
PHP - 5.2.10
Java - 1.6.0_15
OS - Ubuntu 9.10 (karmic)
What engine are you using for MySQL? The thing to note here is that if you're using MyISAM, you're going to have locking issues due to the table-level locking that engine uses.
From: MySQL Table Locking
Table locking is also disadvantageous under the following scenario:
* A session issues a SELECT that takes a long time to run.
* Another session then issues an UPDATE on the same table. This session waits until the SELECT is finished.
* Another session issues another SELECT statement on the same table. Because UPDATE has higher priority than SELECT, this SELECT waits for the UPDATE to finish, after waiting for the first SELECT to finish.
I won't repeat them here, but the page has some tips on increasing concurrency on a table within MySQL. Obviously, one option would be to change to an engine like InnoDB, which has a more granular row-locking mechanism that can make a huge difference in performance for high-concurrency tables. For more info on InnoDB, go here.
Before changing the engine, though, it would probably be worth looking at the other tips, like making sure your table is indexed properly, as this will increase select and update performance regardless of the storage engine.
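For instance (illustrative statements only; the orders table and customer_id column are invented, and an engine switch should be tested on a copy first):

    -- Check which engine a table currently uses
    SHOW TABLE STATUS LIKE 'orders';

    -- See whether a slow SELECT actually uses an index
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

    -- Add an index on the filtered column if it is missing
    CREATE INDEX idx_orders_customer ON orders (customer_id);

    -- Switch to row-level locking
    ALTER TABLE orders ENGINE = InnoDB;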
Edit based on user comment:
I would say it's one possible solution based on the symptoms you've described, but it may not be the one that gets you where you want to be. It's impossible to say without more information.
You could be doing full table scans due to a lack of indexes. This could be causing I/O contention on your disk, which further exacerbates the table locks used by MyISAM. If this is the case, then the root cause is the improper indexing, and rectifying that would be your best course of action before changing storage engines.
Also, make sure your tables are normalized. This can have profound implications for performance, especially on updates. Normalized tables can allow you to update a single row instead of hundreds or thousands in an un-normalized table, because values are not duplicated. It can also save huge amounts of I/O on selects, as the DB can cache data blocks more efficiently. Without knowing the structure of the tables you're working with or the indexes you have present, it's difficult to give a more detailed response.
Edit after user attempted using InnoDB:
You mentioned that your Java process is multi-threaded. Have you tried running the process with a single thread? I'm wondering whether you might be sending the same rows out to multiple threads to update, and/or whether the way you're updating across threads is causing locking issues.
Outside of that, I would check the following:
Have you checked your explain plans to verify you have reasonable costs and that the query is actually using the indexes you have?
Are your tables normalized? More specifically, are you updating 100 rows when you could update a single record if the tables were normalized?
Is it possible that you're running out of physical memory when the Java process is running and the machine is busy swapping stuff in and out?
Are you flooding your disk (a single disk?) with more IOPS than it can reasonably handle?
We'd need to know a lot more about the system to say whether that's normal or how to solve the problem.
with about 900k rows in MySQL
I would say that makes it very small - so if it's performing badly, then you're going seriously wrong somewhere.
Enable the query log to see exactly which queries are running, and prioritize them by the product of frequency and duration. Have a look at the explain plans, and create some indexes. Think about splitting the database across multiple disks.
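On MySQL 5.1 the logs can be switched on at runtime, roughly like this (illustrative; log destinations depend on your configuration):

    -- Log every statement (heavyweight - enable only briefly)
    SET GLOBAL general_log = 'ON';

    -- Or capture only slow statements
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1; -- seconds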
HTH
C.
