Multi-threaded Application - SQL Query for Duplicate Check - java

I'm just looking for high-level advice when dealing with an issue with a multi-threaded application.
Here's how it works:
The application takes in Alerts, which are then processed in different threads to make Reports. On occasion, two Alerts produce the same Report, which is not desired.
It is a Spring application, written in Java, using a MySQL DB.
I altered my code to run a SELECT query before saving a Report, checking whether a similar Report already exists. If it does, the Report is not generated. However, if two Alerts come in at the same time, the SELECT for Report #2 runs before Report #1 has been saved.
I thought about putting in a sleep() with a random wait time of 1-10 seconds, but that would still fail whenever the two threads happened to draw the same sleep time.
I'm pretty new to multi-threading, so does anyone have any ideas, or resources to point me in the right direction?
Thanks a lot!!

Assuming you have code that looks something like this:
Report report = getReport(...); // query the DB to see whether the record already exists
if (report == null) {
    insertReport(...); // insert a record that another thread may already have added
}
then to avoid collisions across threads (or JVMs) combine the SELECT and INSERT. For example:
insertReportIfNotAlreadyExists(...);
which uses a query structured as:
INSERT INTO REPORTS (...)
SELECT ...
FROM DUAL
WHERE NOT EXISTS (...)
with the NOT EXISTS subquery SELECTing for the record to make sure it doesn't already exist. (MySQL does not accept a WHERE clause on INSERT ... VALUES, which is why the INSERT ... SELECT form is needed.)
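A minimal JDBC sketch of that combined statement; the REPORTS schema (alert_id, body columns) and the method name are illustrative, not from the original code:

import java.sql.*;

// Sketch: atomic insert-if-absent in a single statement, assuming alert_id
// identifies a duplicate Report (schema and names are illustrative).
public boolean insertReportIfNotAlreadyExists(Connection con, long alertId, String body)
        throws SQLException {
    String sql =
        "INSERT INTO REPORTS (alert_id, body) "
      + "SELECT ?, ? FROM DUAL "
      + "WHERE NOT EXISTS (SELECT 1 FROM REPORTS WHERE alert_id = ?)";
    try (PreparedStatement ps = con.prepareStatement(sql)) {
        ps.setLong(1, alertId);
        ps.setString(2, body);
        ps.setLong(3, alertId);
        return ps.executeUpdate() == 1; // 0 rows: a matching Report already existed
    }
}

Under heavy concurrency it is still worth backing this with a UNIQUE index on the identifying column(s), so the database itself rejects any duplicate that slips through.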

Related

Memory Leak with Pentaho Kettle Looping?

I have an ETL requirement like:
I need to fetch around 20000 records from a table and process each record separately. (The processing of each record involves a couple of steps, like creating a table for that record and inserting some data into it.) For a prototype I implemented it with two Jobs (with corresponding transformations). Rather than a table, I created a simple empty file. But even this simple case doesn't work smoothly. (When I do create a table for each record, Kettle exits after 5000 records.)
Flow
When I run this, Kettle slows down and then hangs after 2000-3000 files; processing does complete after a long time, though Kettle seems to stop at some point. Is my design approach right? When I replace the write-to-file with the actual requirement, creating a new table (through the SQL Script step) for each id and inserting data into it, Kettle exits after 5000 records. What do I need to do so that the flow works? Increase the Java memory (Xmx is already at 2 GB)? Is there any other configuration I can change, or is there another way? Extra time shouldn't be a constraint, but the flow should work.
My initial guess was that since we are not storing any data, at least the prototype should run smoothly. I am using Kettle 3.2.
I seem to remember this is a known issue/restriction, which is why job looping is deprecated these days.
Are you able to rebuild the job using the Transformation Executor and/or Job Executor steps? You can execute any number of rows via those steps.
These steps have their own issues (namely, you have to handle errors explicitly), but it's worth a try just to see if you can achieve what you want. It's a slightly different mindset, but a nicer way to build loops than the job approach.

Duplicate Records in Database

I have had this problem in the last two projects I worked on. Both projects are written in Java and use Oracle 11g as the DB. When I look at the code, there is nothing wrong with the transaction management, etc. The flow is very simple and looks like this in code:
Connection con = null;
try {
    // Get connection
    // Run validation
    // Insert record
    // Commit
} catch (Exception e) {
    // Rollback
} finally {
    // Close connection
}
The validation part checks some business rules and prevents duplicate entries.
First case
This works fine when one user runs this code to completion and commits the transaction before another user comes along. In that case, because the first transaction has committed its changes, the validation can see the record and prevents the duplicate.
But when two users run the same code at the same time, duplicate records sometimes occur. The flow is like the one below, and I have no idea how to handle it. I've looked at isolation levels etc., but none of them works for this case. The only applicable option is a unique constraint, but that is not suitable for these projects.
user1 passes validation
user2 passes validation
user1 insert record
user2 insert record
Second case
The other case is totally bizarre; I can't reproduce it in my tests, but I have witnessed it in production. When the system load is high, the system creates duplicate records from a single click by a user. That is, the user presses the button only once, but the system creates multiple records in the background. These records have different ids but nearly identical creation times, and all the other values are the same.
We initially thought that under high load the application server couldn't handle it properly (it was an old, unsupported one), and because it happened rarely we left it at that. But some time later we had to change the application server for an unrelated reason, and the problem persisted. And the second project I mentioned runs on a totally different application server.
Two different teams and I worked on these problems for weeks, but we couldn't find a suitable solution for either case, and we couldn't even find the cause of the second one. Any help would be welcome if you have encountered something like this or know the solution.
You need to use synchronization on a shared object to avoid the duplicates. The Run validation block is probably a good candidate for the critical section, but it really depends on your application logic.
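A minimal sketch of that idea, assuming the whole application runs in a single JVM (a synchronized block cannot coordinate threads across separate servers); the recordExists and insertAndCommit helpers are hypothetical stand-ins for the validation and insert steps:

private static final Object INSERT_LOCK = new Object(); // shared lock object

public void saveRecord(Record record) throws SQLException {
    synchronized (INSERT_LOCK) {
        // Only one thread at a time reaches this point, so user2's validation
        // can no longer run between user1's validation and user1's insert.
        if (!recordExists(record)) {   // hypothetical validation helper
            insertAndCommit(record);   // hypothetical insert helper
        }
    }
}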
The second case has nothing to do with your web server; you need to use an idempotent HTTP method to submit your form, so that a repeated submission cannot create a second record.
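One common way to make the submission idempotent is a one-time form token (the synchronizer-token pattern); a rough servlet-style sketch, where session and request are the usual HttpSession and HttpServletRequest and all names are illustrative:

// When rendering the form: store a one-time token in the session and
// embed it in the page as a hidden field.
String token = java.util.UUID.randomUUID().toString();
session.setAttribute("formToken", token);

// When handling the submit: only the first request carrying the token wins.
String submitted = request.getParameter("formToken");
synchronized (session) {
    if (submitted != null && submitted.equals(session.getAttribute("formToken"))) {
        session.removeAttribute("formToken"); // consume the token
        processForm(request);                 // hypothetical business logic
    }
    // A double-click or replay falls through without creating a record.
}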

Disallow DML operations during package execution

I need a little help here because I'm struggling to find the best solution for my problem. I googled and didn't find any enlightening answer.
So, first of all, I'll explain the idea.
1 - I have a Java application that inserts data into my database (Oracle DB) using JDBC.
2 - My database is logically split in two: one part contains tables with exported information (from another application), and the other part contains tables that represent some reports.
3 - My Java app only inserts information into the export tables.
4 - I've developed some packages that transform the data from the export tables into the report tables (generating some reports).
5 - These packages are scheduled to execute 2 or 3 times a day.
So, my problem is that when the transformation task starts, I want to prevent new DML operations. Then, when the transformation finishes, all the new data that was supposed to be inserted/updated during that window should be inserted into the export tables after all.
I thought of two approaches:
1 - during the transformation window, divert the DML operations to a temporary table
2 - lock the tables, but I don't have much experience with this. My main question is: can I force DML operations in JDBC to wait until the lock is released? I haven't tried it yet, but I've read here and there that after some time a lock wait timeout exception (or something like that) is thrown.
Can anyone more experienced give me some advice?
If you have any doubts about what I'm trying to do, just ask.
Do not try locking tables as a solution. Sadly, that is common but rarely necessary. Just a few ideas:
at the start of the transformation, SELECT the data from the export table into a global temporary table, then execute your transformation packages against that temp table
create a materialized view over the export table. Investigate the options to REFRESH ON COMMIT, though it seems you need to refresh the table just before your transformation
analyze your exported data. As in many other cases, most of the data will probably never change once imported; only new data needs to be analyzed. To aid in processing, add a timestamp field called date_last_modified and a trigger on the table: when a row is updated, the trigger updates date_last_modified. This allows you to choose the smallest possible data set of "only changed records" (see the sketch after this list)
you should also investigate using BULK COLLECT to optimize your cursor. This allows you to fetch a group of records all at once, a sort of snapshot of the data at a point in time
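A small JDBC sketch of that "only changed records" fetch, assuming the date_last_modified column and trigger from the list above are in place; export_table and the column names are illustrative:

import java.sql.*;

// Sketch: process only rows changed since the previous transformation run.
public void processChangedRows(Connection con, Timestamp lastRun) throws SQLException {
    String sql = "SELECT * FROM export_table WHERE date_last_modified > ?";
    try (PreparedStatement ps = con.prepareStatement(sql)) {
        ps.setTimestamp(1, lastRun); // remembered from the previous run
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // transform this row into the report tables ...
            }
        }
    }
}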
I believe you are overthinking this. If you fetch records one at a time, Oracle will give you the state of each record as of the last commit by any user. If you BULK COLLECT a group of records, they go into memory and will, again, represent the state as of a point in time.
The best way to get comfortable with this is to set up a test case: set up a cursor that sleeps during every processing cycle, open another session, change the data that is being processed, and see what happens.

Table rows seem to be disappearing

I have a ton of raw HTML files that I'm parsing and inserting into a MySQL database via a connection in Java.
I'm using "REPLACE INTO" statements and this method:
public void migrate(SomeThread thread) throws Exception {
    PreparedStatement threadStatement = SQL.prepareStatement(threadQuery);
    thread.prepareThreadStatement(threadStatement);
    threadStatement.executeUpdate();
    threadStatement.close();

    for (SomeThread.Post P : thread.threadPosts) {
        PreparedStatement postStatement = SQL.prepareStatement(postQuery);
        P.preparePostStatement(postStatement);
        postStatement.executeUpdate();
        postStatement.close();
    }
}
I am running 3 separate instances of my program, each in its own command prompt, with its own separate directory of HTML files to parse and commit.
I'm using HeidiSQL to monitor the database, and a funny thing is happening: I'll see that I have 500,000 rows in a table at one point, for example, then I'll close HeidiSQL and check back later to find that I now have 440,000 rows. The same thing occurs for both of the tables I'm using.
Both of my tables use a primary key called "id". Each table's ids have their own domain, but it's possible their values overlap; could they be overwriting each other? I'm not sure whether this could be an issue, because I'd think SQL would differentiate between the tables' "local" id values.
Otherwise, I was thinking that since I'm running 3 separate instances, each with its own connection to the DB, some kind of magic is happening where, right as one row is being committed, execution swaps to another commit statement, disturbs the table, then swaps back to the first commit, and some further magic causes the database to lose the rows it had collected.
I'm pretty new to SQL, so I'm not too sure where to start. If somebody has an idea about what on earth is going on and could point me in the right direction, I'd really appreciate it.
Thanks
You might want to use INSERT INTO instead of REPLACE INTO. REPLACE INTO deletes any existing row that has the same PRIMARY KEY or UNIQUE key value before inserting the new one, so overlapping keys across your three instances really can remove rows that were inserted earlier.
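If the intent of the REPLACE INTO was "update the row if it already exists", a non-destructive alternative is INSERT ... ON DUPLICATE KEY UPDATE, which updates in place instead of deleting and re-inserting; a sketch with an illustrative posts schema, reusing the question's SQL connection object:

String sql = "INSERT INTO posts (id, body) VALUES (?, ?) "
           + "ON DUPLICATE KEY UPDATE body = VALUES(body)";
try (PreparedStatement ps = SQL.prepareStatement(sql)) {
    ps.setLong(1, postId);     // illustrative column values
    ps.setString(2, postBody);
    ps.executeUpdate();        // existing rows are updated, never deleted
}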
Data doesn't disappear.
Here are some tips:
Do you have another thread running that actually deletes entries?
Do other people have access to the database?
I'm not sure what HeidiSQL may do. To rule that possibility out, maybe use MySQL Workbench instead.
Yeah, now that I run a COUNT(*) query against my tables, I see that all my rows are in fact there.
Most likely the HeidiSQL summary page shows the estimated row count, which for InnoDB tables is only a rough approximation rather than an exact count.
Thanks for the suggestion to use Workbench, pete. I will try it and see if it is better than Heidi, as Heidi has been freezing up on me on a regular basis.

How can I know how many queries are fired in my database?

I have an employee management application. I am using a MySQL database.
In my application, I have functionality like add /edit/delete /view.
Whenever I use any of this functionality, a query is fired against the database; add employee, for example, fires an insert query.
So I want to do something on my database so that I can see how many queries have been fired to date.
I don't want to do any changes on my Java code.
You can use SHOW STATUS:
SHOW GLOBAL STATUS LIKE 'Questions'
As documented under Server Status Variables:
The status variables have the following meanings.
[ deletia ]
Questions
The number of statements executed by the server. This includes only statements sent to the server by clients and not statements executed within stored programs, unlike the Queries variable. This variable does not count COM_PING, COM_STATISTICS, COM_STMT_PREPARE, COM_STMT_CLOSE, or COM_STMT_RESET commands.
Beware that:
the statistics are reset when FLUSH STATUS is issued.
the SHOW STATUS command is itself a statement and will increment the Questions counter.
these statistics are server-wide and therefore will include other databases on the same server (if any exist)—a feature request for per-database statistics has been open since January 2006; in the meantime one can obtain per-table statistics from google-mysql-tools/UserTableMonitoring.
You can execute the queries below for per-command counts:
To get the SELECT query count, execute SHOW GLOBAL STATUS LIKE 'Com_select';
To get the UPDATE query count, execute SHOW GLOBAL STATUS LIKE 'Com_update';
To get the DELETE query count, execute SHOW GLOBAL STATUS LIKE 'Com_delete';
To get the INSERT query count, execute SHOW GLOBAL STATUS LIKE 'Com_insert';
You can also analyze the general log or route your application via a MySQL proxy to get all queries executed on a server.
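For illustration, these counters can be read over any existing connection without changing the application's own queries; a minimal JDBC sketch (connection setup is assumed):

import java.sql.*;

// Sketch: read server-wide statement counters via SHOW GLOBAL STATUS.
public void printQueryCounters(Connection con) throws SQLException {
    String sql = "SHOW GLOBAL STATUS WHERE Variable_name IN "
               + "('Questions','Com_select','Com_insert','Com_update','Com_delete')";
    try (Statement st = con.createStatement();
         ResultSet rs = st.executeQuery(sql)) {
        while (rs.next()) {
            // The result set has two columns: Variable_name and Value.
            System.out.println(rs.getString(1) + " = " + rs.getString(2));
        }
    }
}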
If you don't want to modify your code, you can trace this on the database with triggers. The restriction is that triggers can only fire on INSERT/UPDATE/DELETE, so they can't be used to count reads (SELECTs).
Maybe this is too "enterprise" and too "production" for your question.
If you use Munin (http://munin-monitoring.org/) (other monitoring tools have similar extensions), you can use its MySQL monitoring plugins, which show you how many requests (split into insert/update/load data/...) you are firing.
With these tools, you see the usage and the load you are producing.
Especially when data changes and may cause more accesses/load (missing indices, more queries because of big m:n tables, ...), you will recognize it.
It's extremely handy, and you can do the check during your break. No typing, nothing; just check the graphs.
I think the most exact method, which needs no modification of the database or application in order to operate, would be to configure your database management system to log all events.
You are left with a log file, a plain text file that can be analyzed on demand.
Here is The General Query Log manual page that will get you started.
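As a sketch, the general query log can also be switched on at runtime from any client session with sufficient privileges, with no application changes (the log file path is illustrative):

import java.sql.*;

// Sketch: enable the general query log at runtime (needs admin privileges).
public void enableGeneralLog(Connection con) throws SQLException {
    try (Statement st = con.createStatement()) {
        st.execute("SET GLOBAL general_log_file = '/var/log/mysql/general.log'");
        st.execute("SET GLOBAL general_log = 'ON'");
    }
}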
