I have batch steps where I read a file, process it, and save it to the DB. Because it is a full load, I have to clear the existing table in the DB before inserting the new records.
My question is: what is the best place in Spring Batch (reader, processor, writer) to put the code that deletes the existing data? Below are the scenarios I have considered:
Do it in the open() method of the ItemReader in the reader class. Problem: if the file I am reading is somehow corrupt or blank, I will end up with an empty table.
Create a flag, set it once, and delete the data in the processor class based on that flag. This can be done, but is there a better way or a better place to do it?
Create another temp table, copy all the records from the file into it, and then in an @AfterStep method delete all the records from the actual table and move the records from the temp table into it.
Is there any method that gets called just once before the ItemProcessor, other than doing it with a flag? Please suggest.
It sounds to me like you have 2 steps here:
Truncate existing table.
Insert data from file.
You need to decide which step should execute first.
If you want to execute the 'truncate' step first, you can add checks to validate the file before performing the truncate. For example, you could use a @BeforeStep to check that the file exists, is readable, and is not size 0.
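A minimal sketch of that validation, assuming the file path is passed as a job parameter named inputFile (the listener class and the parameter name are made up):

```java
import java.io.File;

import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;

public class InputFileValidationListener {

    @BeforeStep
    public void validateInputFile(StepExecution stepExecution) {
        // 'inputFile' is an assumed job parameter holding the path to the flat file
        String path = stepExecution.getJobParameters().getString("inputFile");
        File file = new File(path);

        if (!file.exists() || !file.canRead() || file.length() == 0) {
            // Fail fast so the truncate never runs against a missing or empty file
            throw new IllegalStateException("Input file is missing, unreadable or empty: " + path);
        }
    }
}
```

Registering this listener on the truncate step means the job fails before any existing rows are touched.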
If you need to guarantee that the entire file is parsed without error before loading the database table, then you will need to parse the data into some temporary location as you mention and then in a second step move the data from the temporary location to the final table. I see a few options there:
Create a temporary table to hold the data as you suggest. After reading the file, in a different step, truncate the target table and move from the temp table to the target table.
Add a 'createdDateTime' column or something similar to the existing table and an 'executionDate' job parameter. Then insert your new rows as they are parsed. In a second step, delete any rows that have a created time that is less than the executionDate (This assumes you are using a generated ID for a PK on the table).
Add a 'status' column to the existing table. Insert the new rows as 'pending'. Then in a second step, delete all rows that are 'active' and update rows that are 'pending' to 'active' (see the sketch after this list).
Store the parsed data in memory. This is dangerous for a few reasons, especially if the file is large. It also removes the ability to restart a failed job, as the data in memory would be lost on failure.
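As a rough illustration of the 'status' column option, the second step could be a Tasklet that swaps the pending rows in one transaction (the table and column names here are invented):

```java
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.jdbc.core.JdbcTemplate;

public class ActivatePendingRowsTasklet implements Tasklet {

    private final JdbcTemplate jdbcTemplate;

    public ActivatePendingRowsTasklet(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        // Drop the previous load, then promote the rows inserted by the file-reading step.
        // Both statements run inside the tasklet step's transaction.
        jdbcTemplate.update("DELETE FROM my_table WHERE status = 'active'");
        jdbcTemplate.update("UPDATE my_table SET status = 'active' WHERE status = 'pending'");
        return RepeatStatus.FINISHED;
    }
}
```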
You can:
Create 2 tables, Data and DataTemp, with the same schema
Keep the existing data in table Data
Copy the new data into DataTemp
If the new incoming data stored in DataTemp is valid, then you can transfer it to table Data.
If you don't have much data, you could do this transfer in a single transaction in another Tasklet.
If you have a lot of data, you could use a chunk-processing tasklet. Since you have already inserted the data, you know it is not corrupted and respects the DB constraints, so if the transfer fails (probably because the DB is down), you can restart it later without losing any data.
You can also avoid the transfer part altogether by using 2 tables and recording in your configuration which of the 2 tables is the active one. Or you could use a table alias/view that you update to reference the active table.
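The alias/view variant could be as simple as repointing a view once the load into DataTemp has been validated; a sketch with plain JDBC (the view name active_data and the MySQL-style CREATE OR REPLACE VIEW syntax are assumptions):

```java
import java.sql.Connection;
import java.sql.Statement;

import javax.sql.DataSource;

public class ActiveTableSwitcher {

    private final DataSource dataSource;

    public ActiveTableSwitcher(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Repoints the 'active_data' view at whichever table just received a valid load. */
    public void activate(String loadedTable) throws Exception {
        try (Connection con = dataSource.getConnection();
             Statement st = con.createStatement()) {
            // No data is copied; readers of active_data immediately see the new table
            st.execute("CREATE OR REPLACE VIEW active_data AS SELECT * FROM " + loadedTable);
        }
    }
}
```

The loadedTable name should come from your own configuration (Data or DataTemp), not from user input, since it is concatenated into the statement.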
Related
Just started out with my hobby project and now I am here to get help with making the correct database design/query. I have made a simple Java program that loops through the content of a folder. I want to save this content to a MySQL database, so I added a connector to my database in Java and created a table with the columns "file", "path", "id" and "date" in MySQL.
So now to the important/fun thing: every time I want to add the filenames to MySQL in Java, I do this (when the GUI button is pressed I call a method that does the following):
DELETE all entries with the same file path - this is to ensure that I get new entries that exactly match the content of the path.
Java-loop: INSERT the file-info into the columns id, path, filename and date when the file was added to the database.
In this way I can always ensure that the filenames added to the database are up to date; it doesn't matter if I rename or remove a file, the table stays current since its entries get deleted and the new info is written. Old info -> DELETE old info -> INSERT new info -> up to date.
I know this is probably not the best solution, but it works. Now I am stuck on the next thing I want to do: I want to record the difference between the files in order to know which files have been added and deleted between two inserts. And here is my problem: since the entries are deleted before a new INSERT, I cannot compare. How would you change the design or the solution? All ideas are welcome, and since I am so fresh I would really appreciate it if you could show me how the query could look.
Do not remove all rows first. Remove only the ones that were actually removed (or even better, just mark them "inactive" as I suggest below). Query your DB first to see what was there last time.
I would maintain an additional column in your table called "inactive". It will be FALSE by default, and TRUE for removed files. Please keep in mind that since your file is uniquely identified by file+path+id, renaming a file is really an operation of deleting the old one and creating a new one.
Removing things from the DB is not a good idea, as you might always remove something by accident (a bug in the code) and would not be able to get the data back.
An additional thing to do is adding a hash to your table. This way you will be able to check whether the file has really changed; there is no need to re-add the file to the DB if it has not changed. See Getting a File's MD5 Checksum in Java for more info.
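A small sketch of computing such a hash with just the JDK (no external libraries), which you could store alongside each file row:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class FileHasher {

    /** Returns the MD5 of a file as a hex string, e.g. for a CHAR(32) column. */
    public static String md5Of(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // reading through the DigestInputStream updates the digest as a side effect
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```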
One way to achieve this is to implement auditing of your table. A common approach is to create a copy of the table where you store the folder contents and name it using a convention that indicates it holds audit information (e.g. a _AUD suffix). You then add extra columns to the AUD table, like "REV" (revision) and "REV_TYPE" (inserted, deleted, modified). Whenever you insert, update or delete rows in your main table, you insert a row into the AUD table describing what you did. Then you can find the operations associated with each revision by looking them up in the AUD table. A Java framework that provides this feature is Hibernate Envers (http://hibernate.org/orm/envers/).
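With Envers the audit table and revision tracking come almost for free once the entity is annotated; a hedged sketch (the entity and its fields are invented, and depending on your Hibernate version the imports may be jakarta.persistence instead of javax.persistence):

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.hibernate.envers.Audited;

@Entity
@Audited // Envers maintains FolderFile_AUD with revision and revision-type columns
public class FolderFile {

    @Id
    @GeneratedValue
    private Long id;

    private String path;
    private String filename;
    private boolean inactive;

    // getters and setters omitted for brevity
}
```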
I am currently trying to push some data from Java into BigQuery so that I can use it only for the next query and then get rid of it.
The data consists of 3 columns which exist in the table I want to query. Therefore, by creating a temporary table containing this data, I could make a left join and get the query results I need.
This process will happen on a scheduled basis, having different sets of data.
Can you please tell me if that can be done?
Many thanks!
Using the jobs.query API, you can specify a destinationTable as part of configuration.query. Does that help? You can control the table expiration time using the tables.update API and setting expirationTime.
Alternatively, you can "inline" the table as part of the query that you want to run using a WITH clause in standard SQL rather than writing to a temporary table.
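If you are using the google-cloud-bigquery Java client (an assumption; the question does not say which client), writing the query results to a destination table might look roughly like this; the project, dataset, table and column names are placeholders:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class TempTableQuery {

    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Standard SQL; a WITH clause could inline the 3 uploaded columns instead of a real table
        String sql = "SELECT t.* FROM `my_project.my_dataset.main_table` t "
                   + "LEFT JOIN `my_project.my_dataset.uploaded_rows` u USING (key_col)";

        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(sql)
                .setDestinationTable(TableId.of("my_dataset", "query_result")) // results land here
                .setUseLegacySql(false)
                .build();

        // Run the query and print each result row
        bigquery.query(config).iterateAll().forEach(System.out::println);
    }
}
```

An expiration time can then be set on query_result (for example via the tables.update API mentioned above) so the scratch data cleans itself up.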
I have a word2vec model stored in a text file as
also -0.036738 -0.062687 -0.104392 -0.178325 0.010501 0.049380....
one -0.089568 -0.191083 0.038558 0.156755 -0.037399 -0.013798....
The size of the text file is more than 8GB.
I want to read this file into a MySQL database, using the first word as the key (in one column) and the rest of the line as another column. Is it possible to do so without reading each line and splitting it?
I went through some related questions, but they didn't match what I want.
How to read a file and add its content to database?
read text file content and insert it into a mysql database
You can do it by:
making a simple for loop that iterates over the records in the model
aggregating about 100 records in an array
using MySQL's bulk insert feature to insert hundreds of records at once
using a fast language like Go if you can.
This thing you are trying to do is very possible; a rough sketch follows, and let me know if you need more code for this.
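A hedged sketch of that loop in plain JDBC (the table word_vectors, its columns, and the connection details are assumptions); splitting on the first space only keeps the word in one column and the raw vector text in the other:

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class Word2VecLoader {

    public static void main(String[] args) throws Exception {
        // rewriteBatchedStatements lets Connector/J turn the batch into multi-row inserts
        String url = "jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             BufferedReader reader = Files.newBufferedReader(Paths.get("model.txt"));
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO word_vectors (word, vector) VALUES (?, ?)")) {

            String line;
            int pending = 0;
            while ((line = reader.readLine()) != null) {
                int firstSpace = line.indexOf(' ');
                if (firstSpace < 0) {
                    continue; // skip malformed lines
                }
                ps.setString(1, line.substring(0, firstSpace));   // the word
                ps.setString(2, line.substring(firstSpace + 1));  // the raw vector text
                ps.addBatch();

                if (++pending == 100) { // flush roughly every 100 rows, as suggested above
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();
            }
        }
    }
}
```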
I am uploading a CSV file using a Servlet and inserting it into an Oracle table using JDBC. I need to insert only the unique values from the CSV file. If a value in the CSV file is already in the database table, then it should not be inserted. So only unique values from the CSV file should be inserted.
These options should help avoid an additional DB call to handle this situation.
Option 1: Very simple and needs the least coding ... but it works only when there is no global transaction boundary.
Just let all the inserts go through; in case of a constraint exception, catch it, do nothing, and loop on to the next value (see the sketch after this note).
Option 2: Every time you read a row from the CSV, add it to a collection; before adding, check whether the object already exists (e.g. arrayList.contains(objectInstance)) and continue adding only when there is no object with the same data. At the end, do a bulk insert.
Note: If the data is large, go for a fixed batch size for the bulk insert.
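Option 1 in code might look something like this (the table, column and unique constraint are assumed to already exist; depending on the JDBC driver you may need to catch SQLException and inspect the vendor error code instead):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;
import java.util.List;

public class UniqueInserter {

    /** Inserts each CSV value, silently skipping the ones the unique constraint rejects. */
    public static void insertAll(Connection con, List<String> csvValues) throws Exception {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO my_table (value_col) VALUES (?)")) {
            for (String value : csvValues) {
                try {
                    ps.setString(1, value);
                    ps.executeUpdate();
                } catch (SQLIntegrityConstraintViolationException duplicate) {
                    // already in the table: do nothing, move on to the next value
                }
            }
        }
    }
}
```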
Consider these steps:
Read a value from the CSV file
Search the database for that value and insert it only if it is not found.
I'm guessing the way you're inserting data into the database is in a loop: you read from the CSV and insert into the DB. So why not simply run a SELECT statement to check whether the value exists, and if it does, don't insert it?
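A minimal select-then-insert sketch of that idea, under the same assumed table and column names as the previous example:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SelectThenInsert {

    /** Inserts the value only if it is not already present. */
    public static void insertIfAbsent(Connection con, String value) throws Exception {
        try (PreparedStatement check = con.prepareStatement(
                "SELECT COUNT(*) FROM my_table WHERE value_col = ?")) {
            check.setString(1, value);
            try (ResultSet rs = check.executeQuery()) {
                rs.next();
                if (rs.getInt(1) > 0) {
                    return; // already there, skip the insert
                }
            }
        }
        try (PreparedStatement insert = con.prepareStatement(
                "INSERT INTO my_table (value_col) VALUES (?)")) {
            insert.setString(1, value);
            insert.executeUpdate();
        }
    }
}
```

Note that without a unique constraint this check-then-insert is racy if several uploads run concurrently.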
We have a source table which is updated from various external systems. I require the destination table (on a different server) to be in sync with this source table. The destination table is not an exact replica of the source table; some data processing has to be done before the data is inserted/updated into the destination table.
I have thought of the following logic:
Every 15 minutes we run a Java consumer that fetches the records whose timestamp is later than that of the previous update and stores them in a CachedRowSet, then calls a stored procedure with the CachedRowSet as a parameter, where the data processing is done and the data is inserted/updated into the destination table.
Do you believe the above is an efficient way, given that we are dealing with over a million records every update?
Also, when a record is deleted in the source table, it would not be replicated with the above method. Can you suggest what to do in such a scenario?
Something similar to the technique databases use for savepoints and rollback.
Whenever there is some change in the source table (i.e. any CRUD operation), keep a script of the change in the format required by the target table. Periodically you can push those changes to the target server. As your source table is updated by various external systems, you'll need a trigger on your source table to keep the script log.
You might want to check out mk-table-sync from Maatkit tools:
http://www.maatkit.org/doc/mk-table-sync.html
You'd need to be careful around your table differences.
Here are some existing solutions:
https://www.symmetricds.org/
http://opensource.replicator.daffodilsw.com/