Get or Retrieve Generated PKs after a massive insert SQLLDR - java

I'll be direct about my situation. I'm working on a project that performs a "base load" procedure from an Excel (xlsx, xls) file. It is written in Java and uses JDBC drivers. The project works today: it takes an Excel file and, based on a configuration, performs the inserts into different tables. The problem is that it takes far too long (around 2 hours to insert 3000 records into the DB). In the future this software will be inserting around 30k records, and it will be painfully slow. So I need to improve its efficiency, and my idea is this: instead of inserting from Java via JDBC, I will generate control files and data files and load them into the DB using SQLLDR.
The problem I'm facing now is that I need to insert this data into several tables, and these tables are related to each other. That means that if I insert a person into "Person_table", I need the primary key generated by a database sequence in order to insert the address, phone, email, etc. into the other tables, and I do not know how to get the primary keys generated by the first insert via SQLLDR.
I'm not sure yet whether SQLLDR is the best way to do this, but I guess it is, because the DBMS is Oracle.
Can you point me towards a way to do what I've described? Any suggestion is welcome, even if it is not about how to do this with SQLLDR.
I'm rather stuck at this point, and I'd really appreciate any help you can give me.

SQL*Loader can't read native Excel files (at least, as far as I know). Therefore, you'll have to save the result as a CSV file.
As you need to manipulate foreign key constraints, consider switching to the external tables feature. The background is still SQL*Loader, but you can write (PL/)SQL against those files/tables (yes, a CSV file stored on a hard disk acts as if it were an Oracle table).
So, you'd "load" one table, populate primary key values, then populate another (child) table, possibly via a "temporary" table (not necessarily a global temporary table) that doesn't have any constraints enabled. There you populate the foreign key values, and then you move the data into the "real" target table, whose constraints now won't fail.
Possible drawback: the CSV files have to reside in a directory that is accessible to the database server, as you'll have to create a directory (an Oracle object) and grant the required privileges (usually read, write) to the user who will be using it. The directory is usually created on the server itself; if not, you'll have to use a UNC path while creating it.
Now you have something to read about/research; see if it makes sense to you.
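The order of operations described above (keys for the parents first, then foreign keys for the children) can be sketched in plain Java. This is only an in-memory illustration of the logic, not external-table SQL: the class name `StagingKeyResolver`, the use of an e-mail address as the natural key, and the starting id 1000 are all assumptions made for the example, and the counter merely stands in for a database sequence.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of the staging flow: parents get keys first,
// then child rows swap their natural reference for the generated key.
class StagingKeyResolver {

    // Assigns consecutive ids starting at firstId, standing in for a database sequence.
    static Map<String, Long> assignParentKeys(List<String> naturalKeys, long firstId) {
        Map<String, Long> ids = new LinkedHashMap<>();
        long next = firstId;
        for (String key : naturalKeys) {
            if (!ids.containsKey(key)) {
                ids.put(key, next++);
            }
        }
        return ids;
    }

    // Each child is {parentNaturalKey, value}; the result is a list of "parentId|value" rows.
    static List<String> resolveChildren(List<String[]> children, Map<String, Long> parentIds) {
        List<String> resolved = new ArrayList<>();
        for (String[] child : children) {
            Long parentId = parentIds.get(child[0]);
            if (parentId == null) {
                throw new IllegalArgumentException("No parent loaded for " + child[0]);
            }
            resolved.add(parentId + "|" + child[1]);
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, Long> ids =
                assignParentKeys(List.of("ann@example.com", "bob@example.com"), 1000L);
        List<String[]> phones = new ArrayList<>();
        phones.add(new String[] {"bob@example.com", "555-0101"});
        System.out.println(resolveChildren(phones, ids)); // prints [1001|555-0101]
    }
}
```

In the database the same two steps would be set-based statements against the staging tables; the sketch only shows why the child rows need something (here the natural key) that ties them back to their parent before the generated key exists.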

Related

Add Column to Cassandra db without losing data

I am using Cassandra database integrated into a spring boot application.
My Question is around the schema actions. If I need to make structural changes to the DB, say add a column to a table, the database needs to be recreated, however this means all the existing data gets deleted:
schema-action: CREATE_IF_NOT_EXISTS
The only way I have managed to solve this is by using the RECREATE schema action, but as mentioned earlier, this results in data loss.
What would be the best approach to handle this? How can I make structural changes, such as adding a column, without having to recreate the database and lose all existing data?
Thanks
Cassandra does allow you to modify the schema of an existing table without recreating it from scratch, using the ALTER TABLE statement via cqlsh. However, there are some important limitations on the kind of changes you can make: you cannot modify the primary key of the table at all, you can add or drop regular columns, and you can't change the type of a column to an incompatible one.
The reason for most of these limitations is how Cassandra needs to deal with the old data that already exists in the table. For example, it doesn't make sense to say that a column A that until now contained strings - will now contain integers - how are we supposed to handle all the old values in column A which weren't integers?
As Aaron rightly said in a comment, it is unlikely you'll want to do these schema changes as part of your application. These are usually rare operations which are done manually, or via some management application - not your usual application.
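For the specific case of adding a column, the statement to run in cqlsh has the shape `ALTER TABLE keyspace.table ADD column type`. The small Java helper below only assembles that text so it can be reviewed and pasted into cqlsh; it does not talk to Cassandra. The class name `AddColumnStatement` and the keyspace, table and column names in the usage are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper that builds the CQL text for adding one regular column.
class AddColumnStatement {

    // Unquoted CQL identifiers: a letter followed by letters, digits or underscores.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z][A-Za-z0-9_]*");

    // The CQL type (for example "text" or "int") is passed through unchecked.
    static String build(String keyspace, String table, String column, String cqlType) {
        for (String name : new String[] {keyspace, table, column}) {
            if (!IDENTIFIER.matcher(name).matches()) {
                throw new IllegalArgumentException("Not a plain identifier: " + name);
            }
        }
        return "ALTER TABLE " + keyspace + "." + table + " ADD " + column + " " + cqlType + ";";
    }

    public static void main(String[] args) {
        // prints ALTER TABLE shop.users ADD nickname text;
        System.out.println(build("shop", "users", "nickname", "text"));
    }
}
```

Existing rows simply have no value for the new column until one is written, which is why this kind of change does not require recreating the table.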

Looking for the right solution to store filenames

I just started out with my hobby project and now I am here to get help with making the correct database design/query. I have made a simple Java program that loops through the content of a folder. I want to save this content to a MySQL database, so I added a connector to my database in Java, and created a table with the columns "file", "path", "id" and "date" in MySQL.
So now to the important/fun thing, every time I want to add the filenames to the MySQL in Java I do this (when the GUI-button is pressed I call on a method that does):
DELETE all entries with the same file path - this is to ensure that I will get new entries which is exactly the same as the content in the path.
Java-loop: INSERT the file-info into the columns id, path, filename and date when the file was added to the database.
In this way I can always ensure that the filenames in the database are up to date. It doesn't matter if I rename a file or remove it, since the table gets its entries deleted and new info is written. Old info -> DELETE old info -> INSERT new info -> Up-to-date.
I know this is probably not the best solution, but it works. Now I am stuck on the next thing I want to do: I want to record the difference between two inserts, in order to know which files have been added and deleted. Here is my problem: since the entries are deleted before a new INSERT, I cannot compare. How would you change the design or the solution? All ideas are welcome, and since I am so new to this I would really appreciate it if you could show me what the query could look like.
Do not remove all rows first. Remove only the ones that are removed (or even better, just mark them "inactive" as I suggest below). Query your DB first, to see what was there last time.
I would maintain an additional column in your table called "inactive". It will be FALSE by default, and TRUE for removed files. Please keep in mind that, as your file is uniquely identified by file+path+id, renaming a file is in effect an operation of deleting the old one and creating a new one.
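Once the previous state has been read from the DB, the comparison itself is two set differences. A minimal sketch, assuming the rows for one path have already been loaded into a set of file names; the class name `FolderDiff` and the sample file names are invented for the example.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper comparing what the database knows with what is on disk now.
class FolderDiff {

    // Files on disk that the database has not seen yet: these get an INSERT.
    static Set<String> added(Set<String> inDatabase, Set<String> onDisk) {
        Set<String> result = new TreeSet<>(onDisk);
        result.removeAll(inDatabase);
        return result;
    }

    // Files the database still lists but that are gone from disk: mark these inactive.
    static Set<String> removed(Set<String> inDatabase, Set<String> onDisk) {
        Set<String> result = new TreeSet<>(inDatabase);
        result.removeAll(onDisk);
        return result;
    }

    public static void main(String[] args) {
        Set<String> inDatabase = Set.of("a.txt", "b.txt");
        Set<String> onDisk = Set.of("b.txt", "c.txt");
        // prints added=[c.txt] removed=[a.txt]
        System.out.println("added=" + added(inDatabase, onDisk)
                + " removed=" + removed(inDatabase, onDisk));
    }
}
```

Files present in both sets need no write at all, which also makes each run cheaper than deleting and re-inserting everything.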
Removing things from DB is not a good idea, as you might always remove something by accident (bug in the code) and would not be able to get the data back.
An additional thing to do is to add a hash column to your table. This way you will be able to check whether the file was really changed. There is no need to re-add the file to the DB if it has not changed. See Getting a File's MD5 Checksum in Java for more info.
One way to achieve this is to implement auditing of your table. A common approach is to create a copy of the table where you are storing the folder contents and name that table using a convention that indicates it stores audit information (e.g. _AUD). You then add additional columns to the AUD table, like "REV" (revision) and "REV_TYPE" (inserted, deleted, modified). Whenever you insert, update or delete any rows in your main table, you insert a row into the AUD table to describe what you've done. Then you can find the operations associated with each revision by looking them up in the AUD table. A Java framework that provides this feature is Hibernate Envers (http://hibernate.org/orm/envers/).
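A minimal sketch of computing such a hash with the JDK's `java.security.MessageDigest`. The class name `FileChecksum` is made up, and reading the whole file into memory is a simplification that is only reasonable for small files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper producing the lowercase hex MD5 of a file or byte array.
class FileChecksum {

    static String md5Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Simplification: loads the whole file into memory before hashing.
    static String md5Hex(Path file) throws IOException {
        return md5Hex(Files.readAllBytes(file));
    }

    public static void main(String[] args) {
        // prints 900150983cd24fb0d6963f7d28e17f72
        System.out.println(md5Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Comparing the stored hash with a freshly computed one tells you whether the content changed even when the name and path did not.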

Informix, MySQL and Oracle blob contains

We have an application that runs with any of IBM Informix, MySQL and Oracle, and we are using Java with Hibernate to connect to the database. We will store XML, CSV and other text-based files inside the database (clob column). The entities in Java are byte[] objects.
One feature request to the application is now to "grep" content inside the data. So I need to find all files with a specific content.
On regular char/varchar fields I can use like '%xyz%', but this is not working on byte[] / blobs.
The first approach was to load each entity, convert the byte[] into a String and use the contains method in Java. If the user enters any filter parameters on other (non-clob) columns, I apply those filters before testing the clob in order to reduce the number of blobs I have to scan.
That worked quite well for 100 files (clobs) and as long as the application and database are on the same server. But I think it will get really slow if I have 1.000.000 files inside the database and the database is not always in the same network. So I think that is not a good idea.
My next thought was creating a database procedure. But I am not sure whether this is possible on all of Informix, MySQL and Oracle.
The last but not favored method is to store the content of the data not inside a clob. Maybe I can use a different datatype for that?
Does anyone have a good idea how to implement that? I need a solution for all three DBMS. The application knows which kind of DBMS it is connected to, so it would be okay to have three different solutions (one for each DBMS).
I am completely open to changing what kind of datatype I use (BLOB, CLOB ...) — I can modify that as I want.
Note: the clobs will range from about 5 KiB to about 500 KiB, with a maximum of 1 MiB.
Look into Apache Lucene or other text indexing library.
https://en.wikipedia.org/wiki/Lucene
http://en.wikipedia.org/wiki/Full_text_search
If you go with a DB-specific solution like Oracle Text Search, you will have to implement a custom solution for each database. I know from experience that Oracle Text Search takes significant time to learn and involves a lot of tweaking to get just right.
Also, if you use a DB solution you would receive different results in each DB even if the data sets were the same (each DB would have its own methods of indexing and retrieving the data).
By going with a third-party solution like Lucene, you only have to learn one solution, and results will be consistent regardless of the DB.
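To show what a text index buys you over scanning every clob, here is a toy inverted index in plain Java. It is not Lucene's API and leaves out almost everything a real library does; the class name `TinyTextIndex` and the sample documents are invented. Note that it matches whole lowercase words, not arbitrary substrings as like '%xyz%' would.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each word to the ids of the documents containing it.
class TinyTextIndex {

    private final Map<String, Set<Long>> postings = new HashMap<>();

    // Splits the text on anything that is not a letter or digit and records each word.
    void add(long documentId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(documentId);
            }
        }
    }

    // Looks the word up directly instead of scanning every document.
    Set<Long> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyTextIndex index = new TinyTextIndex();
        index.add(1L, "<order id=\"7\">Widget</order>");
        index.add(2L, "name,price\nwidget,3");
        System.out.println(index.search("WIDGET")); // prints [1, 2]
    }
}
```

The index is built once when a file is stored, so a search touches only the matching ids and the clobs themselves never have to cross the network.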

How to tell initial data load to insert only the values which are not there in target db?

I have a large amount of data in one table and a small amount in another. Is there any way to run a GoldenGate initial load so that the data that is the same in both tables is left unchanged and the rest of the data is transferred from one table to the other?
Initial loads are typically for when you are setting up the replication environment; however, you can do this as well on single tables. Everything in the Oracle database is driven by System Change Numbers/Commit Sequence Numbers (SCN/CSN).
By using the SCN/CSN, you can identify what the starting point in the table should be and start CDC from there. Anything prior to the SCN/CSN will not get captured and would require you to manually move that data in some fashion. That can be done by using Oracle Data Pump (Export/Import).
Oracle GoldenGate also provides a parameter called SQLPredicate that allows you to use a "where" clause against a table. This is handy with initial load extracts because you would do something like TABLE ., SQLPredicate "as of ". Then data before that point would be captured and moved to the target side for a replicat to apply into a table. You can reference that here:
https://www.dbasolved.com/2018/05/loading-tables-with-oracle-goldengate-and-rest-apis/
Official Oracle Doc: https://docs.oracle.com/en/middleware/goldengate/core/19.1/admin/loading-data-file-replicat-ma-19.1.html
On the replicat side, you would use HANDLECOLLISIONS to kick out any duplicates. Then, once the load is complete, remove it from the parameter file.
Lots of details, but I'm sure this is a good starting point for you.
That would require programming in Java.
1) First you would read your database
2) Decide which data has to be added in which table on the basis of data that was read.
3) Execute update/ data entry queries to submit data to tables.
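Step 2 of the list above comes down to keeping only the source rows whose key is not yet present in the target. A minimal in-memory sketch, assuming both tables have already been read over JDBC and that rows are identified by an integer key; the class name `MissingRowSelector` and the sample data are invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for step 2: decide which source rows still have to be inserted.
class MissingRowSelector {

    // Returns the source rows whose key is absent from the target, in source order.
    static Map<Integer, String> rowsToInsert(Map<Integer, String> source, Set<Integer> targetKeys) {
        Map<Integer, String> missing = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> row : source.entrySet()) {
            if (!targetKeys.contains(row.getKey())) {
                missing.put(row.getKey(), row.getValue());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<Integer, String> source = new LinkedHashMap<>();
        source.put(1, "alpha");
        source.put(2, "beta");
        source.put(3, "gamma");
        System.out.println(rowsToInsert(source, Set.of(1, 3))); // prints {2=beta}
    }
}
```

Rows that already exist in the target are never touched, which is the behaviour the question asks for.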
If you want to run Initial Load using GoldenGate:
Target tables should be empty
Data: Make certain that the target tables are empty. Otherwise, there may be duplicate-row errors or conflicts between existing rows and rows that are being loaded. Link to Oracle Documentation
If the tables are not empty, you have to handle conflicts. For instance, if the row you are inserting already exists in the target table (INSERTROWEXISTS), you can discard it, if that's what you want to do. Link to Oracle Documentation

What process should be done with mysql commands and what should be done with Java?

This is my first time programming and I'm struggling to understand what should be done with mysql commands and what should be done with Java (I'm programming the database with Java because it doesn't need to be used over the web).
Say I have the following pipeline:
1) Get an Excel file from the user. This Excel file is identifiable by an ID number.
2) Extract info from the Excel file.
3) Save the extracted info into the database. This info needs to remain identifiable by the aforementioned ID number.
4) Find the relevant ID number and get the saved extracted info.
5) Create a text document with the info.
What I need help with is part 2.
Should I save the info under an instance of a java class?
Or should I immediately save all the info in the database in a table?
Right now I'm having a hard time even seeing the point of using a database since I'm so accustomed to seeing everything in classes. Please help!
Right now I'm having a hard time even seeing the point of using a database since I'm so accustomed to seeing everything in classes.
Databases help with persistence. If you need to store information between 2 runs of your program, or you need to deal with more data than you can fit in memory at once (including virtual memory) then you need to persist data. If you don't, then you don't need a database.
MySQL in particular is a relational database, so you can often easily retrieve and manipulate portions of data based on relations -- all the widgets that have more than 5 teeth for example.
If you're used to classes, you're used to classes having 1 to 1, 1 to many, and many to many relations with other classes. Databases can have the same. In a 1:1 relation, the columns are in the same table. In 1:many and many:many relations, you have relations between tables that can be joined.
See http://www.databasejournal.com/sqletc/article.php/1469521/Introduction-to-Relational-Databases.htm for an intro to relations in the context of databases.
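To connect this to the pipeline in the question, here is a minimal sketch of a 1:many relation expressed in Java terms: each extracted value carries the ID of the Excel file it came from, the way a child row carries a foreign key. The names `RelationSketch`, `ExtractedInfo` and `uploadId` are invented for the example, and the list merely stands in for a table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a 1:many relation: one Excel upload, many extracted values.
class RelationSketch {

    // One row of an imagined "extracted_info" table; uploadId plays the foreign key.
    static final class ExtractedInfo {
        final int uploadId;
        final String label;
        final String value;

        ExtractedInfo(int uploadId, String label, String value) {
            this.uploadId = uploadId;
            this.label = label;
            this.value = value;
        }
    }

    // Similar in spirit to: SELECT label, value FROM extracted_info WHERE upload_id = ?
    static List<ExtractedInfo> findByUploadId(List<ExtractedInfo> table, int uploadId) {
        List<ExtractedInfo> matches = new ArrayList<>();
        for (ExtractedInfo row : table) {
            if (row.uploadId == uploadId) {
                matches.add(row);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<ExtractedInfo> table = new ArrayList<>();
        table.add(new ExtractedInfo(42, "customer", "ACME"));
        table.add(new ExtractedInfo(42, "total", "19.90"));
        table.add(new ExtractedInfo(43, "customer", "Globex"));
        System.out.println(findByUploadId(table, 42).size()); // prints 2
    }
}
```

A class like this is still useful with a database: it holds the extracted info while the program runs, and the table keeps the same fields between runs.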
