Download complete DB as CSV - java

I have a complicated database that deals with many tables. Some of them are related as in the following table:
Now the requirement is to collect all the corresponding rows in all of the tables, starting from one identifying field in the entry-level table, and download the result as CSV.
What comes to mind is a simple iterative strategy: query the tables one after another and store the relevant data. But this seems inefficient, since the queries take too long and I have to iterate several times to get everything I need.
Is there any better approach to this problem? I'm using JSP, Java, Spring and MySQL.

I would suggest using the mysql command-line client or the mysqldump utility (the fastest approach); you can also use a DB tool such as Toad or MySQL Workbench. See if these posts help:
mysqlworkbench
mysqldump
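If the export has to stay inside the Java application rather than go through mysqldump, the CSV-writing part itself is small. Below is a minimal sketch, assuming the rows have already been fetched as lists of strings; `CsvLines` and its methods are hypothetical names, and exporting NULL as an empty field is an assumption, not something the question specifies.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: formats one row as a CSV line using RFC 4180 style quoting. */
final class CsvLines {

    /** Quotes a field only when it contains a comma, a double quote, CR or LF. */
    static String escape(String field) {
        if (field == null) {
            return ""; // assumption: NULL is exported as an empty field
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Embedded double quotes are doubled, then the whole field is wrapped in quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins the escaped fields of one row with commas (no line terminator added). */
    static String toLine(List<String> row) {
        List<String> escaped = new ArrayList<>();
        for (String field : row) {
            escaped.add(escape(field));
        }
        return String.join(",", escaped);
    }
}
```

Each fetched row can then be written with something like `writer.write(CsvLines.toLine(row))` followed by a line break, streaming to the HTTP response instead of collecting everything in memory first.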

Related

Bulk Inserting Data into PostgreSQL

I have a Spring Boot project that will pull a large amount of data from one database, apply some transformation to it, and then insert it into a table in a PostgreSQL database. This process will run over a few billion records, so performance is key.
I've been researching the best way to do this, such as using an ORM or a JdbcTemplate. One thing I keep seeing regarding bulk inserts into PostgreSQL is the COPY command. https://www.postgresql.org/docs/current/populate.html
I'm confused because using COPY requires the data to be written into a file, and while I've seen people recommend it, I've yet to come across anyone who explains how to get the data into the file. Isn't writing to a file slow? And if it is, doesn't that cancel out the performance gain COPY brings?
This kind of data migration and conversion is better handled in stored procedures. Assuming the source data is already loaded into Postgres (if not, use a Postgres utility to load the raw data into a flat table), write a series of stored procedures to transform the data and insert it into the destination table.
I have done some complex data migrations and used this approach. If you have to do a lot of complex data conversion, write a Python script (which is usually faster than a Spring Boot/Data setup), insert the partially converted data, then use stored procedures to do the final conversion.
It is better to keep the business logic that converts/massages the data close to the data source (in stored procedures) instead of pulling the data to the app server and reinserting it.
Hope it helps.
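On the question's worry about files: COPY can also read FROM STDIN, so rows can be streamed from memory rather than written to disk first, and the PostgreSQL JDBC driver exposes this through its CopyManager class. The sketch below only shows how rows might be encoded in COPY's default text format (tab-separated columns, `\N` for NULL, backslash escapes); `CopyText` is a hypothetical helper, and wiring its output to the driver is left out.

```java
import java.util.List;

/** Hypothetical helper: encodes rows in PostgreSQL COPY text format. */
final class CopyText {

    /** Escapes one value; null becomes the \N marker that COPY reads as NULL by default. */
    static String escape(String value) {
        if (value == null) {
            return "\\N";
        }
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break; // backslash must be doubled
                case '\t': sb.append("\\t"); break;  // tab is the column delimiter
                case '\n': sb.append("\\n"); break;  // newline is the row terminator
                case '\r': sb.append("\\r"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds one tab-separated row terminated by a newline. */
    static String toRow(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                sb.append('\t');
            }
            sb.append(escape(values.get(i)));
        }
        return sb.append('\n').toString();
    }
}
```

The encoded rows would be fed to the driver in batches through a Reader or InputStream, so no intermediate file is involved.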

Get or Retrieve Generated PKs after a massive insert SQLLDR

I'll be direct about my situation. I'm working on a project that performs a "base load" procedure from an Excel (xlsx, xls) file. It was developed in Java with JDBC drivers. The project works today: it takes an Excel file and, based on a configuration, performs the inserts into different tables. The problem is that it takes too long (around 2 hours to insert 3000 records), which makes it inefficient. In the future this software will insert around 30k records, and it will be painfully slow. So I need to improve its efficiency, and my idea is this: instead of inserting from Java via JDBC, I will generate control files and data files to be loaded into the DB using SQLLDR.
The problem I'm facing is that I need to insert this data into several tables, and these tables are related to each other. That means that if I insert a person into "Person_table", I need the primary key generated by a database sequence to insert the address, phone, email, etc. into the other tables, and I do not know how to get the primary keys generated by the first insert via SQLLDR.
I'm not sure yet whether SQLLDR is the best way to do this, but I guess it is, because the DBMS is Oracle.
Can you point me toward how to do this? Any suggestion is welcome, even if it is not about SQLLDR.
I'm kind of stuck at this point, and I'd really appreciate any help you can give me.
SQL*Loader can't read native Excel files (at least, as far as I know). Therefore, you'll have to save the data as a CSV file.
As you need to manipulate foreign key constraints, consider switching to the external tables feature. The mechanism behind it is still SQL*Loader, but you can write (PL/)SQL against those files/tables (yes, a CSV file stored on a hard disk acts as if it were an Oracle table).
So, you'd "load" one table, populate primary key values, then populate another (child) table, possibly a "temporary" one (not necessarily a global temporary table) which doesn't have any constraints enabled, populate foreign key values, and move the data into the "real" target table, whose constraints now won't fail.
Possible drawback: the CSV files have to reside in a directory that is accessible to the database server, as you'll have to create a directory (Oracle object) and grant the required privileges (usually read, write) to the user who will be using it. The directory is usually created on the server itself; if not, you'll have to use a UNC path when creating it.
Now you have something to read about/research; see if it makes sense to you.
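The order of operations in that approach (load parents, fill in their keys, stage children without constraints, resolve the foreign keys, then move the rows) can be illustrated in plain Java. This is an in-memory sketch only, not SQL*Loader or external-table code: `StagedLoad`, the string "person key" taken from the file, and the counter standing in for the database sequence are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory illustration of the staged parent/child load. */
final class StagedLoad {

    /** Step 1: give every distinct parent key from the file an id (stand-in for a sequence). */
    static Map<String, Long> assignParentIds(List<String> personKeys, long firstId) {
        Map<String, Long> ids = new LinkedHashMap<>();
        long next = firstId;
        for (String key : personKeys) {
            if (!ids.containsKey(key)) {
                ids.put(key, next++);
            }
        }
        return ids;
    }

    /**
     * Step 2: staged child rows are {personKey, value}; replace the file's key
     * with the generated parent id so the row can go into the real child table.
     */
    static List<Object[]> resolveChildren(List<String[]> stagedRows, Map<String, Long> parentIds) {
        List<Object[]> resolved = new ArrayList<>();
        for (String[] row : stagedRows) {
            Long parentId = parentIds.get(row[0]);
            if (parentId == null) {
                // This is the row that would have violated the foreign key constraint.
                throw new IllegalStateException("No parent row for key " + row[0]);
            }
            resolved.add(new Object[] { parentId, row[1] });
        }
        return resolved;
    }
}
```

In the database the same two steps would be set-based SQL statements against the staging table rather than Java loops, which is where the speed-up over row-by-row JDBC inserts would come from.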

Write to Hive in Java MapReduce Job

I am currently working on a Java MapReduce job, which should output data to a bucketed Hive table.
I can think of two approaches:
The first is to write directly to Hive via HCatalog. The problem is that this approach does not support writing to a bucketed Hive table. Hence, when using a bucketed Hive table, I first need to write to a non-bucketed table and then copy the data to the bucketed one.
The second option is to write the output to a text file and load this data into Hive afterwards.
What is the best practice here?
Which approach is more performant with a huge amount of data (with respect to memory and time taken)?
Which approach would be the better one, if I could also use non-bucketed Hive tables?
Thanks a lot!
For non-bucketed tables, you can store your MapReduce output in the table storage location. Then you'd only need to run MSCK REPAIR TABLE to update the metadata with the new partitions.
Hive's load command actually just copies the data to the table storage location.
Also, from the Hive documentation:
The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.
So you'd need to tweak your MapReduce output to fit these constraints.
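For the bucketed case, this means each reducer has to receive exactly the rows of one bucket. A rough sketch of that assignment follows, assuming the bucket is the non-negative hash of the bucketing column modulo the bucket count (the same shape Hadoop's default HashPartitioner uses to pick a reducer). `BucketAssign` is a hypothetical name, and whether `hashCode()` matches the hash Hive computes for your column type is an assumption to verify before relying on it.

```java
/** Hypothetical helper: maps a key to a bucket (and therefore to a reducer). */
final class BucketAssign {

    /** Clears the sign bit so the result is never negative, then takes the modulo. */
    static int bucketForHash(int hash, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    /** Assumption: the key's Java hashCode() is the hash the table expects. */
    static int bucketForKey(Object key, int numBuckets) {
        return bucketForHash(key.hashCode(), numBuckets);
    }
}
```

The same expression would go into a custom Partitioner's `getPartition`, with the job's number of reduce tasks set equal to the number of buckets.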

Informix, MySQL and Oracle blob contains

We have an application that runs with any of IBM Informix, MySQL and Oracle, and we are using Java with Hibernate to connect to the database. We will store XML, CSV and other text-based files inside the database (clob column). The entities in Java are byte[] objects.
One feature request to the application is now to "grep" content inside the data. So I need to find all files with a specific content.
On regular char/varchar fields I can use like '%xyz%', but this is not working on byte[] / blobs.
The first approach was to load each entity, convert the byte[] into a string and use the contains method in Java. If the user enters any filter parameters on other (non-clob) columns, I will apply those filters before testing the clob in order to reduce the number of blobs I have to scan.
That worked quite well for 100 files (clobs) and as long as the application and database are on the same server. But I think it will get really slow if I have 1.000.000 files inside the database and the database is not always in the same network. So I think that is not a good idea.
My next thought was creating a database procedure. But I am not sure whether this is possible on all three of Informix, MySQL and Oracle.
The last but not favored method is to store the content of the data not inside a clob. Maybe I can use a different datatype for that?
Does anyone have a good idea how to implement that? I need a solution for all three DBMS. The application knows which kind of DBMS it is connected to, so it would be okay if I had three different solutions (one for each DBMS).
I am completely open to changing what kind of datatype I use (BLOB, CLOB ...) — I can modify that as I want.
Note: the clobs will range from about 5 KiB to about 500 KiB, with a maximum of 1 MiB.
Look into Apache Lucene or another text-indexing library.
https://en.wikipedia.org/wiki/Lucene
http://en.wikipedia.org/wiki/Full_text_search
If you go with a DB-specific solution like Oracle Text search, you will have to implement a custom solution for each database. I know from experience that Oracle Text search takes significant time to learn and involves a lot of tweaking to get just right.
Also, if you use a DB solution you would receive different results in each DB even if the data sets were the same (each DB would have its own methods of indexing and retrieving the data).
By going with a third-party solution like Lucene, you only have to learn one solution, and results will be consistent regardless of the DB.
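What such a library does at its core is build an inverted index once, so a search looks up a term instead of scanning every file. Here is a toy sketch of the idea in plain Java. It is not Lucene's API: `TinyIndex` is a hypothetical class, it assumes the byte[] content is UTF-8 text, and it matches whole lowercased words rather than arbitrary substrings the way like '%xyz%' would.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy inverted index: term -> ids of the files containing it. */
final class TinyIndex {

    private final Map<String, Set<Long>> postings = new HashMap<>();

    /** Indexes one file; assumes the stored bytes are UTF-8 encoded text. */
    void add(long fileId, byte[] content) {
        String text = new String(content, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
        // Split on anything that is not a letter or digit (tags, commas, whitespace...).
        for (String term : text.split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(fileId);
            }
        }
    }

    /** Returns the ids of all files containing the given word (case-insensitive). */
    Set<Long> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

A real library adds persistence, relevance ranking, phrase queries and incremental updates on top of this; the sketch only shows why the lookup cost no longer grows with the number of stored files.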

Result Set to Multi Hash Map

I have a situation here. I have a huge database table with more than 10 columns and millions of rows. I am using a matching algorithm that matches each input record against the values in the database.
The database operation takes a lot of time when there are millions of records to match. I am thinking of using a multi-hash map or some other ResultSet alternative so that I can hold the whole table in memory and avoid hitting the database again.
Can anybody tell me what I should do?
I don't think this is the right way to go. You are trying to do the database's work manually in Java. I'm not saying that you are not capable of doing this, but most databases have been developed over many years and are quite good at doing exactly what you want.
However, databases need to be configured correctly for a given type of query to be executed fast. So my suggestion is that you first check whether you can tweak the database configuration to improve the performance of the query. The most common thing is to add the right indexes to your table. Read How MySQL Uses Indexes or the corresponding part of the manual of your particular database for more information.
The other thing is that, with this much data, storing everything in main memory is probably not faster and might even be infeasible, not to mention that you would have to transfer all of the data first.
In any case, try to use a profiler to identify the bottleneck of the program first. Maybe the problem is not even on the database side.
