Bulk Update in Oracle from CSV file

Bulk Update in Oracle from CSV file - java

I have table and CVS file what i want to do is from csv have to update the table.
csv file as follows (no delta)
1,yes
2,no
3,yes
4,yes
Steps through java
what i have did is read the csv file and make two lists like yesContainList,noContainList
in that list added the id values which has yes and no seperately
make the list as coma seperated strinh
Update the table with the comma seperated string
Its working fine. but if i want to handle lakhs of records means somewhat slow.
Could anyone tell whether is it correct way or any best way to do this update?

There are 2 basic techniques to do this:
sqlldr
Use an external table.
Both methods are explained here:
Update a column in table using SQL*Loader?

Doing jobs like bulk operation, import, exports or heavy SQL operation is not recommended to be done outside RDBMS due to performance issues.
By fetching and sending large tables throw ODBC like API's you will suffer network round trips, memory usage, IO hits ....
When designing a client server application (like J2EE) do you design a heavy batch operation being called and controlled from user interface layer synchronously or you will design a server side process triggered by clients command?.
Think about your java code as UI layer and RDBMS as server side.
BTW RDBMS's have embedded features for these operations like SQLLOADER in oracle.

Related

Informix, MySQL and Oracle blob contains

We have an application that runs with any of IBM Informix, MySQL and Oracle, and we are using Java with Hibernate to connect to the database. We will store XML, CSV and other text-based files inside the database (clob column). The entities in Java are byte[] objects.
One feature request to the application is now to "grep" content inside the data. So I need to find all files with a specific content.
On regular char/varchar fields I can use like '%xyz%', but this is not working on byte[] / blobs.
The first approach was to load each entity, cast the byte[] into a string and use the contains method in Java. If the use enters any filter parameters on other (non-clob) columns, I will apply those filters before testing the clob in order to reduce the number of blobs I have to scan.
That worked quite well for 100 files (clobs) and as long as the application and database are on the same server. But I think it will get really slow if I have 1.000.000 files inside the database and the database is not always in the same network. So I think that is not a good idea.
My next thought was creating a database procedure. But I am not quite sure if this is possible for Informix, MySQL and Oracle. And I am not sure if this is possible.
The last but not favored method is to store the content of the data not inside a clob. Maybe I can use a different datatype for that?
Does anyone has a good idea how to realize that? I need a solution for all three DBMS. The application knows on what kind of DBMS it is connected to. So it would be okay, if I have three different solutions (one for each DBMS).
I am completely open to changing what kind of datatype I use (BLOB, CLOB ...) — I can modify that as I want.
Note: the clobs will range from about 5 KiB to about 500 KiB, with a maximum of 1 MiB.

Look into Apache Lucene or other text indexing library.
https://en.wikipedia.org/wiki/Lucene
http://en.wikipedia.org/wiki/Full_text_search
If you go with a DB specific solution like Oracle Text Search you will have to implement a custom solution for each database. I know from experience that Oracle Text search takes significant time to learn and involves a lot of tweaking to get just right.
Also, if you use a DB solution you would receive different results in each DB even if the data sets were the same (each DB would have it's own methods of indexing and retrieving the data).
By going with a 3rd party solution like Lucene -- you only have to learn one solution and results will be consistent regardless of the Db.

Java Data base query/update for large amount of data

What is the best way to implement the following scenario?
I need to call/query a data base table containing millions of records from a java application. Then for each records in the table, my application should call a third party API and get a status field as response. Then my application should again update each row in the table with the information (status) from the API.
Note - I am trying to figure out a method to do this in the best possible way. I understand that querying all the records together is not the best way forward.

Do not try to eat the elephant in one bite. Chunk it. Heard of pagination? Use it. See here: MySQL pagination without double-querying?

you can use oracle feature such as SQL loader, Data pumping Called via JDBC or script..

Databases are not designed to update millions of records via Java API repeatedly. This can take many minutes. If this is not enough, you may need to use a dataset embedded in Java (either caching or replacing your database)

Why is file system storage faster than SQL databases

Extending this thread - I would just like to know why it's faster to retrieve files from a file system, rather than a MySQL database. If one were to benchmark the two to see which would retrieve the most data (multiple types of data) over 10 minutes - which one would win?
If a file system is truly faster, then why not just store everything in a file system and replace a database with csv or xml?
EDIT 1:
I found a good resource for alternate storage options for java
EDIT 2:
I'm looking for a Java API/Jar that has the functionality of a SQL Database Server Engine (or at least some of it) that uses XML for data storage (preferably). If you know of something, please leave a comment below.

At the end of the day the database does just store the data in the file system. It's all the useful stuff on top of just the raw data that makes you decide to use a database.
If you can replicate the functionality, scalability, robustness, integrity, etc, etc of a database system using CSV and still make it perform faster than a relational database then yes I'd suggest doing it your way.
It'd take you a few years to get there though.
Of course, relational systems are not the only way to store data. There are object-oriented database systems (db4o, InterSystems Cache) and document-based systems (RavenDB).
Performance is also relative to the style and volume of data you are working with and what you intend to do with it - I'm not going to even try and discuss that, it's too open ended.
I will also not start the follow on discussion: if memory is truly faster than the file system, why not just store everything in memory? :-)
This also seems similar to another question I answered a long while ago:
Is C# really slower than say C++?
Basically stuff isn't always done just for performance.

MySQL uses the file system the same as everything else on a computer. To retrieve a single piece of data, or a table of data, there is no faster way that directly from the file system. MySQL would just be a small bit of overhead added to that file system pull.
If you need to do some intelligent selecting, match some rows, or filter that data, MySQL is going to do that faster than most other options. The database server provides you calculation and data manipulation power that a filesystem can't.

When you have mixed/structured data, a DBMS is the only solution. For eg. try to get the people's name, surname and country for all your customers stored into your DB, but only those born in 1981 and living in Rome. If you have this data into files on the filesystem, how do you easily get only the required data without scanning all your files and how do you join returned data?
A DBMS give you much more than that.
Many DBMS store data into files.
This abstraction layer will make you retrieve data in a very easily, standard and structured way.

The difference is in how the desired data is located.
In a file system, locating the desired data means searching through all existing data until you find it.
Databases provide indexing which results in locating the desired data almost immediately (within ~12 comparisons) regardless of the amount of data.
What we want is an indexed file system - lucky for us, we have them. They are called databases.

Justification of the need for an in-memory database

My use case is as follows --
I have a database table with around 1000+ entries and this table is updated/edited infrequently but i expect this to change in future. Some of the columns in the table contain strings that are of considerable length.
Now I am in the process of writing a UI application that will have some mouseover events that will display texts derived from the aforementioned database table.
I have, for my use case, decided to write a backend 'server' that will host an in-memory database that will have all the data that was present in the aforementioned table. The UI app will now, on startup, cache the required data from the in-memory database present or hosted by the backend server.
Does my use case justify using an in-memory database ? If not, what are the alternatives I should consider ?
EDIT 1 --
My use case also involves running multiple searches of varying complexity on the database very frequently.
Thanks
p1ng

Seems like an excellent use-case for an in-memory database. Writing it yourself, on the other hand, is probably not the way to go.
There are plenty of existing options for just about any imaginable scenario: http://en.wikipedia.org/wiki/In-memory_database
If you're doing complex searches on text data, Lucene is quite excellent. It has special in-memory storage backends, but really, it doesn't matter for such a tiny dataset - it will always be quickly cached anyway.

Is a good idea do processing of a large amount of data directly on database?

I have a database with a lot of web pages stored.
I will need to process all the data I have so I have two options: recover the data to the program or process directly in database with some functions I will create.
What I want to know is:
do some processing in the database, and not in the application is a good
idea?
when this is recommended and when not?
are there pros and cons?
is possible to extend the language to new features (external APIs/libraries)?
I tried retrieving the content to application (worked), but was to slow and dirty. My
preoccupation was that can't do in the database what can I do in Java, but I don't know if this is true.
ONLY a example: I have a table called Token. At the moment, it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know if a word between two token classified as `Proper Name´ is part of name or not.
I will need to process all the data. In this case, doing directly on database is better than retrieving to application?

My preoccupation was that can't do in the database what can I do in
Java, but I don't know if this is true.
No, that is not a correct assumption. There are valid circumstances for using database to process data. For example, if it involves calling a lot of disparate SQLs that can be combined in a store procedure then you should do the processing the in the stored procedure and call the stored proc from your java application. This way you avoid making several network trips to get to the database server.
I do not know what are you processing though. Are you parsing XML data stored in your database? Then perhaps you should use XQuery and a lot of the modern databases support it.
ONLY an example: I have a table called Token. At the moment, it has
180,000 rows, but this will increase to over 10 million rows. I need
to do some processing to know if a word between two token classified
as `Proper Name´ is part of name or not.
Is there some indicator in the data that tells it's a proper name? Fetching 10 million rows (highly susceptible to OutOfMemoryException) and then going through them is not a good idea. If there are certain parameters about the data that can be put in a where clause in a SQL to limit the number of data being fetched is the way to go in my opinion. Surely you will need to do explains on your SQL, check the correct indices are in place, check index cluster ratio, type of index, all that will make a difference. Now if you can't fully eliminate all "improper names" then you should try to get rid of as many as you can with SQL and then process the rest in your application. I am assuming this is a batch application, right? If it is a web application then you definitely want to create a batch application to do the staging of the data for you before web applications query it.
I hope my explanation makes sense. Please let me know if you have questions.

Directly interacting with the DB for every single thing is a tedious job and affects the performance...there are several ways to get around this...you can use indexing, caching or tools such as Hibernate which keeps all the data in the memory so that you don't need to query the DB for every operation...there are tools such as luceneIndexer which are very popular and could solve your problem of hitting the DB everytime...

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.