I'm parsing large log files (5+ GB) and extracting ad-hoc profiling lines (call name and execution time) that I want to insert into a MySQL database.
My question is: should I execute the INSERT statement every time I get a line while parsing, or is there some best practice for speeding the whole thing up?
If there is any way that you could do a bulk insert, that would help a lot (or at least send your data to the database in batches, instead of making separate calls each time).
Edit
LOAD DATA INFILE sounds even faster ;o)
https://web.archive.org/web/20150413042140/http://jeffrick.com/2010/03/23/bulk-insert-into-a-mysql-database
There are better options.
See http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
In your case, I think writing the relevant records to a file and then using LOAD DATA INFILE is the best approach.
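A minimal sketch of that approach from Java, assuming MySQL Connector/J, a hypothetical profile_log(call_name, exec_time) table, and a tab-separated temp file (LOAD DATA LOCAL INFILE may also need to be enabled on the connection and server):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadDataInfileSketch {
    public static void main(String[] args) throws Exception {
        // 1) Write the extracted profiling lines to a tab-separated file
        try (BufferedWriter out = new BufferedWriter(new FileWriter("/tmp/profile.tsv"))) {
            out.write("doStuff\t12.5");   // in reality this comes from the log parser
            out.newLine();
        }

        // 2) Bulk-load the whole file in a single statement
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb?allowLoadLocalInfile=true", "user", "pass");
             Statement st = con.createStatement()) {
            st.execute("LOAD DATA LOCAL INFILE '/tmp/profile.tsv' " +
                       "INTO TABLE profile_log (call_name, exec_time)");
        }
    }
}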
For small updates, the number of transactions is critical for performance. So if you can perform a number of inserts in the same transaction, it will go much faster. I would try 100 inserts per transaction first.
If you don't want to follow the recommendations in Galz's link (which is excellent, BTW), then try to open the connection and prepare the statement once, loop round your log files carrying out the inserts (using the prepared statement), and finally close the statement and connection once at the end. It's not the fastest way of doing the inserts, but it's the fastest way that sticks to a "normal" JDBC approach.
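A sketch of that plain JDBC loop, combined with the earlier suggestion of roughly 100 inserts per transaction (the table, columns, and log-line format are made-up assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class LogInserterSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                "INSERT INTO profile_log (call_name, exec_time) VALUES (?, ?)");
             BufferedReader in = new BufferedReader(new FileReader(args[0]))) {

            con.setAutoCommit(false);              // group inserts into transactions
            int count = 0;
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.startsWith("PROFILE")) continue;       // hypothetical profiling marker
                String[] parts = line.split("\\s+");
                ps.setString(1, parts[1]);                       // call name
                ps.setDouble(2, Double.parseDouble(parts[2]));   // execution time
                ps.executeUpdate();
                if (++count % 100 == 0) {
                    con.commit();                  // ~100 inserts per transaction
                }
            }
            con.commit();                          // commit the remainder
        }
    }
}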
From Java, use a JDBC batch insert.
Example:
You do this for every insert: http://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/lucene/Indexer.java#232
You do this for every batch: http://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/lucene/Indexer.java#371
The size of the batch can be determined by the available memory.
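Since those links may no longer resolve, here is a minimal sketch of the same pattern; the table name, the LogRecord class, and the batch size are assumptions:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInsertSketch {
    // hypothetical value class holding one parsed profiling line
    public record LogRecord(String name, double time) {}

    static void insertAll(Connection con, List<LogRecord> records) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO profile_log (call_name, exec_time) VALUES (?, ?)")) {
            final int batchSize = 1000;        // tune this to the available memory
            int count = 0;
            for (LogRecord r : records) {
                ps.setString(1, r.name());     // you do this for every insert
                ps.setDouble(2, r.time());
                ps.addBatch();
                if (++count % batchSize == 0) {
                    ps.executeBatch();         // you do this for every batch
                }
            }
            ps.executeBatch();                 // flush the last partial batch
        }
    }
}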
Aside from insert speed, the other problem you may run into is memory. Whatever approach you use, you still need to consider your memory usage as the records are loaded from the file. Unless you have a hard requirement on processing speed, it may be better to use an approach with a predictable memory footprint.
Related
I have a Java application that can use a SQL database from any vendor. Right now we have tested Vertica and PostgreSQL. I want to export all the data from one table in the DB and import it later into a different instance of the application. The database is pretty big, so there are many rows in there. The export and import process has to be done from inside the Java code.
What we've tried so far is:
Export: we read the whole table (select * from) through JDBC and then dump it to an SQL file with all the INSERTS needed.
Import: The file containing those thousands of INSERTS is executed in the target database through JDBC.
This is not an efficient process. Firstly, the select * from part is giving us problems because of its size, and secondly, executing a lot of inserts one after another causes problems in Vertica (https://forum.vertica.com/discussion/235201/vjdbc-5065-error-too-many-ros-containers-exist-for-the-following-projections).
What would be a more efficient way of doing this? Are there any tools that can help with the process, or is there no "elegant" solution?
Why not do the export/import in a single step, with batching (for performance) and chunking (to avoid errors and to provide a checkpoint to restart from after a failure)?
In most cases, databases support INSERT queries with many values, e.g.:
INSERT INTO table_a (col_a, col_b, ...) VALUES
(val_a, val_b, ...),
(val_a, val_b, ...),
(val_a, val_b, ...),
...
The number of rows you generate into a single such INSERT statement is then your chunk size, which might need tuning for the specific target database (big enough to speed things up, but small enough that the chunk does not exceed some database limit and cause failures).
As already proposed, each of these chunks should then be executed in a transaction, and your application should remember which chunk it executed successfully last, so it can continue from there at the next run if an error occurs.
For the chunks themselves, you really should use LIMIT and OFFSET on the source query.
This way you can repeat any chunk at any time, each chunk is atomic by itself, and it should perform much better than single-row statements.
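A rough sketch of this chunked copy over plain JDBC follows; the table, columns, checkpoint helpers, and chunk size are all made up, and the LIMIT ... OFFSET paging syntax would need adjusting for databases that page differently:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ChunkedCopySketch {
    public static void main(String[] args) throws Exception {
        final int chunkSize = 1000;   // tune for the target database
        try (Connection src = DriverManager.getConnection("jdbc:postgresql://src/db", "u", "p");
             Connection dst = DriverManager.getConnection("jdbc:postgresql://dst/db", "u", "p");
             Statement read = src.createStatement()) {

            dst.setAutoCommit(false);
            long offset = readCheckpoint();     // hypothetical: last successfully copied offset
            while (true) {
                ResultSet rs = read.executeQuery(
                    "SELECT col_a, col_b FROM table_a ORDER BY col_a " +
                    "LIMIT " + chunkSize + " OFFSET " + offset);

                // build one multi-row INSERT per chunk (real code should escape values more carefully)
                StringBuilder sql = new StringBuilder("INSERT INTO table_a (col_a, col_b) VALUES ");
                int rows = 0;
                while (rs.next()) {
                    if (rows++ > 0) sql.append(',');
                    sql.append('(').append(rs.getLong(1)).append(",'")
                       .append(rs.getString(2).replace("'", "''")).append("')");
                }
                rs.close();
                if (rows == 0) break;           // no more data

                try (Statement write = dst.createStatement()) {
                    write.executeUpdate(sql.toString());   // one chunk = one statement
                }
                dst.commit();                   // one chunk = one transaction
                offset += rows;
                saveCheckpoint(offset);         // hypothetical: remember where to resume
            }
        }
    }

    static long readCheckpoint() { return 0; }        // placeholder
    static void saveCheckpoint(long offset) { }       // placeholder
}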
I can only speak about PostgreSQL.
The size of the SELECT is not a problem if you use server-side cursors by calling setFetchSize with a value greater than 0 (perhaps 10000) on the statement.
The INSERTS will perform well if
you run them all in a single transaction
you use a PreparedStatement for the INSERT
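A sketch of that combination (server-side cursor on the read side, one transaction and a PreparedStatement on the write side); the connection strings and columns are made up, and note that the PostgreSQL driver only streams with setFetchSize when autocommit is off:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class CursorCopySketch {
    public static void main(String[] args) throws Exception {
        try (Connection src = DriverManager.getConnection("jdbc:postgresql://src/db", "u", "p");
             Connection dst = DriverManager.getConnection("jdbc:postgresql://dst/db", "u", "p")) {

            src.setAutoCommit(false);          // required for fetchSize to use a server-side cursor
            dst.setAutoCommit(false);          // run all the inserts in a single transaction

            try (Statement read = src.createStatement();
                 PreparedStatement write = dst.prepareStatement(
                     "INSERT INTO table_a (col_a, col_b) VALUES (?, ?)")) {

                read.setFetchSize(10000);      // stream rows instead of loading them all
                try (ResultSet rs = read.executeQuery("SELECT col_a, col_b FROM table_a")) {
                    while (rs.next()) {
                        write.setLong(1, rs.getLong(1));
                        write.setString(2, rs.getString(2));
                        write.executeUpdate();
                    }
                }
            }
            dst.commit();
        }
    }
}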
Each insert into Vertica goes into WOS (memory), and periodically data from WOS gets moved to ROS (disk) into a single container. You can only have 1024 ROS containers per projection per node. Doing many thousands of INSERTs at a time is never a good idea for Vertica. The best way to do this is to copy all that data into a file and bulk load the file into Vertica using the COPY command.
This will create a single ROS container for the contents of the file. Depending on how many rows you want to copy it will be many times (sometimes even hundreds of times) faster.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/COPY/COPY.htm
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ConnectingToVertica/ClientJDBC/UsingCOPYLOCALWithJDBC.htm
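A sketch of that file-plus-COPY approach from JDBC, roughly along the lines of the second link; the file path, delimiter, and table name are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class VerticaCopySketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:vertica://host:5433/db", "user", "pass");
             Statement st = con.createStatement()) {
            // COPY ... FROM LOCAL streams the client-side file to the server
            // and loads it in one bulk operation instead of thousands of INSERTs
            st.execute("COPY table_a FROM LOCAL '/tmp/table_a.csv' " +
                       "DELIMITER ',' ABORT ON ERROR");
        }
    }
}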
I have to read huge amounts of data from the database (for example, more than 500,000 records) and then save the data to a file. I have many issues with cursors (not only memory issues).
Is it possible to do it without a cursor, for example using a stream? If so, how can I achieve it?
I have experience working with huge data sets (almost 500 million records). I simply used a PreparedStatement query, a ResultSet, and of course some buffer tweaking through:
setFetchSize(int)
In my case, I split the program into threads because the huge table was partitioned (each thread processed one partition), but I think that is not your case.
It is pointless to fetch the data through a cursor. I would rather use a database view or a plain SQL query. Do not use an ORM for this purpose.
According to your comment, your best option is to limit JDBC to fetch only a specific number of rows at a time instead of fetching all of them (this lets processing start sooner and avoids loading the entire table into the ResultSet). Save the data into a collection and write it to a file using a BufferedWriter. You can also benefit from a multi-core CPU by running it in more threads, for example handling the first batch of fetched rows in one thread and the next batch in a second thread. If you go the threading route, use synchronized collections and be aware that you might face ordering problems.
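A minimal sketch of that fetch-and-write loop; the query, fetch size, and CSV layout are made up:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ExportToFileSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://host/db", "u", "p");
             PreparedStatement ps = con.prepareStatement("SELECT id, payload FROM big_table")) {

            con.setAutoCommit(false);   // some drivers only stream with autocommit off
            ps.setFetchSize(10000);     // fetch rows in chunks instead of all at once

            try (ResultSet rs = ps.executeQuery();
                 BufferedWriter out = new BufferedWriter(new FileWriter("export.csv"))) {
                while (rs.next()) {
                    out.write(rs.getLong(1) + "," + rs.getString(2));
                    out.newLine();
                }
            }
        }
    }
}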
Hi, I am trying to write to Sybase IQ using JDBC from a file which contains thousands of rows. People say that I should use batchUpdate, so I am reading the file with NIO and adding the rows to PreparedStatement batches. But I don't see any advantage here, since for all the rows I still need to do the following:
PreparedStatement prepStmt = con.prepareStatement(
"UPDATE DEPT SET MGRNO=? WHERE DEPTNO=?");
prepStmt.setString(1,mgrnum1);
prepStmt.setString(2,deptnum1);
prepStmt.addBatch();
I don't understand what the advantage of batches is; I still have to call addBatch thousands of times, once for every record in the file. Or should I even be using addBatch() to write records from a file to Sybase IQ? Please guide. Thanks a lot.
With batch updates, basically, you're cutting down on your Network I/O overhead. It's providing the benefits analogous to what a BufferedWriter provides you while writing to the disk. That's basically what this is: buffering of database updates.
Any kind of I/O has a cost; be it disk I/O or network. By buffering your inserts or updates in a batch and doing a bulk update you're minimizing the performance hit incurred every time you hit the database and come back.
The performance hit becomes even more obvious in case of a real world application where the database server is almost always under some load serving other clients as opposed to development where you're the only one.
When paired with a PreparedStatement, the bulk updates are even more efficient because the statement is pre-compiled and the execution plan is cached throughout the execution of the batch. So the binding of variables happens as per your chosen batch size, and then a single executeBatch() call persists all the values in one go.
The advantage of addBatch is that it allows the JDBC driver to send chunks of data instead of sending single insert statements to the database.
This can be faster in certain situations, but real life performance may vary.
It should also be noted that it's recommended to use batches of 50-100 rows, instead of adding all the data into a single batch.
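For completeness, the snippet in the question only shows addBatch; the batch still has to be flushed with executeBatch, e.g. every 100 rows as suggested above. A generic (not Sybase-specific) sketch, where the list of parsed rows is an assumption:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchUpdateSketch {
    // each String[] holds {mgrnum, deptnum} parsed from one line of the file
    static void updateAll(Connection con, List<String[]> rows) throws SQLException {
        try (PreparedStatement prepStmt = con.prepareStatement(
                "UPDATE DEPT SET MGRNO=? WHERE DEPTNO=?")) {
            int count = 0;
            for (String[] row : rows) {
                prepStmt.setString(1, row[0]);
                prepStmt.setString(2, row[1]);
                prepStmt.addBatch();
                if (++count % 100 == 0) {
                    prepStmt.executeBatch();   // one round trip for ~100 updates
                }
            }
            prepStmt.executeBatch();           // flush the remaining rows
        }
    }
}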
If I want to fetch a million rows in Hibernate, how would it work? Will Hibernate crash? How can I optimize that?
Typically you wouldn't use Hibernate for this. If you need to do a batch operation, use SQL or the Hibernate wrappers for batch operations. There is no way that loading millions of records is going to end well for your application. Your app will thrash as the GC runs, or possibly crash. There has to be another option.
If you read one/write one it will probably work fine. Are you sure this is the way you want to read 1,000,000 rows? It will likely take a while.
If you want all the objects to be in memory at the same time, you might well be challenged.
You can optimize it best, probably, by finding a different way. For example, you can dump from the database using database tools much more quickly than reading with hibernate.
You can select sums, maxes, and counts in the database without returning a million rows over the network.
What are you trying to accomplish, exactly?
For this you would be better off using Spring's JDBC tools with a row handler. It will run the query and then perform some action on one row at a time.
Bring only the columns you need. Try it out in a test environment.
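A sketch of what that could look like with Spring's JdbcTemplate and a RowCallbackHandler; the query, DataSource, and fetch size are assumptions:

import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;

public class RowByRowSketch {
    void export(DataSource dataSource) {
        JdbcTemplate jdbc = new JdbcTemplate(dataSource);
        jdbc.setFetchSize(5000);   // stream rows instead of buffering the whole result

        jdbc.query("SELECT id, name FROM big_table", new RowCallbackHandler() {
            @Override
            public void processRow(ResultSet rs) throws SQLException {
                // called once per row; only the current row is held in memory
                handle(rs.getLong("id"), rs.getString("name"));
            }
        });
    }

    void handle(long id, String name) {
        // per-row work goes here, e.g. write the row to a file
    }
}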
You should try looking at the StatelessSession interface, an example of which can be found here:
http://mrmcgeek.blogspot.com/2010/09/bulk-data-loading-with-hibernate.html
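In case that link disappears, the general shape of the StatelessSession approach (written against the Hibernate 5 API; the LogEntry entity and query are assumptions) is roughly:

import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;

public class StatelessReadSketch {
    void readAll(SessionFactory sessionFactory) {
        StatelessSession session = sessionFactory.openStatelessSession();
        try {
            // no first-level cache and no dirty checking, so memory use stays flat
            ScrollableResults results = session
                    .createQuery("from LogEntry")          // hypothetical mapped entity
                    .scroll(ScrollMode.FORWARD_ONLY);
            while (results.next()) {
                Object entity = results.get(0);            // one row at a time
                // process the entity here, then let it go out of scope
            }
            results.close();
        } finally {
            session.close();
        }
    }
}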
I am connecting to an Oracle DB through a Java program. The problem is I am getting an OutOfMemory exception because the SQL is returning 3 million records. I cannot increase the JVM heap size for some reason.
What is the best solution to solve this?
Is the only option to run the SQL with LIMIT?
If your program needs to return 3 million records at once, you're doing something wrong. What do you need to do that requires processing 3 million records at once?
You can either split the query into smaller ones using LIMIT, or rethink what you need to do to reduce the amount of data you need to process.
In my opinion it is pointless to have queries that return 3 million records. What would you do with them? There is no point in presenting them to the user, and if you want to do some calculations it is better to run several queries that each return considerably fewer records.
Using LIMIT is one solution, but a better solution would be to restructure your database and application so that you can have "smarter" queries that do not return everything in one go. For example you could return records based on a date column. This way you could have the most recent ones.
Application scaling is always an issue. The solution here will be to do whatever you are trying to do in Java as a stored procedure in Oracle PL/SQL. Let Oracle process the data and use its internal query planner to limit the amount of data flowing in and out and possibly causing major latencies.
You can even write the stored procedure in Java.
A second solution would be to make a limited query, process it from several Java nodes, and collate the results. Look up map-reduce.
If each record is around 1 kilobyte, that means 3 GB of data. Do you have that amount of memory available for your application?
It would be better if you explained the "real" problem, since OutOfMemory is not your actual problem.
Try this:
http://w3schools.com/sql/sql_where.asp
There could be three possible solutions:
1. If retrieving 3 million records at once is not necessary, use LIMIT.
2. Consider using a meaningful WHERE clause.
3. Export the database entries into txt, csv, or Excel format with the tool that Oracle provides and use that file instead.
Cheers :-)
Reconsider your WHERE clause and see if you can make it more restrictive,
and/or
use LIMIT.
Just for reference, in Oracle queries the equivalent of LIMIT is ROWNUM.
E.g., ... WHERE ROWNUM <= 1000
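A sketch of ROWNUM-based paging from JDBC in the classic (pre-12c) style; the table, ordering column, and page size are made up, and newer Oracle versions also support OFFSET ... FETCH:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RownumPagingSketch {
    public static void main(String[] args) throws Exception {
        final int pageSize = 1000;
        String sql =
            "SELECT * FROM (" +
            "  SELECT t.*, ROWNUM rn FROM (" +
            "    SELECT id, payload FROM big_table ORDER BY id" +
            "  ) t WHERE ROWNUM <= ?" +
            ") WHERE rn > ?";

        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//host:1521/service", "user", "pass");
             PreparedStatement ps = con.prepareStatement(sql)) {
            int offset = 0;
            while (true) {
                ps.setInt(1, offset + pageSize);   // upper bound for this page
                ps.setInt(2, offset);              // lower bound for this page
                int rows = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows++;
                        // process one row at a time here
                    }
                }
                if (rows < pageSize) break;        // last page reached
                offset += pageSize;
            }
        }
    }
}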
If you get that large a response, then take care to process the result set row by row so that the full result does not need to be in memory. If you do that properly, you can process enormous data sets without problems.