Fetching millions of records in java [closed]


Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
A very open question.
I need to write a Java client that reads millions of records (let's say account information) from an Oracle database, dumps them into an XML file, and sends it through web services to a vendor.
What is the most efficient way to do this, starting with fetching the millions of records? I went the JPA/Hibernate route and got OutOfMemoryError when fetching 2 million records.
Is JDBC a better approach, fetching each row and building the XML as I go? Are there any other alternatives?
I am not an expert in Java, so any guidance is appreciated.

We faced a similar problem some time back, with more than 2M records. This is how we approached it.
Using an O/R mapping tool was ruled out because of the overhead: it creates large numbers of POJOs, which are not needed if the data is simply going to be dumped to XML.
Plain JDBC is the way to go. Its main advantage is that it returns a ResultSet, which typically does not hold all the results at once, so the problem of loading the entire data set into memory goes away. The data is loaded as we iterate over the ResultSet.
Next comes the creation of the XML file. We created an XML file and opened it in append mode.
In the loop that iterates over the ResultSet, we create XML fragments and append them to the XML file. This goes on until the entire ResultSet has been iterated.
In the end we have an XML file with all the records.
To share this file, we created a web service that returns the URL of the XML file (archived/zipped) if the file is available.
The client can download the file at any time after that.
Note that this is not a synchronous system: the file does not become available immediately after the client makes the call. Since creating the XML takes a long time, an HTTP request would normally time out, hence this approach.
This is just one approach you can take a cue from. Hope this helps.
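As a rough illustration of the "append fragments, then zip" idea above, the export can be compressed while it is being written, so no uncompressed copy is needed. This is only a sketch: the class name, the accounts root element and the pre-built fragment strings are assumptions for illustration, not the original poster's code.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.zip.GZIPOutputStream;

final class GzipExportSketch {

    /**
     * Writes each XML fragment to a gzip-compressed stream as soon as it is
     * produced and returns the number of fragments written.
     * The target stream is closed when the method returns.
     */
    static long writeCompressed(Iterator<String> fragments, OutputStream target)
            throws IOException {
        long count = 0;
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(target), StandardCharsets.UTF_8))) {
            out.write("<accounts>"); // hypothetical root element
            while (fragments.hasNext()) {
                // Only the current fragment is held in memory.
                out.write(fragments.next());
                count++;
            }
            out.write("</accounts>");
        }
        return count;
    }
}
```

In a real export the iterator would be backed by the ResultSet loop, and the target would be a FileOutputStream for the file that the web service later hands out.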

Use ResultSet#setFetchSize() to tune the number of records fetched from the database at a time.
See What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?
"In JDBC, the ResultSet#setFetchSize(int) method is very important to performance and memory-management within the JVM as it controls the number of network calls from the JVM to the database and correspondingly the amount of RAM used for ResultSet processing."
Read here about Oracle ResultSet Fetch Size
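A minimal sketch of what that looks like in plain JDBC, assuming a forward-only, read-only statement and a caller-supplied handler. The class and method names are made up for illustration, and a sensible fetch size depends on row width and the driver, so treat any value as a starting point to measure.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.function.Consumer;

final class FetchSizeSketch {

    /**
     * Passes the first column of every row to the handler without collecting
     * rows in a list, and returns the number of rows seen.
     */
    static long streamFirstColumn(Connection conn, String sql, int fetchSize,
                                  Consumer<String> handler) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // A hint to the driver: how many rows to pull per round trip.
            ps.setFetchSize(fetchSize);
            long count = 0;
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    handler.accept(rs.getString(1));
                    count++;
                }
            }
            return count;
        }
    }
}
```

The handler is where each row would be turned into XML and written out, so memory use stays roughly proportional to the fetch size rather than to the table size.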

For this size of data, you can probably get away with starting Java with more memory. Look into the -Xmx and -Xms options when you start Java.
If your data is truly too big to fit in memory, but not big enough to warrant investment in different technology, think about operating in chunks. Does this have to be done at once? Can you slice up the data into 10 chunks and do each chunk independently? If it has to be done in one shot, can you stream data from the database, and then stream it into the file, forgetting about things you are done with (to keep memory use in the JVM low)?
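One way to slice the work, sketched under the assumption that the table has a numeric key whose minimum and maximum are known (the id column and the class name here are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkRanges {

    /**
     * Splits the inclusive key range [minId, maxId] into at most `chunks`
     * contiguous, non-overlapping inclusive ranges of roughly equal width.
     */
    static List<long[]> split(long minId, long maxId, int chunks) {
        if (chunks <= 0 || maxId < minId) {
            throw new IllegalArgumentException("need chunks > 0 and minId <= maxId");
        }
        List<long[]> ranges = new ArrayList<>();
        long total = maxId - minId + 1;
        long size = (total + chunks - 1) / chunks; // ceiling division
        for (long lo = minId; lo <= maxId; lo += size) {
            long hi = Math.min(lo + size - 1, maxId);
            ranges.add(new long[] {lo, hi});
        }
        return ranges;
    }
}
```

Each range could then be bound to a query such as WHERE id BETWEEN ? AND ? and processed independently. Ranges of equal width only hold similar numbers of rows if the keys are fairly evenly distributed.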

Read the records in chunks, as explained in the previous answers.
Use StAX (http://stax.codehaus.org/) to stream the record chunks to your XML file, as opposed to building all the records into one large document.
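A small sketch of StAX output using the javax.xml.stream API that ships with the JDK. The element and attribute names, and the idea that each row arrives as an {id, name} string pair, are assumptions made for the example.

```java
import java.io.Writer;
import java.util.Iterator;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class StaxExportSketch {

    /**
     * Writes one <account> element per row and returns the row count.
     * Each row is assumed to be {id, name}. Nothing is kept in memory
     * once an element has been written.
     */
    static int writeAccounts(Iterator<String[]> rows, Writer out)
            throws XMLStreamException {
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        int count = 0;
        xml.writeStartDocument();
        xml.writeStartElement("accounts");
        while (rows.hasNext()) {
            String[] row = rows.next();
            xml.writeStartElement("account");
            xml.writeAttribute("id", row[0]);
            xml.writeCharacters(row[1]); // the writer escapes &, < and >
            xml.writeEndElement();
            count++;
        }
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close(); // flushes the XML writer; does not close `out`
        return count;
    }
}
```

Compared with appending hand-built strings, the stream writer takes care of escaping and of closing elements in the right order.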

As far as the Hibernate side is concerned, fetch using a SELECT query (instead of a FROM query) to avoid filling up the caches; alternatively, use a StatelessSession. Also be sure to use scroll() instead of list(). Configuring hibernate.jdbc.fetch_size to something like 200 is also recommended.
On the response side, XML is quite a bad choice because parsing is difficult. If the format is already settled, make sure you use a streaming XML serializer. For example, the XPP3 library contains one.

While a reasonable Java approach would probably involve StAX construction of your XML in conjunction with paginated result sets (straightforward JDBC or JPA), keep in mind that you may need to lock your database against updates for the whole duration, which may or may not be acceptable in your case.
We took a different, database-centric approach using stored procedures and triggers on INSERT and UPDATE to generate the XML node corresponding to each row/[block of] data. This constantly ensures that 250GB+ of raw data and its XML representation (~10 GB) are up-to-date and reduces (no pun intended) the export to a mere concatenation matter.

You can still use Hibernate to fetch millions of records; you just cannot do it in one round, because with that many records you will of course get an out-of-memory error. You can divide the data into pages and dump each page to XML as you go, so that the records are not kept in RAM and your program does not need so much memory.
I have these 2 methods from a previous project that I used very frequently. Unfortunately I did not like to use HQL much, so I don't have the code for that.
Here INT_PAGE_SIZE is the number of rows that you would like to fetch in each round, and getPageCount returns the total number of rounds needed to fetch all of the records.
paging then fetches the records by page, from 1 to getPageCount.
public int getPageCount(Criteria criteria) {
    // Temporarily replace the projection with a row count.
    ProjectionList pl = Projections.projectionList();
    pl.add(Projections.rowCount());
    criteria.setProjection(pl);
    // The count may come back as Integer or Long depending on the Hibernate version.
    int rowCount = ((Number) criteria.list().get(0)).intValue();
    criteria.setProjection(null);
    // Round up so that a partial last page is counted.
    if (rowCount % INT_PAGE_SIZE == 0) {
        return rowCount / INT_PAGE_SIZE;
    }
    return rowCount / INT_PAGE_SIZE + 1;
}

public Criteria paging(Criteria criteria, int page) {
    // Pages are 1-based; -1 means "no paging".
    if (page != -1) {
        criteria.setFirstResult((page - 1) * INT_PAGE_SIZE);
        criteria.setMaxResults(INT_PAGE_SIZE);
    }
    return criteria;
}
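The page arithmetic in the two methods above can be checked on its own, without Hibernate. The following is just that arithmetic restated as a sketch; PageMath and its method names are made up for the example.

```java
final class PageMath {

    /** Number of pages needed for rowCount rows, rounding up. */
    static int pageCount(long rowCount, int pageSize) {
        return (int) ((rowCount + pageSize - 1) / pageSize);
    }

    /** Zero-based offset of the first row on a 1-based page. */
    static int firstResult(int page, int pageSize) {
        return (page - 1) * pageSize;
    }
}
```

With a page size of 200, 2,000,000 rows need 10,000 pages and page 3 starts at offset 400. Offset-based paging like this generally needs a stable ORDER BY so that rows are neither skipped nor repeated between pages.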

Related

JDBC Read without cursor

I have to read a huge amount of data from the database (for example, let's say more than 500 000 records). Then I have to save the data to a file. I have many issues with cursors (not only memory issues).
Is it possible to do it without a cursor, for example using a stream? If so, how can I achieve it?

I have worked with huge data sets (almost 500 million records). I simply used a PreparedStatement query, a ResultSet and, of course, some buffer tweaking through:
setFetchSize(int)
In my case, I split the program into threads because the huge table was partitioned (each thread processed one partition), but I think that this is not your case.
It is pointless to fetch the data through a cursor. I would rather use a database view or an SQL query. Do not use an ORM for this purpose.
According to your comment, your best option is to limit JDBC to fetching a specific number of rows at a time instead of all of them (this lets processing begin sooner and avoids loading the entire table into the ResultSet). Save your data into a collection and write it to the file using a BufferedWriter. You can also take advantage of a multi-core CPU by running in several threads, for example with the first fetched rows handled in one thread and the following rows in a second thread. If you use threads, use synchronized collections and be aware that you may run into ordering problems.
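For the single-threaded case, a sketch of the BufferedWriter part might look like this. It assumes rows have already been formatted as strings and arrive through an iterator; the class name and buffer size are arbitrary choices for the example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.Iterator;

final class RowFileWriter {

    /**
     * Writes one line per row through a buffer and returns the number of
     * lines written. Rows are written as they arrive, never accumulated.
     */
    static long writeLines(Iterator<String> rows, Writer target) throws IOException {
        BufferedWriter out = new BufferedWriter(target, 64 * 1024);
        long written = 0;
        while (rows.hasNext()) {
            out.write(rows.next());
            out.newLine(); // platform line separator
            written++;
        }
        out.flush(); // the caller remains responsible for closing `target`
        return written;
    }
}
```

Keeping a single writer also sidesteps the ordering problem mentioned above, since lines reach the file in the order the rows were fetched.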

Performance Optimization in Java

In Java code I am trying to fetch 3500 rows from a DB (Oracle). It takes almost 15 seconds to load the data. I have also tried storing the result in a cache and retrieving it from there. I am using a simple SELECT statement that displays 8 columns from a single table (no joins), and a List to hold the data from the DB as the source for a Datatable. I have also considered the hardware side, such as RAM capacity, storage and network speed; it exceeds the minimum requirements comfortably. Can you help me make it quicker (it shouldn't take more than 3 seconds)?

Have you set up proper indexes on your tables? I hesitate to ask, since this is a very basic way of optimizing tables for queries and you mention that you have already tried several things. One workaround that works for me: if the purpose of the query is to display the results, the code can be designed so that it displays the initial data immediately while it is still loading more. This implies a separate thread for loading and a separate thread for displaying.

It is most likely that the core problem is that you have one or more of the following:
a poorly designed schema,
a poorly designed query,
a badly overloaded database, and / or
a badly overloaded / underprovisioned network connection between the database and your client.
No amount of changing the client side (Java) code is likely to make a significant difference (i.e. a 5-fold increase) ... unless you are doing something crazy in the way you are building the list, or the bottleneck is in the display code not the retrieval.
You need to use some client-side and server-side performance tools to figure out whether the real bottleneck is the client, the server or the network. Then use those results to decide where to focus your attention.

Result Set to Multi Hash Map

I have a situation here. I have a huge database table with more than 10 columns and millions of rows. I am using a matching algorithm that matches each input record against the values in the database.
The database operation takes a lot of time when there are millions of records to match. I am thinking of using a multi-hash map or some other ResultSet alternative so that I can keep the whole table in memory and avoid hitting the database again.
Can anybody tell me what I should do?

I don't think this is the right way to go. You are trying to do the database's work manually in Java. I'm not saying that you are not capable of doing this, but most databases have been developed over many years and are quite good at doing exactly the thing that you want.
However, databases need to be configured correctly for a given type of query to execute fast. So my suggestion is that you first check whether you can tweak the database configuration to improve the performance of the query. The most common fix is to add the right indexes to your table. Read How MySQL Uses Indexes or the corresponding part of the manual of your particular database for more information.
The other thing is that, with so much data, storing everything in main memory is probably not faster and might even be infeasible, not to mention that you would have to transfer all the data first.
In any case, try to use a profiler to identify the bottleneck of the program first. Maybe the problem is not even on the database side.

best way to store huge data into mysql using java

I am a Java developer. I want to know what is the best way to store huge data into mysql using Java.
Huge: two hundred thousand talk messages every second.
An index is not needed here
Should I store the messages into the database as soon as the user creates them? Will it be too slow?

1 billion writes / day is about 12k / second. Assuming each message is about 16 bytes, that's about 200 KB / sec. If you don't care about reading, you can easily write this to disk at this rate, maybe one message per line. Your read access pattern is probably going to dictate what you end up needing to do here.
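The estimate above is simple enough to reproduce. A tiny sketch, using the same assumed figures (1 billion writes per day, 16 bytes per message) and a made-up class name:

```java
final class WriteRateEstimate {

    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Average writes per second for a given daily volume. */
    static double writesPerSecond(long writesPerDay) {
        return (double) writesPerDay / SECONDS_PER_DAY;
    }

    /** Average payload bytes per second, ignoring any storage overhead. */
    static double bytesPerSecond(long writesPerDay, int bytesPerMessage) {
        return writesPerSecond(writesPerDay) * bytesPerMessage;
    }
}
```

For 1,000,000,000 writes per day this gives roughly 11,574 writes per second and, at 16 bytes each, roughly 185,000 bytes per second, which the answer rounds to 12k and 200 KB. These are averages; peak rates would be higher.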
If you use MySQL, I'd suggest combining multiple messages per row, if possible. Partitioning the table would be helpful to keep the working set in memory, and you'll want to commit a number of records per transaction, maybe 1000 rows. You'll need to do some testing and tuning, and this page will be helpful:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
You should probably also look at Cassandra which is written with heavy write workloads in mind.
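The "commit a number of records per transaction" advice might be sketched with JDBC batching as follows. The table and column names are hypothetical, and the batch size of about 1000 is the answer's suggestion rather than a measured optimum.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class BatchInsertSketch {

    /**
     * Inserts the messages in batches of batchSize rows, committing once per
     * batch, and returns the number of commits performed.
     */
    static int insertInBatches(Connection conn, List<String> messages, int batchSize)
            throws SQLException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        // Hypothetical table and column names.
        String sql = "INSERT INTO chat_message (body) VALUES (?)";
        conn.setAutoCommit(false);
        int commits = 0;
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (String message : messages) {
                ps.setString(1, message);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    conn.commit();
                    commits++;
                    pending = 0;
                }
            }
            if (pending > 0) { // flush the final partial batch
                ps.executeBatch();
                conn.commit();
                commits++;
            }
        }
        return commits;
    }
}
```

The sketch leaves out error handling; a real version would roll back the current batch on failure and decide whether to retry or skip it.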

My suggestion is also MongoDB. Since NoSQL paradigm fits your needs perfectly.
Below is a flavor of MongoDB in Java -
BasicDBObject document = new BasicDBObject();
document.put("database", "mkyongDB");
document.put("table", "hosting");
BasicDBObject documentDetail = new BasicDBObject();
documentDetail.put("records", "99");
documentDetail.put("index", "vps_index1");
documentDetail.put("active", "true");
document.put("detail", documentDetail);
collection.insert(document);
This tutorial is a good way to get started. You can download MongoDB from GitHub.
For optimization of MongoDB, please refer to this post.

Do you absolutely have to use MySQL, or are you open to other DBs as well? MongoDB or CouchDB would be a good fit for this kind of need. Check them out if you are open to other DB options.
If you absolutely have to go with MySQL: we have done something similar, where all the related text messages go into a child record as a single JSON document. We append to it every time, and we keep the master in a separate table. So there is one master and one child record at the minimum, and more child records as the messages go beyond a certain number (30 in our scenario). We implemented a kind of "load more..." feature that queries the second child record, which holds 30 more.
Hope this helps.
FYI, we are migrating to CouchDB for some other reasons and needs.

There are at least 2 different parts to this problem:
Processing the messages for storage in the database
What type of storage to use for the message
For processing the messages, you're likely going to need a horizontally scalable system (meaning you can add more machines to process the messages quickly) so you don't accumulate a huge backlog of messages. You should definitely not try to write these messages synchronously, but rather when a message is received, put it on a queue to be processed for writing to the database (something like JMS comes to mind here).
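To make the "put it on a queue" idea concrete, here is an in-process sketch using a bounded BlockingQueue and a single background thread that drains it in batches. It stands in for a real message broker such as JMS, which would add durability and multi-machine scaling; the class name, the sink callback and the 50 ms poll interval are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class QueuedWriterSketch {
    private final BlockingQueue<String> queue;
    private final Thread worker;
    private volatile boolean running = true;

    /** The sink receives batches of up to maxBatch messages, e.g. to store them. */
    QueuedWriterSketch(int capacity, int maxBatch, Consumer<List<String>> sink) {
        BlockingQueue<String> q = new LinkedBlockingQueue<>(capacity);
        this.queue = q;
        this.worker = new Thread(() -> {
            List<String> batch = new ArrayList<>(maxBatch);
            try {
                // Keep going until shutdown was requested and the queue is empty.
                while (running || !q.isEmpty()) {
                    String first = q.poll(50, TimeUnit.MILLISECONDS);
                    if (first == null) {
                        continue;
                    }
                    batch.add(first);
                    q.drainTo(batch, maxBatch - 1); // take whatever else is waiting
                    sink.accept(new ArrayList<>(batch));
                    batch.clear();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "message-writer");
        this.worker.start();
    }

    /** Non-blocking: returns false if the queue is full, so the caller can back off. */
    boolean submit(String message) {
        return queue.offer(message);
    }

    /** Stops accepting work by convention and waits for queued messages to be flushed. */
    void shutdown() throws InterruptedException {
        running = false;
        worker.join();
    }
}
```

The request thread only pays for an enqueue, while the sink can write each batch in one transaction. Callers should stop submitting before calling shutdown, since a message offered concurrently with shutdown may be left unprocessed in this sketch.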
In terms of data storage, MySQL is a relational database, but it doesn't sound like you are really doing any relational data processing, rather just storing a large amount of data. I would suggest looking into a NoSQL database (as others have suggested here as well) such as MongoDB, Cassandra, CouchDB, etc. They each have their strengths and weaknesses (you can read more about each of them on their respective websites and elsewhere on the internet).

I guess, typical access would involve retrieving all text of one chat session at least.
The number of rows is large and your data is not so much relational. This is a good fit for Non-Relational database.
If you still want to go with MySQL, use partitions. When writing, use batch inserts, and when reading, provide sufficient partition-pruning hints in your queries. Use EXPLAIN PARTITIONS to check whether partitions are being pruned. In this case I would strongly recommend that you combine the chat lines of one chat session into a single row. This will dramatically reduce the number of rows compared to one chat line per row.
You didn't mention how many days of data you want to store.
On a separate note: How successful would your app have to be in terms of users to require 200k messages per second? An active chat session may generate about 1 message every 5 seconds from one user. For ease of calculation lets make it 1 second. So you are building capacity for 200K online users. Which implies you would at least have a few million users.
It is good to think of scale early. However, it requires engineering effort. And since resources are limited, allocate them carefully for each task (Performance/UX etc). Spending more time on UX, for example, may yield a better ROI. When you get to multi-million user territory, new doors will open. You might be funded by an Angel or VC. Think of it as a good problem to have.
My 2 cents.

Is it a good idea to process a large amount of data directly in the database?

I have a database with a lot of web pages stored.
I will need to process all the data I have, so I have two options: retrieve the data into the program, or process it directly in the database with some functions I will create.
What I want to know is:
is doing some processing in the database, rather than in the application, a good idea?
when is this recommended and when not?
are there pros and cons?
is it possible to extend the language with new features (external APIs/libraries)?
I tried retrieving the content into the application (it worked), but it was too slow and dirty. My concern was that I can't do in the database what I can do in Java, but I don't know if this is true.
ONLY an example: I have a table called Token. At the moment, it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know whether a word between two tokens classified as 'Proper Name' is part of a name or not.
I will need to process all the data. In this case, is doing it directly in the database better than retrieving it into the application?

My preoccupation was that can't do in the database what can I do in
Java, but I don't know if this is true.
No, that is not a correct assumption. There are valid circumstances for using the database to process data. For example, if the work involves calling a lot of disparate SQL statements that can be combined in a stored procedure, then you should do the processing in the stored procedure and call it from your Java application. This way you avoid making several network round trips to the database server.
I do not know what you are processing, though. Are you parsing XML data stored in your database? Then perhaps you should use XQuery, which many modern databases support.
ONLY an example: I have a table called Token. At the moment, it has
180,000 rows, but this will increase to over 10 million rows. I need
to do some processing to know if a word between two token classified
as `Proper Name´ is part of name or not.
Is there some indicator in the data that tells you it's a proper name? Fetching 10 million rows (highly susceptible to OutOfMemoryError) and then going through them is not a good idea. If there are certain properties of the data that can be put in the WHERE clause of an SQL query to limit the amount of data being fetched, that is the way to go in my opinion. You will certainly need to run explains on your SQL and check that the correct indices are in place, the index cluster ratio, the type of index; all of that will make a difference. If you can't fully eliminate all "improper names", then you should try to get rid of as many as you can with SQL and process the rest in your application. I am assuming this is a batch application, right? If it is a web application, then you definitely want to create a batch application to stage the data before the web application queries it.
I hope my explanation makes sense. Please let me know if you have questions.

Interacting directly with the DB for every single thing is tedious and hurts performance. There are several ways around this: you can use indexing, caching, or tools such as Hibernate, which can cache data in memory so that you don't need to query the DB for every operation. There are also popular tools such as luceneIndexer that could solve your problem of hitting the DB every time.
