Java handling large amounts of data

I have a Java application that needs to display large amounts of data (on the order of 1 million data points). The data doesn't all need to be displayed at the same time but rather only when requested by a user. The app is a desktop app that is not running with an app server or hitting any centralized database.
My thought was to run a database on the machine and load the data in there. The DB will be read only most of the time, so I should be able to index to help optimize queries. If I'm running on a local system, I'm not sure if I should try and implement some caching (I'm not sure how fast the queries will run, I'm currently working on them).
Is this a logical way to approach the problem, or are there better approaches?
Thanks,
Jeff

Display and data are two different things.
You don't give any details about either, but it could be possible to generate the display in the background, bringing in the data one slice at a time, and then displaying when it's ready. Lots of anything could cause memory issues, so you'll need to be careful. The database will help persist things, but it won't help you get ten pounds of data into your five pound memory bag.
UPDATE: If individuals are only reading a few points at a time, and display isn't an issue, then I'd say that any database will be able to handle it if you index the table appropriately. One million rows isn't a lot for a capable database.

An embedded DB seems reasonable. Check out JavaDB/Derby, H2, or HSQLDB.
SQLite with a Java wrapper is fine too.

It really depends on your data. Do multiple instances request the data? If not, it is definitely worth looking at a simple SQLite database for storage. It is just a single file on your file system. No need to set up a server.

Well, it depends on the data size. 1 million integers, for example, isn't that much, but 1 million data structures/classes of, let's say, 1000 bytes each is a lot.
For small data: keep it in memory.
For large data: I think using the DB would be good.
Just my opinion :)
edit:
Of course it depends also on the speed you want to achieve. If you really need high speed and the data is big you could also cache some of them in memory and leave the rest in the db.
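The idea of caching part of the data in memory and leaving the rest in the DB could be sketched roughly as below. This is only an illustration, not something from the answer above: the class name LruCache and the loader function (which stands in for a real database query) are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Small least-recently-used cache in front of a slower lookup such as a DB query. */
class LruCache<K, V> {
    private final int capacity;
    private final Function<K, V> loader; // hypothetical: would run the actual query
    private final LinkedHashMap<K, V> entries;

    LruCache(int capacity, Function<K, V> loader) {
        this.capacity = capacity;
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry at the head.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // size() here is the map's own size, checked after each put.
                return size() > LruCache.this.capacity;
            }
        };
    }

    /** Returns the cached value, loading (and caching) it on a miss. */
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key); // a null result is simply reloaded next time
            entries.put(key, value);
        }
        return value;
    }

    synchronized boolean isCached(K key) {
        return entries.containsKey(key); // does not change the access order
    }

    synchronized int cachedCount() {
        return entries.size();
    }
}
```

The capacity bounds memory use, so the hot rows stay in memory while everything else is fetched from the database on demand.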

Related

Performance Optimization in Java

In Java code, I am trying to fetch 3,500 rows from an Oracle DB. It takes almost 15 seconds to load the data. I have also tried storing the result in a cache and retrieving it from there. I am using a simple SELECT statement and displaying 8 columns from a single table (no joins). I use a List to hold the data from the DB and use it as the source for a Datatable. I have also considered the hardware side (RAM capacity, storage, network speed, etc.), and it exceeds the minimum requirements comfortably. Can you help me make it quicker? It shouldn't take more than 3 seconds.

Have you implemented proper indexing on your tables? I hesitate to ask, since this is a very basic way of optimizing tables for queries and you mention that you have already tried several things. One workaround that works for me: if the purpose of the query is to display the results, the code can be designed so that it displays the initial data immediately while it is still loading more. This means using one thread for loading and a separate thread for displaying.
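A rough sketch of the "load in one thread, display in another" idea follows. The names ChunkedLoader and PageSource are made up for illustration; PageSource stands in for whatever query fetches one page of rows, and the onPage callback is where the display would be updated (in a Swing application, typically by handing the page to the event dispatch thread).

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

class ChunkedLoader {
    /** Hypothetical data source: returns up to 'limit' rows starting at 'offset'. */
    interface PageSource<T> {
        List<T> fetch(int offset, int limit);
    }

    /**
     * Loads pages on a background thread and hands each one to onPage as soon
     * as it arrives. The returned future completes with the total row count.
     */
    static <T> CompletableFuture<Integer> loadInBackground(
            PageSource<T> source, int pageSize, Consumer<List<T>> onPage) {
        return CompletableFuture.supplyAsync(() -> {
            int offset = 0;
            while (true) {
                List<T> page = source.fetch(offset, pageSize);
                if (page.isEmpty()) {
                    break; // nothing left to load
                }
                onPage.accept(page); // the first page can be shown right away
                offset += page.size();
                if (page.size() < pageSize) {
                    break; // a short page means this was the last one
                }
            }
            return offset;
        });
    }
}
```

The user sees the first page after one small query instead of waiting for the whole result set.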

Most likely, the core problem is one or more of the following:
a poorly designed schema,
a poorly designed query,
a badly overloaded database, and / or
a badly overloaded / underprovisioned network connection between the database and your client.
No amount of changing the client side (Java) code is likely to make a significant difference (i.e. a 5-fold increase) ... unless you are doing something crazy in the way you are building the list, or the bottleneck is in the display code not the retrieval.
You need to use some client-side and server-side performance tools to figure out whether the real bottleneck is the client, the server or the network. Then use those results to decide where to focus your attention.
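As a very small client-side starting point, the time spent in each phase (query, list building, display) can be measured separately. The sketch below is an assumption about how one might do that; PhaseTimer is a made-up helper, not a standard tool, and it says nothing about the server side, where the database's own tooling (for example, query execution plans) is needed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Accumulates wall-clock time per named phase so the slowest one stands out. */
class PhaseTimer {
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Runs the work, records how long it took under 'phase', and returns its result. */
    <T> T time(String phase, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    long millis(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L) / 1_000_000;
    }

    /** Name of the phase with the largest accumulated time, or null if none was timed. */
    String slowestPhase() {
        String slowest = null;
        long max = -1;
        for (Map.Entry<String, Long> e : nanosByPhase.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                slowest = e.getKey();
            }
        }
        return slowest;
    }
}
```

Wrapping the query, the list construction, and the table rendering in separate time(...) calls shows whether the 15 seconds are spent waiting for the database or in client code.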

H2 performance recommendations

I'm currently working with a somewhat larger database, and though I have no specific issues, I would like some recommendations, if anyone has any.
The database is 2.2 gigabyte (after recreation/compacting). It contains about 50 tables. One of those tables contains a blob plus some metadata. It currently has about 22000 rows. If I remove the blobs from the table (UPDATE table SET blob = null), the database size is reduced to about 200 megabyte (after recreation/compacting). The metadata is accessed a lot, the blobs however are not that often needed.
The database URL I currently use is:
jdbc:h2:D:/data;AUTO_SERVER=true;MVCC=true;CACHE_SIZE=524288
It runs in our Java VM which has 4GB max heap.
Some things I was wondering:
Would running H2 in a separate process have any impact on performance (for better or for worse)?
Would it help to have the blobs in a separate table with a 1-1 relation to the metadata? I could imagine it would help with the caching, not having the blobs in the way?
The internet seems divided on whether to include blobs in a database or write them to files on a filesystem with a link in the DB. Any H2-specific advice here?

The answer depends on the growth rate of your blob data. If, for example, your data set is going to grow at 10% per week, then there is little point in trying to extend the use of H2 to store blob data (as it will quickly outpace the available heap memory). If instead the blob data is the biggest it will ever be, then attempting to use H2 might make sense.
To answer your questions about H2:
1) Running H2 in a separate process will allow H2 to claim the majority of the heap space, which makes controlling the heap available to H2 much more manageable. However, you'll also be adding the overhead of having a separate process to maintain and monitor. So the answer is "it depends on your operating environment and goals". If you have the people and time, running H2 in a separate process might make sense. But if that's true, then you should probably consider just running an appropriate blob storage platform instead.
2) Yes, you're correct that storing the blobs in a separate table would help with caching - in the case that you don't often need the blobs. It should also help with retrieval times, as H2 won't have to read past the blobs to find the metadata.
3) Note that "the internet" represents many thousands of people with almost as many different specific use cases. You'll need to filter down your use case into requirements, and then apply the logic you glean from others.
4) My personal advice is, if you're trying to make a scalable and maintainable platform - use the right tools. H2, or any other relational database, is most often not the right tool for storing many large blobs. I'd recommend that you investigate using a key/value store.

best way to store huge data into mysql using java

I am a Java developer. I want to know what is the best way to store huge data into mysql using Java.
Huge: two hundred thousand talk messages every second.
An index is not needed here
Should I store the messages into the database as soon as the user creates them? Will it be too slow?

1 billion writes / day is about 12k / second. Assuming each message is about 16 bytes, that's about 200 KB / sec. If you don't care about reading, you can easily write this to disk at this rate, maybe one message per line. Your read access pattern is probably going to dictate what you end up needing to do here.
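The arithmetic behind those round numbers can be spelled out as follows. The class and method names are made up for illustration, and the 16-byte message size is the assumption stated in the answer.

```java
class WriteRate {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Average messages per second for a given daily volume (rounded down). */
    static long messagesPerSecond(long messagesPerDay) {
        return messagesPerDay / SECONDS_PER_DAY;
    }

    /** Average payload bytes per second, ignoring any per-row storage overhead. */
    static long bytesPerSecond(long messagesPerDay, int bytesPerMessage) {
        return messagesPerSecond(messagesPerDay) * bytesPerMessage;
    }
}
```

For 1,000,000,000 messages per day this gives 11,574 messages per second and, at 16 bytes each, 185,184 bytes per second, which the answer rounds to 12k and 200 KB respectively.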
If you use MySQL, I'd suggest combining multiple messages per row, if possible. Partitioning the table would be helpful to keep the working set in memory, and you'll want to commit a number of records per transaction, maybe 1000 rows. You'll need to do some testing and tuning, and this page will be helpful:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
You should probably also look at Cassandra which is written with heavy write workloads in mind.
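Committing a number of records per transaction, as suggested above, might look roughly like this with plain JDBC. It is a sketch under assumptions: the table name messages and column body are hypothetical, and the right batch size (1000 here, as in the answer) has to be found by testing.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class BatchInserter {
    // Hypothetical table and column names, for illustration only.
    static final String INSERT_SQL = "INSERT INTO messages (body) VALUES (?)";

    /**
     * Inserts all messages using JDBC batching and commits once per
     * 'batchSize' rows. Returns the number of commits performed.
     */
    static int insertAll(Connection conn, List<String> messages, int batchSize)
            throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // otherwise every row is its own transaction
        int commits = 0;
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            int pending = 0;
            for (String message : messages) {
                ps.setString(1, message);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    conn.commit();
                    commits++;
                    pending = 0;
                }
            }
            if (pending > 0) { // flush the final, partially filled batch
                ps.executeBatch();
                conn.commit();
                commits++;
            }
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
        return commits;
    }
}
```

With 2,500 messages and a batch size of 1000 this performs three commits instead of 2,500.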

My suggestion is also MongoDB, since the NoSQL paradigm fits your needs perfectly.
Below is a taste of MongoDB in Java:
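// Uses the legacy MongoDB Java driver (com.mongodb.BasicDBObject).
// "collection" is assumed to be an existing com.mongodb.DBCollection.
BasicDBObject document = new BasicDBObject();
document.put("database", "mkyongDB");
document.put("table", "hosting");
// Nested sub-document stored under the "detail" key.
BasicDBObject documentDetail = new BasicDBObject();
documentDetail.put("records", "99");
documentDetail.put("index", "vps_index1");
documentDetail.put("active", "true");
document.put("detail", documentDetail);
collection.insert(document);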
This tutorial is a good way to get started. You can download MongoDB from GitHub.
For MongoDB optimization, please refer to this post.

Do you absolutely have to use MySQL, or are you open to other DBs as well? MongoDB or CouchDB would be a good fit for these kinds of needs. Check them out if you are open to other DB options.
If you absolutely have to go with MySQL: we have done something similar, where all the related text messages go into a child record as a single JSON document. We append to it every time, and we keep the master in a separate table. So there is one master and one child record at minimum, with more child records added once the messages go beyond a certain number (30 in our scenario). We implemented a kind of "load more..." feature that queries the second child record, which holds the next 30 messages.
Hope this helps.
FYI, we are migrating to CouchDB for some other reasons and needs.

There are at least 2 different parts to this problem:
Processing the messages for storage in the database
What type of storage to use for the message
For processing the messages, you're likely going to need a horizontally scalable system (meaning you can add more machines to process the messages quickly) so you don't accumulate a huge backlog of messages. You should definitely not try to write these messages synchronously, but rather when a message is received, put it on a queue to be processed for writing to the database (something like JMS comes to mind here).
In terms of data storage, MySQL is a relational database, but it doesn't sound like you are really doing any relational data processing, rather just storing a large amount of data. I would suggest looking into a NoSQL database (as others have suggested here as well) such as MongoDB, Cassandra, CouchDB, etc. They each have their strengths and weaknesses (you can read more about each of them on their respective websites and elsewhere on the internet).
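The "put it on a queue and write asynchronously" idea can be illustrated inside a single JVM as below. This is a simplified, in-process stand-in for a real message queue such as JMS; the class name QueuedWriter is made up, and batchWriter stands in for whatever actually writes a batch to the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Accepts messages without blocking the caller; a worker thread writes them in batches. */
class QueuedWriter implements AutoCloseable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread worker;
    private volatile boolean running = true;

    QueuedWriter(int maxBatch, Consumer<List<String>> batchWriter) {
        worker = new Thread(() -> {
            List<String> batch = new ArrayList<>();
            try {
                // Keep going until close() was called and the queue is drained.
                while (running || !queue.isEmpty()) {
                    String first = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (first == null) {
                        continue; // nothing arrived; re-check the running flag
                    }
                    batch.add(first);
                    queue.drainTo(batch, maxBatch - 1); // take whatever else is waiting
                    batchWriter.accept(new ArrayList<>(batch));
                    batch.clear();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "message-writer");
        worker.start();
    }

    /** Called by request threads; returns immediately. */
    void submit(String message) {
        queue.add(message);
    }

    /** Stops accepting work and waits until every queued message has been written. */
    @Override
    public void close() throws InterruptedException {
        running = false;
        worker.join();
    }
}
```

The queue here is unbounded for simplicity; a bounded queue would apply back-pressure if the writer cannot keep up, and a durable external queue would be needed to survive a crash.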

I guess typical access would involve retrieving at least all the text of one chat session.
The number of rows is large and your data is not very relational. This is a good fit for a non-relational database.
If you still want to go with MySQL, use partitions. When writing, use batch inserts, and when reading, provide sufficient partition pruning hints in your queries. Use EXPLAIN PARTITIONS to check whether partitions are being pruned. In this case I would strongly recommend that you combine the chat lines of one chat session into a single row. This will dramatically reduce the number of rows compared to one chat line per row.
You didn't mention how many days of data you want to store.
On a separate note: how successful would your app have to be in terms of users to require 200k messages per second? An active chat session may generate about 1 message every 5 seconds from one user. For ease of calculation, let's make it 1 second. So you are building capacity for 200K online users, which implies you would have at least a few million users.
It is good to think of scale early. However, it requires engineering effort. And since resources are limited, allocate them carefully for each task (Performance/UX etc). Spending more time on UX, for example, may yield a better ROI. When you get to multi-million user territory, new doors will open. You might be funded by an Angel or VC. Think of it as a good problem to have.
My 2 cents.

Is it a good idea to process a large amount of data directly in the database?

I have a database with a lot of web pages stored.
I will need to process all the data I have so I have two options: recover the data to the program or process directly in database with some functions I will create.
What I want to know is:
is doing some processing in the database, and not in the application, a good idea?
when is this recommended and when not?
are there pros and cons?
is it possible to extend the language with new features (external APIs/libraries)?
I tried retrieving the content to the application (it worked), but it was too slow and dirty. My concern was that I can't do in the database what I can do in Java, but I don't know if this is true.
ONLY an example: I have a table called Token. At the moment, it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know if a word between two tokens classified as 'Proper Name' is part of a name or not.
I will need to process all the data. In this case, is doing it directly in the database better than retrieving it to the application?

My concern was that I can't do in the database what I can do in
Java, but I don't know if this is true.
No, that is not a correct assumption. There are valid circumstances for using the database to process data. For example, if it involves calling a lot of disparate SQL statements that can be combined in a stored procedure, then you should do the processing in the stored procedure and call the stored proc from your Java application. This way you avoid making several network trips to the database server.
I do not know what you are processing, though. Are you parsing XML data stored in your database? Then perhaps you should use XQuery, which many modern databases support.
ONLY an example: I have a table called Token. At the moment, it has
180,000 rows, but this will increase to over 10 million rows. I need
to do some processing to know if a word between two tokens classified
as 'Proper Name' is part of a name or not.
Is there some indicator in the data that tells you it's a proper name? Fetching 10 million rows (highly susceptible to an OutOfMemoryError) and then going through them is not a good idea. If there are certain parameters about the data that can be put in the WHERE clause of a SQL query to limit the amount of data being fetched, that is the way to go in my opinion. You will certainly need to run EXPLAIN on your SQL, check that the correct indices are in place, and check the index cluster ratio and type of index; all of that will make a difference. Now, if you can't fully eliminate all "improper names", then you should try to get rid of as many as you can with SQL and then process the rest in your application. I am assuming this is a batch application, right? If it is a web application, then you definitely want to create a batch application to stage the data for you before the web application queries it.
I hope my explanation makes sense. Please let me know if you have questions.

Interacting directly with the DB for every single operation is tedious and hurts performance. There are several ways around this: you can use indexing, caching, or tools such as Hibernate, which can cache data in memory so that you don't need to query the DB for every operation. There are also indexing tools such as Lucene, which are very popular and could solve your problem of hitting the DB every time.

JDBC: is there a way to get the size of committed data in bytes

If I have a Java application that performs some inserts against a database, is there an easy way to get how many bytes were committed (i.e. the summed size of all the data in all the fields), without having to calculate it manually or fetch and check the size of the result set?
--
As lucho points out, implementing statistics-aware statement class on top of the PreparedStatement might be the way to go. Going to stick with that and see how well this is going to work.
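One possible shape for such a statistics-aware statement is sketched below. This is only a guess at the idea, not lucho's actual solution: it wraps a PreparedStatement in a dynamic proxy and adds an estimated size for each bound parameter to a counter. The class name ByteCountingStatements and the size estimates are assumptions, and the total reflects the bytes bound by the client, not what the database stores on disk or what was actually committed.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.nio.charset.StandardCharsets;
import java.sql.PreparedStatement;
import java.util.concurrent.atomic.AtomicLong;

class ByteCountingStatements {
    /** Returns a PreparedStatement that adds the size of each bound parameter to 'counter'. */
    static PreparedStatement wrap(PreparedStatement target, AtomicLong counter) {
        return (PreparedStatement) Proxy.newProxyInstance(
                ByteCountingStatements.class.getClassLoader(),
                new Class<?>[] { PreparedStatement.class },
                (proxy, method, args) -> {
                    String name = method.getName();
                    // Two-argument setters such as setString(index, value) bind a parameter.
                    // setNull(index, sqlType) is skipped: its second argument is a type code.
                    if (name.startsWith("set") && !name.equals("setNull")
                            && args != null && args.length == 2) {
                        counter.addAndGet(estimateSize(args[1]));
                    }
                    try {
                        return method.invoke(target, args);
                    } catch (InvocationTargetException e) {
                        throw e.getCause(); // surface the driver's own exception
                    }
                });
    }

    /** Rough client-side size estimate; unknown types count as 0 in this sketch. */
    static long estimateSize(Object value) {
        if (value == null) return 0;
        if (value instanceof String) {
            return ((String) value).getBytes(StandardCharsets.UTF_8).length;
        }
        if (value instanceof byte[]) return ((byte[]) value).length;
        if (value instanceof Integer || value instanceof Float) return 4;
        if (value instanceof Long || value instanceof Double) return 8;
        if (value instanceof Short) return 2;
        if (value instanceof Byte || value instanceof Boolean) return 1;
        return 0;
    }
}
```

Binding "hello", an int, a 10-byte array, and a long through the wrapper would add 5 + 4 + 10 + 8 = 27 to the counter.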

As far as I know, nope.
You'll have to ask your database that question; perhaps it's possible to do it without querying the same thing you inserted (because that sounds a bit pointless).

Interesting problem. I like lucho's solution, but I have two quicker (hackier) options:
You can try to use InnoDB's SHOW TABLE STATUS and keep a running log of the data size. That would let you know, but on my development machine calling it on one database takes 5.3s (56 tables) so unless you only want the data for one or two tables it's probably too slow (not to mention whatever locking it may incur).
You could monitor the DB process and use the OS to tell you how much it's writing. I know Windows can tell you this, and I'm pretty sure Linux can as well. But if you host 3 databases you'll only get the total, and it will be off some due to transactions and such.
Just random ideas.
