Processing a big list from DB in Java

I have a big list of over 20000 items to fetch from the DB and process daily in a simple console-based Java app.
What is the best way to do that? Should I fetch the list in small sets and process them, or should I fetch the complete list into an array and process it? Keeping it all in an array means a huge memory requirement.
Note: there is only one column to process.
Processing means I have to pass the string in that column somewhere else as a SOAP request.
The 20000 items are strings of length 15.

It depends. 20000 is not really a big number. If you are only processing 20000 short strings or numbers, the memory requirement isn't that large. But if it's 20000 images, that is a bit larger.
There's always a tradeoff. Multiple chunks of data means multiple trips to the database, but a single trip means more memory. Which is more important to you? Also, can your data be chunked, or do you need, for example, record 1 in order to process record 1000?
These are all things to consider. Hopefully they help you come to the design that is best for you.
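One middle ground, sketched here under assumed table and column names (items, payload), is to make a single trip but stream the result set row by row so the whole list never sits in memory at once. With MySQL Connector/J, setting the fetch size to Integer.MIN_VALUE on a forward-only, read-only statement enables row-by-row streaming:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
public class StreamingFetch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/app";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT payload FROM items",                  // hypothetical table and column
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            ps.setFetchSize(Integer.MIN_VALUE);                    // MySQL driver hint: stream rows one at a time
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    sendSoapRequest(rs.getString("payload"));      // process each short string as it arrives
                }
            }
        }
    }
    private static void sendSoapRequest(String payload) {
        // the actual SOAP call is out of scope for this sketch
    }
}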

Correct me if I am wrong, but I would fetch it little by little, and also provide a rollback operation for it.

If the job can be done at the database level, I would do it using SQL scripts. If that is impossible, I recommend loading small pieces of your data with two columns: the ID column and the column which needs to be processed.
This will give you better performance during the process, and if anything crashes you will not lose all the processed data. In a crash case, though, you will need to know which rows have been processed and which have not; this can be done with a third column or by saving the last processed ID after each round.
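A minimal sketch of that approach, with invented table, column, and checkpoint names: fetch a chunk at a time keyed on the last processed ID, and persist the checkpoint after each chunk so a crash can resume where it left off.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
public class ChunkedProcessor {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/app", "user", "password")) {
            long lastProcessedId = loadCheckpoint(conn);   // e.g. read from a one-row checkpoint table
            int chunkSize = 500;
            while (true) {
                boolean gotRows = false;
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, payload FROM items WHERE id > ? ORDER BY id LIMIT ?")) {
                    ps.setLong(1, lastProcessedId);
                    ps.setInt(2, chunkSize);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            gotRows = true;
                            sendSoapRequest(rs.getString("payload"));
                            lastProcessedId = rs.getLong("id");
                        }
                    }
                }
                saveCheckpoint(conn, lastProcessedId);     // survive a crash between chunks
                if (!gotRows) {
                    break;                                 // no more rows to process
                }
            }
        }
    }
    private static long loadCheckpoint(Connection conn) { return 0L; }   // stubbed for the sketch
    private static void saveCheckpoint(Connection conn, long id) { }     // stubbed for the sketch
    private static void sendSoapRequest(String payload) { }              // actual SOAP call omitted
}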

Related

Trying bulk/ingest "large" amount of documents SQL Db to Elasticsearch

Hi, I need to read multiple tables from my databases and join the tables. Once the tables are joined, I would like to push them to Elasticsearch.
The tables are joined by an external process, as the data can come from multiple sources. This is not an issue; in fact I have 3 separate processes reading 3 separate tables at an average of 30,000 records per second. The records are joined into a multimap, from which a single JsonDocument is produced for each key.
Then a separate process reads the denormalized JsonDocuments and bulks them to Elasticsearch at an average of 3000 documents per second.
I'm having trouble finding a way to split the work. I'm pretty sure my Elasticsearch cluster can handle more than 3000 documents per second. I was thinking of somehow splitting the multimap that holds the joined JSON docs.
Anyway, I'm building a custom application for this. So I was wondering, are there any tools that can be put together to do all this? Some form of ETL, or stream processing, or something?
While streaming would make records more readily available than bulk processing, and would reduce the overhead in the Java container regarding large object management, you can take a hit on latency. Usually in these kinds of scenarios you have to find an optimum for the bulk size. For this I follow these steps:
1) Build a streaming bulk insert (so stream, but still send more than 1 record, or in your case build more than 1 JSON document, at a time).
2) Experiment with several bulk sizes, for example 10, 100, 1000 and 10000, and plot them in a quick graph. Run a sufficient number of records to see whether performance degrades over time: it may be that 10 is extremely fast per record, but that there is an incremental insert overhead (for example, primary key maintenance in the case of SQL Server). If you run the same total number of records for every test, it should be representative of your performance.
3) Interpolate in your graph, and maybe try out 3 values between your best values of run 2.
Then use the final result as your optimal stream bulk insertion size.
Once you have this value, you can add one more step:
Run multiple processes in parallel. This fills the gaps in your process a bit. Watch the throughput and adjust your bulk sizes maybe one more time.
This approach once helped me speed up a multi-TB import process from 2 days to about 12 hours, so it can work out quite well.
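As a rough illustration of step 2, here is a minimal timing harness; the sendBulk consumer is only a placeholder, not a real Elasticsearch client call, so plug in your actual bulk request and plot the printed throughput for each bulk size.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
public class BulkSizeExperiment {
    public static void main(String[] args) {
        int totalDocs = 100_000;                       // same total for every run, so results stay comparable
        int[] bulkSizes = {10, 100, 1_000, 10_000};    // candidate bulk sizes to plot
        Consumer<List<String>> sendBulk = batch -> { /* one bulk request to Elasticsearch would go here */ };
        for (int bulkSize : bulkSizes) {
            long start = System.nanoTime();
            List<String> batch = new ArrayList<>(bulkSize);
            for (int i = 0; i < totalDocs; i++) {
                batch.add("{\"id\":" + i + "}");       // stand-in for the joined JsonDocument
                if (batch.size() == bulkSize) {
                    sendBulk.accept(batch);
                    batch = new ArrayList<>(bulkSize);
                }
            }
            if (!batch.isEmpty()) {
                sendBulk.accept(batch);                // flush the final partial batch
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("bulk size %d: %.0f docs/sec%n", bulkSize, totalDocs / seconds);
        }
    }
}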

best way to store huge data into mysql using java

I am a Java developer. I want to know what is the best way to store huge data into mysql using Java.
Huge: two hundred thousand talk messages every second.
An index is not needed here.
Should I store the messages into the database as soon as the user creates them? Will it be too slow?
1 billion writes / day is about 12k / second. Assuming each message is about 16 bytes, that's about 200k / sec. If you don't care about reading, you can easily write this to disk at this rate, maybe one message per line. Your read access pattern is probably going to dictate what you end up needing to do here.
If you use MySQL, I'd suggest combining multiple messages per row, if possible. Partitioning the table would be helpful to keep the working set in memory, and you'll want to commit a number of records per transaction, maybe 1000 rows. You'll need to do some testing and tuning, and this page will be helpful:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
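For illustration, here is a minimal JDBC batch-insert sketch along those lines; the messages table and its columns are invented, and rewriteBatchedStatements is a Connector/J option worth testing since it collapses batches into multi-row INSERTs.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
public class BatchInsertSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/chat?rewriteBatchedStatements=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            conn.setAutoCommit(false);  // commit once per batch, not once per row
            String sql = "INSERT INTO messages (user_id, body) VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int batchSize = 1000;
                for (int i = 0; i < 100_000; i++) {
                    ps.setLong(1, i % 500);
                    ps.setString(2, "message " + i);
                    ps.addBatch();
                    if ((i + 1) % batchSize == 0) {
                        ps.executeBatch();
                        conn.commit();   // roughly 1000 rows per transaction
                    }
                }
                ps.executeBatch();
                conn.commit();           // flush the remainder
            }
        }
    }
}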
You should probably also look at Cassandra which is written with heavy write workloads in mind.
My suggestion is also MongoDB, since the NoSQL paradigm fits your needs perfectly.
Below is a flavor of MongoDB in Java:
// Legacy MongoDB Java driver style, matching BasicDBObject and DBCollection
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
MongoClient mongo = new MongoClient("localhost", 27017);
DB db = mongo.getDB("mkyongDB");
DBCollection collection = db.getCollection("hosting");
BasicDBObject document = new BasicDBObject();
document.put("database", "mkyongDB");
document.put("table", "hosting");
BasicDBObject documentDetail = new BasicDBObject();
documentDetail.put("records", "99");
documentDetail.put("index", "vps_index1");
documentDetail.put("active", "true");
document.put("detail", documentDetail);
collection.insert(document);
This tutorial is good for getting started. You can download MongoDB from GitHub.
For optimization of MongoDB, please refer to this post.
Do you absolutely have to use MySQL, or are you open to other DBs as well? MongoDB or CouchDB would be a good fit for this kind of need. Check them out if you are open to other DB options.
If you absolutely have to go with MySQL, then we have done something similar: all the related text messages go in a child record as a single JSON value. We append to it every time and keep the master in a separate table. So there is one master and one child record at a minimum, and more child records as the messages grow beyond a certain number (30 in our scenario); a kind of "load more..." query fetches a second child record which holds 30 more.
Hope this helps.
FYI, we are migrating to CouchDB for some other reasons and needs.
There are at least 2 different parts to this problem:
Processing the messages for storage in the database
What type of storage to use for the message
For processing the messages, you're likely going to need a horizontally scalable system (meaning you can add more machines to process the messages quickly) so you don't accumulate a huge backlog of messages. You should definitely not try to write these messages synchronously, but rather when a message is received, put it on a queue to be processed for writing to the database (something like JMS comes to mind here).
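As a small sketch of that queueing step (ActiveMQ is used only as an example JMS provider; the broker URL and queue name are made up):
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;
public class MessageEnqueuer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("chat.messages");   // hypothetical queue name
        MessageProducer producer = session.createProducer(queue);
        // Enqueue instead of writing to the DB synchronously; a separate pool of
        // consumers drains the queue and performs the (batched) database writes.
        producer.send(session.createTextMessage("{\"user\":42,\"body\":\"hello\"}"));
        producer.close();
        session.close();
        connection.close();
    }
}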
In terms of data storage, MySQL is a relational database, but it doesn't sound like you are really doing any relational data processing, rather just storing a large amount of data. I would suggest looking into a NoSQL database (as others have suggested here as well) such as MongoDB, Cassandra, CouchDB, etc. They each have their strengths and weaknesses (you can read more about each of them on their respective websites and elsewhere on the internet).
I guess typical access would involve retrieving at least all the text of one chat session.
The number of rows is large and your data is not so much relational. This is a good fit for Non-Relational database.
If you still want to go with MySQL, use partitions. While writing, use batch inserts, and while reading, provide sufficient partition pruning hints in your queries. Use EXPLAIN PARTITIONS to check whether partitions are being pruned. In this case I would strongly recommend that you combine the chat lines of one chat session into a single row. This will dramatically reduce the number of rows compared to one chat line per row.
You didn't mention how many days of data you want to store.
On a separate note: how successful would your app have to be in terms of users to require 200k messages per second? An active chat session might generate about 1 message every 5 seconds from one user. For ease of calculation let's make it 1 second. So you are building capacity for 200K online users, which implies you would have at least a few million users.
It is good to think of scale early. However, it requires engineering effort. And since resources are limited, allocate them carefully for each task (Performance/UX etc). Spending more time on UX, for example, may yield a better ROI. When you get to multi-million user territory, new doors will open. You might be funded by an Angel or VC. Think of it as a good problem to have.
My 2 cents.

Best way to sort the data : DB Query or in Application Code

I have a MySQL table with some data (over a million rows). I have a requirement to sort the data based on the criteria below:
1) Newest
2) Oldest
3) top rated
4) least rated
What is the recommended solution to develop the sort functionality?
1) For every sort request, execute a DB query with the required joins and orderBy conditions and return the sorted data.
2) Get all the data (unsorted) from the table, put the data in a cache, and write custom comparators (Java) to sort the data.
I am leaning towards #2, as the load on the DB happens only once. Moreover, application code is better than a DB query.
Please share your thoughts....
Thanks,
Karthik
Do as much in the database as you can. Note that if you have 1,000,000 rows, returning all million is nearly useless. Are you going to display this on a web site? I think not. Do you really care about the 500,000th least popular post? Again, I think not.
So do the sorts in the database and return the top 100, 500, or 1000 rows.
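For example, a minimal JDBC sketch of that idea; the posts table and its columns are invented for illustration, and swapping ORDER BY created_at for rating covers the other sort orders:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
public class SortedPageQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/app", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     // sorting and limiting stay in the database; only the page you show comes back
                     "SELECT id, title, rating, created_at FROM posts ORDER BY created_at DESC LIMIT ?")) {
            ps.setInt(1, 100);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " " + rs.getString("title"));
                }
            }
        }
    }
}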
It's much faster to do it in the database:
1) The database is optimized for I/O operations and can use indices and other DB optimizations to improve the response time.
2) Taking the data from the database into the application pulls all of the data into memory. The app would have to look at all the data to reorder it, without optimized algorithms.
3) The database only takes the minimum necessary data into memory, which can be much less than all the data that would have to be moved to Java.
4) You can always create extra indices on the database to improve query performance.
I would say that the operation on the DB will always be faster. You should ensure that caching on the DB is on and working properly. Ensure that you are not using now() in your query, because it will disable the MySQL query cache. Take a look here at how the MySQL query cache works. In short, a query is cached based on its string, so if the query string differs every time you fetch, no cache is used.
AFAIK it should usually run faster if you let the DB sort your data.
And regarding code at the application level vs the DB level, I would agree in the case of stored procedures, but sorting in SELECTs is fine IMHO.
If you want to show the data to the user, also consider paging (in which case you're better off with sorting at the DB level anyway).
Fetching a million rows from the database sounds like a terrible idea. It will generate a lot of network traffic and require quite some time to transfer all the data, not to mention the amount of memory you would need to allocate in your application for storing a million objects.
So if you can fetch only a subset with a query, do that. Overall, do as much filtering as you can in the database.
And I do not see any problem with ordering in a single query. You can always use UNION if you can't do it as one SELECT.
You do not have four tasks, you have two:
sort newest IS EQUAL TO sort oldest
AND
sort top rated IS EQUAL TO sort least rated.
So you need to make two calls to the db. Yes, sort in the db. Then, instead of calling to sort every time, do this (a small caching sketch follows the list):
1] track the timestamp of the latest record in the db
2] before calling to sort and retrieve the entire list, check whether that timestamp has changed
3] if it has not changed, use the list you have in memory
4] if it has changed, update the list
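A rough sketch of that refresh check, assuming a posts table with a created_at column (both names invented here):
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.Timestamp;
import java.util.List;
public class SortedListCache {
    private Timestamp lastSeen;        // timestamp of the newest record when we last fetched
    private List<String> cachedSorted; // the sorted list held in memory
    public List<String> getSorted(Connection conn) throws Exception {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT MAX(created_at) FROM posts")) {
            rs.next();
            Timestamp latest = rs.getTimestamp(1);
            // only rerun the expensive sorted query when something new has arrived
            if (cachedSorted == null || lastSeen == null || latest.after(lastSeen)) {
                cachedSorted = fetchSortedFromDb(conn);
                lastSeen = latest;
            }
        }
        return cachedSorted;
    }
    private List<String> fetchSortedFromDb(Connection conn) {
        // the ORDER BY ... LIMIT query would go here
        return List.of();
    }
}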
I know this is an old thread, but it comes up in my search, so I'd like to post my opinion.
I'm a bit old school, but for that many rows I would consider dumping the data from your database (each RDBMS has its own method; it looks like the mysqldump command for MySQL: Link).
You can then process this with sorting algorithms or tools that are available in your Java libraries or operating system.
Be careful about the work you're asking your database to do. Remember that it has to be available to service other requests. Don't "bring it to its knees" servicing only one request, unless it's a nightly batch cycle type of scenario and you're certain it won't be asked to do anything else.

Is it faster to access a java list (arraylist) compared to accessing the same data in a mysql database?

I have the MySQL database on the local machine where I'm running the Java program.
I plan to create an array list of all the entries of a particular table. From that point onwards I will not access the database to get a particular entry in the table; instead I will use the array list created. Is this going to be faster or slower compared to accessing the database to grab a particular entry from the table?
Please note that the table I'm interested has about 2 million entries.
Thank you.
More info: I need only two fields, one of type Long and one of type String. The index of the table is a Long, not an int.
No, it's going to be much slower, because to find an element in an ArrayList you have to scan the list sequentially until your element is found.
It can be faster for a few hundred entries, because you don't have the connection overhead, but with two million entries MySQL is going to win, provided that you create the correct indexes. Only retrieve the rows that you actually need each time.
Why are you thinking of doing this? Are you experiencing slow queries?
To find out, activate the slow query log in your my.cnf by uncommenting (or adding) the following lines.
# Here you can see queries with especially long duration
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time = 1
Then see which queries take a long time, and run them with EXPLAIN in front. Consider adding an index where the EXPLAIN output tells you none is being used, or just post a new question with your CREATE TABLE statement and an example query to optimize.
This question is too vague and can easily go either way depending on:
How many fields are in each record, and how big are the fields?
What kind of access are you going to perform? Text search? Sequential?
For example, if each record consists of a couple of bytes of data, it's much faster to store them all in memory (not necessarily in an ArrayList, though). You may want to put them into a TreeSet, for example.
It depends on what you will do with the data. If you just want a few rows, only those should be fetched from the DB. If you know that you need ALL the data, go ahead and load the whole table into Java if it fits in memory. What will you do with it afterwards? Sequential or random reading? Will the data be changed? A Map or Set could be a faster alternative depending on how the collection will be used.
Whether it is faster or slower is measurable: time it. It is definitely faster to work with structures stored in memory than with data tables located on disk, provided that you have enough memory and you do not have 20 users running the same process at the same time.
How do you access the data? Do you have an integer index?
First, accessing an ArrayList is much, much faster than accessing a database; accessing memory is much faster than accessing a hard disk.
If the number of entries in the array is big, and I guess it is, then you should consider using a "direct access" data structure such as a HashMap, which will act like a database table where values are referenced by their keys.
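A small sketch of that idea, assuming the two fields mentioned in the question (a Long key and a String value; the table and column names are invented):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;
public class InMemoryLookup {
    public static void main(String[] args) throws Exception {
        Map<Long, String> byId = new HashMap<>();
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/app", "user", "password");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, value FROM entries")) {
            while (rs.next()) {
                byId.put(rs.getLong("id"), rs.getString("value"));   // load once up front
            }
        }
        // afterwards, lookups by key are O(1) instead of scanning an ArrayList
        System.out.println(byId.get(42L));
    }
}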

Java handling large amounts of data

I have a Java application that needs to display large amounts of data (on the order of 1 million data points). The data doesn't all need to be displayed at the same time but rather only when requested by a user. The app is a desktop app that is not running with an app server or hitting any centralized database.
My thought was to run a database on the machine and load the data into it. The DB will be read-only most of the time, so I should be able to use indexes to help optimize queries. Since I'm running on a local system, I'm not sure if I should try to implement some caching (I'm not sure how fast the queries will run; I'm currently working on them).
Is this a logical way to approach the problem, or would there be better approaches?
Thanks,
Jeff
Display and data are two different things.
You don't give any details about either, but it could be possible to generate the display in the background, bringing in the data one slice at a time, and then displaying it when it's ready. Lots of anything could cause memory issues, so you'll need to be careful. The database will help persist things, but it won't help you get ten pounds of data into your five-pound memory bag.
UPDATE: If individuals are only reading a few points at a time, and display isn't an issue, then I'd say that any database will be able to handle it if you index the table appropriately. One million rows isn't a lot for a capable database.
Embedded DB seems reasonable. Check out JavaDB/Derby or H2 or HSQLDB.
Sqlite with a java wrapper is fine too.
It really depends on your data. Do multiple instances request the data? If not, it is definitely worth looking at a simple SQLite database as the storage. It is just a single file on your file system. No need to set up a server.
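As one concrete (hypothetical) example, an embedded H2 database is just a JDBC URL pointing at a local file; the table and path here are made up:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class EmbeddedDbExample {
    public static void main(String[] args) throws Exception {
        // H2 in embedded mode stores everything in a local file; there is no server to set up
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./data/points", "sa", "");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS points (id BIGINT PRIMARY KEY, x DOUBLE, y DOUBLE)");
            st.execute("CREATE INDEX IF NOT EXISTS idx_points_x ON points(x)");  // read-mostly, so index freely
            try (ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM points")) {
                rs.next();
                System.out.println("points stored: " + rs.getLong(1));
            }
        }
    }
}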
Well, it depends on the data size. 1 million integers, for example, isn't that much, but 1 million data structures/classes of, let's say, 1000 bytes each is a lot.
For small data: keep them in memory
For large data: i think using the DB would be good.
Just my opinion :)
edit:
Of course it depends also on the speed you want to achieve. If you really need high speed and the data is big you could also cache some of them in memory and leave the rest in the db.
