Given this method signature:
public void process(RequestObject request, Callback<RequestObject> callback);
where the Callback is invoked once the request has been processed. One client has to write the status of each request to the database: it fetches some items, passes them to the method above, and the callback sets the status.
This worked fine with a small number of items and slower IO. But now the IO has been sped up and the status is written to the database very frequently, which causes my database (MySQL) to make a huge number of disk reads and writes. My disk usage goes through the roof.
I was thinking of aggregating the status updates, but power here is not reliable, so that is not a workable solution. How should I redesign this?
EDIT
When the process is started I insert a row, and when there is an update I fetch the item and update it. @user2612030 Your question led me to believe that using Hibernate might be what is causing more reads than necessary.
I can upgrade my disk drive to an SSD, but that would only do so much. I want a solution that scales.
An SSD is a good starting point, and more RAM for MySQL should also help. That won't get rid of the writes, but with enough RAM (and MySQL configured to use it!) there should be few physical reads. If you are using the default configuration, tune it. See for example https://www.percona.com/blog/2016/10/12/mysql-5-7-performance-tuning-immediately-after-installation/ or just search for MySQL memory configuration.
You could also add disks and spread the writes to multiple disks with multiple controllers. That should also help a bit.
It is hard to give good advice without knowing how you record status values. Inserts or updates? How many records are there? Data model? However, to really scale you need to shard the data somehow. That way one server can handle data in one range and another server data in another range and so on.
For write-heavy applications that is non-trivial to set up with MySQL unless you do the sharding in the application code. Most solutions with replication work best for read-mostly applications. You may want to look into a NoSQL database, for example MongoDB, that has been designed for distributing writes from the outset. MongoDB has other challenges (eventual consistency), but it can deliver scalable writes.
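For illustration, here is a minimal sketch of what sharding in the application code could look like, assuming two or more MySQL instances behind separate DataSources and a hypothetical numeric request id used as the shard key; the table and column names are made up for the example:

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ShardedStatusWriter {

    // One DataSource per shard; in practice each points at a different MySQL server.
    private final List<DataSource> shards;

    public ShardedStatusWriter(List<DataSource> shards) {
        this.shards = shards;
    }

    // Hash the request id so a given request always lands on the same server.
    private DataSource shardFor(long requestId) {
        int index = (int) Math.floorMod(requestId, (long) shards.size());
        return shards.get(index);
    }

    public void writeStatus(long requestId, String status) throws SQLException {
        try (Connection con = shardFor(requestId).getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "UPDATE request_status SET status = ? WHERE request_id = ?")) {
            ps.setString(1, status);
            ps.setLong(2, requestId);
            ps.executeUpdate();
        }
    }
}

The point is that every status write for a given request id always lands on the same server, so each MySQL instance only carries a fraction of the write load.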
Related
I am calling a web service which can return a very high volume of data (>100K records = 200 MB). I have to insert this data into SQL Server too. I have the following questions.
1. I know it depends on the server resources, but is there any ballpark advice on how much data I should hold in a Java structure (a Collection whose items each have 4-5 String members of length < 255) at run time? I am already holding 50,000 records per call (I am not sure how much memory that takes).
2. I then upload this data to the database using JDBC with a batch size of 1000. Is this the correct approach? Would there be any benefit in using JPA for this instead of JDBC?
3. Also, is there any standard design for handling this? I can think of breaking the web service calls down into pages of limited size and then using Java threads to handle them. Is this the right direction?
Thanks
First of all, "a web service which can return a very high volume of data" is not enough information. Knowing whether it returns a very high volume of data ALWAYS, ONCE IN A WHILE, X% of the time, etc. can help in designing a better system.
It's not advisable to use web services to exchange such a large quantity of data, because it also puts a strain on the physical network infrastructure, but I guess that service is not part of your system.
Your application will be very unreliable with that amount of data per hit, and you will need a very fast network to move that amount of data.
Now, coming to your points:
1. You have guessed it right: it all depends on server resources. There are applications which are comfortable with a million records in a collection, and in other places a few thousand might be too much. You have to keep heap space and the limits imposed by the OS in mind. All in all, this is very specific to an application.
The purpose of the collection plays a role too: is it for lookups, or just temporary storage to pass data around? How frequently does it get cleaned up? Is it on the stack or an object field? Is it loaded once and cleared before the next load, or does it keep growing?
2. JDBC batching is the correct approach here, not JPA (see the sketch after this list).
3. If reading data from a web service and storing it in the DB is the main flow of the job, the Spring Batch API might fit your design better.
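As a rough illustration of JDBC batching with a commit per batch, here is a minimal sketch; the table name, the column names, and the String[] row representation are assumptions made for the example:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInserter {
    private static final int BATCH_SIZE = 1000;

    // Each String[] holds the 4 column values for one record from the web service.
    public void insertAll(Connection con, List<String[]> rows) throws SQLException {
        con.setAutoCommit(false); // commit per batch instead of per row
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO imported_records (col_a, col_b, col_c, col_d) VALUES (?, ?, ?, ?)")) {
            int count = 0;
            for (String[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    ps.setString(i + 1, row[i]);
                }
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch(); // send the accumulated batch to the server
                    con.commit();
                }
            }
            ps.executeBatch(); // flush the final partial batch
            con.commit();
        }
    }
}

Committing every 1000 rows keeps transactions small while still letting the driver send the inserts to the server in bulk.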
Hope it helps !!
In Java code I am trying to fetch 3,500 rows from the DB (Oracle). It takes almost 15 seconds to load the data. I have also tried storing the result in a cache and retrieving it from there. I am using a simple SELECT statement and displaying 8 columns from a single table (no joins), using a List to hold the data from the DB and as the source for a DataTable. I have also looked at the hardware side (RAM capacity, storage, network speed, etc.); it exceeds the minimum requirements comfortably. Can you help me make this quicker (it shouldn't take more than 3 seconds)?
Have you implemented proper indexing on your tables? I hesitate to ask this since it is a very basic way of optimizing tables for queries, and you mention that you have already tried several things. One workaround that works for me is that, if the purpose of the query is to display the results, the code can be designed so that the query immediately displays the initial data while it is still loading more. This means using a separate thread for loading and a separate thread for displaying (a sketch follows below).
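Here is a minimal sketch of that idea, assuming a hypothetical loadChunk(offset, limit) DAO method and a hypothetical displayRows callback; it is not tied to any particular UI framework:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

public class IncrementalLoader {
    private static final int CHUNK_SIZE = 500;
    private final ExecutorService loader = Executors.newSingleThreadExecutor();

    public interface RowDao {
        // Hypothetical DAO method: fetch up to 'limit' rows starting at 'offset'.
        List<Object[]> loadChunk(int offset, int limit);
    }

    // Loads rows chunk by chunk on a background thread and hands each chunk to the
    // display callback as soon as it arrives, so the first rows appear without
    // waiting for the full result set.
    public void load(RowDao dao, Consumer<List<Object[]>> displayRows) {
        loader.submit(() -> {
            int offset = 0;
            while (true) {
                List<Object[]> chunk = dao.loadChunk(offset, CHUNK_SIZE);
                if (chunk.isEmpty()) {
                    break;
                }
                displayRows.accept(chunk); // e.g. append to the DataTable on the UI thread
                offset += chunk.size();
            }
        });
    }
}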
It is most likely that the core problem is that you have one or more of the following:
a poorly designed schema,
a poorly designed query,
a badly overloaded database, and / or
a badly overloaded / underprovisioned network connection between the database and your client.
No amount of changing the client side (Java) code is likely to make a significant difference (i.e. a 5-fold increase) ... unless you are doing something crazy in the way you are building the list, or the bottleneck is in the display code not the retrieval.
You need to use some client-side and server-side performance tools to figure out whether the real bottleneck is the client, the server or the network. Then use those results to decide where to focus your attention.
I'm new to open-source stacks and have been playing with Hibernate/JPA/JDBC and memcached. I have a large data set per JDBC query and will possibly have a number of these large data sets that I eventually bind to a chart.
However, I'm very focused on performance, and I don't want to hit the database on every page load just to display the chart on my web page.
Are there examples of how (memcached, Redis, local or distributed) and where to cache this data (JSON or raw result data) so it can be served from memory? I also need to figure out how to refresh the cache, unless it uses time-based eviction (e.g. entries expire after 30 minutes, so grab new data from a database query instead of the cache, or perhaps an automated feed pushes data into the cache every x hours/minutes).
Thanks!
This is a typical problem and the solution is not straightforward. There are many factors which determine your design. Here is what we did some time ago.
Since our queries to extract the data were a bit complex (around a minute to execute) and the data set was large, we populated memcached from a batch job that pulled data from the database every hour and pushed it to memcached. By keeping the cache expiry longer than the batch interval, we made sure that there would always be data in the cache (a sketch follows below).
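For illustration, a minimal sketch of that pattern, assuming the spymemcached client library and a hypothetical loadReportData() query method; the cache key, host, port, and intervals are made up for the example:

import net.spy.memcached.MemcachedClient;

import java.io.IOException;
import java.io.Serializable;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ChartCacheRefresher {
    private static final String KEY = "chart:report-data";     // example cache key
    private static final int EXPIRY_SECONDS = 2 * 60 * 60;     // expiry longer than the refresh interval

    private final MemcachedClient memcached;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public ChartCacheRefresher() throws IOException {
        this.memcached = new MemcachedClient(new InetSocketAddress("localhost", 11211));
    }

    // Hypothetical slow query that builds the serializable data backing the chart.
    private Serializable loadReportData() {
        // ... run the expensive SQL here ...
        return new ArrayList<String>();
    }

    // Refresh the cache every hour; readers always hit memcached, never the slow query.
    public void start() {
        scheduler.scheduleAtFixedRate(
                () -> memcached.set(KEY, EXPIRY_SECONDS, loadReportData()),
                0, 1, TimeUnit.HOURS);
    }
}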
There was another use case for dynamic caching: on receiving a request for data, we first checked memcached, and if the data was not found, we queried the database, fetched the data, pushed it to memcached, and returned the results. But I would advise this approach only when your database queries are simple and fast enough not to hurt the overall response time.
You can also use Hibernate's second-level cache. Whether you can use this feature efficiently depends on your database schema, queries, etc.
Hibernate has built-in support for 2nd level caching. Take a look at EhCache for example.
Also see: http://docs.jboss.org/hibernate/orm/3.3/reference/en/html/performance.html#performance-cache
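As a rough sketch of what enabling the second-level cache for an entity can look like (the entity, fields, and region name are made up; the exact configuration properties depend on your Hibernate/EhCache versions):

import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// Entities marked like this are kept in the second-level cache, so repeated lookups
// by id can be served from memory instead of hitting the database each time.
// The Hibernate configuration also needs the cache enabled, e.g.
// hibernate.cache.use_second_level_cache=true plus a region factory for EhCache
// (exact property values depend on the versions in use).
@Entity
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE, region = "chartData")
public class ChartDataPoint {

    @Id
    private Long id;

    private String label;
    private Double value;

    // getters and setters omitted for brevity
}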
I am a Java developer. I want to know the best way to store huge amounts of data into MySQL using Java.
Huge: two hundred thousand talk messages every second.
An index is not needed here
Should I store the messages into the database as soon as the user creates them? Will it be too slow?
1 billion writes a day is about 12K per second. Assuming each message is about 16 bytes, that's roughly 200 KB of data per second. If you don't care about reading, you can easily write this to disk at that rate, maybe one message per line. Your read access pattern is probably going to dictate what you end up needing to do here.
If you use MySQL, I'd suggest combining multiple messages per row, if possible. Partitioning the table would be helpful to keep the working set in memory, and you'll want to commit a number of records per transaction, maybe 1000 rows. You'll need to do some testing and tuning, and this page will be helpful:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
You should probably also look at Cassandra which is written with heavy write workloads in mind.
My suggestion is also MongoDB, since the NoSQL paradigm fits your needs perfectly.
Below is a flavor of MongoDB in Java:
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

// connect (host/port are placeholders) and get the target collection - legacy MongoDB Java driver API
MongoClient mongoClient = new MongoClient("localhost", 27017);
DBCollection collection = mongoClient.getDB("mkyongDB").getCollection("hosting");
// build a document with a nested sub-document and insert it
BasicDBObject document = new BasicDBObject();
document.put("database", "mkyongDB");
document.put("table", "hosting");
BasicDBObject documentDetail = new BasicDBObject();
documentDetail.put("records", "99");
documentDetail.put("index", "vps_index1");
documentDetail.put("active", "true");
document.put("detail", documentDetail);
collection.insert(document);
This tutorial is good to get started. You can download MongoDB from GitHub.
For optimization of MongoDB, please refer to this post.
Do you absolutely have to use MySQL, or are you open to other DBs as well? MongoDB or CouchDB would be a good fit for this kind of need. Check them out if you are open to other DB options.
If you absolutely have to go with MySQL, then we have done something similar: all the related text messages go into a child record as a single JSON document. We append to it every time, and we keep the master in a separate table. So there is one master and at least one child record, with more child records as the messages grow beyond a certain number (30 in our scenario); a kind of "load more..." query fetches the second child record, which holds 30 more.
Hope this helps.
FYI, we are migrating to CouchDB for some other reasons and needs.
There are at least 2 different parts to this problem:
Processing the messages for storage in the database
What type of storage to use for the message
For processing the messages, you're likely going to need a horizontally scalable system (meaning you can add more machines to process the messages quickly) so you don't accumulate a huge backlog of messages. You should definitely not try to write these messages synchronously, but rather when a message is received, put it on a queue to be processed for writing to the database (something like JMS comes to mind here).
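As a minimal in-process sketch of that decoupling (in production the queue would typically be JMS or another message broker rather than an in-memory BlockingQueue, and the storeBatch method here is hypothetical):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MessageWritePipeline {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(100_000);

    // Called on the receiving path: never blocks on the database, just enqueues.
    public boolean accept(String message) {
        return queue.offer(message);
    }

    // Consumer thread: drains messages and writes them to storage in batches.
    public void startConsumer() {
        Thread consumer = new Thread(() -> {
            List<String> batch = new ArrayList<>();
            try {
                while (true) {
                    batch.add(queue.take());   // wait for at least one message
                    queue.drainTo(batch, 999); // then grab up to 999 more
                    storeBatch(batch);         // hypothetical batched write to the DB
                    batch.clear();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();
    }

    private void storeBatch(List<String> batch) {
        // ... batched insert into MySQL / MongoDB / Cassandra goes here ...
    }
}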
In terms of data storage, MySQL is a relational database, but it doesn't sound like you are really doing any relational data processing, rather just storing a large amount of data. I would suggest looking into a NoSQL database (as others have suggested here as well) such as MongoDB, Cassandra, CouchDB, etc. They each have their strengths and weaknesses (you can read more about each of them on their respective websites and elsewhere on the internet).
I guess typical access would involve retrieving at least all the text of one chat session.
The number of rows is large and your data is not very relational. This is a good fit for a non-relational database.
If you still want to go with MySQL, use partitions. While writing, use batch inserts, and while reading, provide sufficient partition-pruning hints in your queries. Use EXPLAIN PARTITIONS to check whether partitions are being pruned. In this case I would strongly recommend that you combine the chat lines of one chat session into a single row. This will dramatically reduce the number of rows compared to one chat line per row.
You didn't mention how many days of data you want to store.
On a separate note: how successful would your app have to be, in terms of users, to require 200K messages per second? An active chat session may generate about 1 message every 5 seconds from one user. For ease of calculation, let's make it 1 second. So you are building capacity for 200K online users, which implies you would have at least a few million users.
It is good to think of scale early. However, it requires engineering effort. And since resources are limited, allocate them carefully for each task (Performance/UX etc). Spending more time on UX, for example, may yield a better ROI. When you get to multi-million user territory, new doors will open. You might be funded by an Angel or VC. Think of it as a good problem to have.
My 2 cents.
Modern databases provide caching support. Most ORM frameworks cache retrieved data too. Why is this duplication necessary?
Because to get the data from the database's cache, you still have to:
Generate the SQL from the ORM's "native" query format
Do a network round-trip to the database server
Parse the SQL
Fetch the data from the cache
Serialise the data to the database's over-the-wire format
Deserialize the data into the database client library's format
Convert the database client library's format into language-level objects (i.e. a collection of whatevers)
By caching at the application level, you don't have to do any of that. Typically it's a simple lookup in an in-memory hashtable (a sketch follows below). Sometimes (if caching with memcached) there's still a network round-trip, but all of the other stuff no longer happens.
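For illustration, a minimal sketch of such an application-level lookup, assuming a hypothetical loadUserNameFromDatabase method:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class UserNameCache {
    private final Map<Long, String> cache = new ConcurrentHashMap<>();

    // Hypothetical loader; in a real application this would be an ORM or JDBC query.
    private String loadUserNameFromDatabase(Long userId) {
        // ... SELECT name FROM users WHERE id = ? goes here ...
        return "placeholder";
    }

    // Cache hit: a plain in-memory map lookup -- no SQL generation, no network
    // round-trip, no parsing, no serialization. Cache miss: load once, then reuse.
    public String getUserName(Long userId) {
        return cache.computeIfAbsent(userId, this::loadUserNameFromDatabase);
    }
}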
Here are a couple of reasons why you may want this:
An application caches just what it needs so you should get a better cache hit ratio
Accessing a local cache will probably be a couple of orders of magnitude faster than accessing the database due to network latency - even with a fast network
Scaling read-write transactions using a strongly consistent cache
Scaling read-only transactions can be done fairly easily by adding more Replica nodes.
However, that does not work for the Primary node, which can only be scaled vertically.
And that's where a cache comes into play. For read-write database transactions that need to be executed on the Primary node, the cache can help you reduce the query load by directing it to a strongly consistent cache, like the Hibernate second-level cache.
Using a distributed cache
Storing an application-level cache in the memory of the application is problematic for several reasons.
First, the application memory is limited, so the volume of data that can be cached is limited as well.
Second, when traffic increases and we want to start new application nodes to handle the extra traffic, the new nodes start with a cold cache, making the problem even worse, as they cause a spike in database load until the cache is populated with data.
To address this issue, it's better to have the cache running as a distributed system, like Redis. This way, the amount of cached data is not limited by the memory size on a single node since sharding can be used to split the data among multiple nodes.
And, when a new application node is added by the auto-scaler, the new node will load data from the same distributed cache. Hence, there's no cold cache issue anymore.
Even if a database engine caches data, indexes, or query result sets, it still takes a round-trip to the database for your application to benefit from that cache.
An ORM framework runs in the same space as your application. So there's no round-trip. It's just a memory access, which is generally a lot faster.
The framework can also decide to keep data in cache as long as it needs it. The database may decide to expire cached data at unpredictable times, when other concurrent clients make requests that utilize the cache.
Your application-side ORM framework may also cache data in a form that the database can't return, e.g. as a collection of Java objects instead of a stream of raw data. If you rely on database caching, your ORM has to repeat that transformation into objects, which adds overhead and reduces the benefit of the cache.
Also, the database's cache might not be as practical as one thinks. I copied this from http://highscalability.com/bunch-great-strategies-using-memcached-and-mysql-better-together -- it's MySQL specific, though.
Given that MySQL has a cache, why is memcached needed at all?
The MySQL cache is associated with just one instance. This limits the cache to the maximum address space of one server. If your system is larger than the memory of one server, then using the MySQL cache won't work. And if the same object is read from another instance, it's not cached.
The query cache invalidates on writes. You build up all that cache and it goes away when someone writes to it. Your cache may not be much of a cache at all depending on usage patterns.
The query cache is row based. Memcached can cache any type of data you want, and it isn't limited to caching database rows. Memcached can cache complex objects that are directly usable without a join.
The performance considerations related to the network roundtrips have correctly been pointed out.
To that, it must be added that caching data anywhere other than in the DBMS (NOT "the database") creates the problem of potentially obsolete data still being presented as "up to date".
Giving in to the temptation of performance improvement comes at the expense of losing the (watertight, or at least close to watertight) guarantee of reliably correct and consistent data.
Consider this whenever accuracy and consistency are crucial.
A lot of good answers here. I'll add one other point: I know my access pattern, the database doesn't.
Depending on what I'm doing, I know that if the data ends up stale, that's not really a problem. The DB doesn't know that, and would have to reload the cache with the new data.
I know that I'll come back to a piece of data a few times over the next while, so it's important to keep it around. The DB has to guess at what to keep in its cache; it doesn't have the information I do. So if I fetch the data from the DB over and over, it may not be in the cache when the server is busy, and I could get a cache miss. With my own cache, I can be sure I get a hit. This is especially true for data that is non-trivial to get (i.e. a few joins, some group functions) as opposed to a single row. Getting the row with primary key 7 is easy for the DB, but if it has to do some real work, the cost of a cache miss is much higher.
No doubt modern databases provide caching facilities, but when your site has more traffic and you need to perform many database transactions, the database cache alone will not give you high performance. In that case the Hibernate cache will help you by optimizing the database application: it stores data already loaded from the database, so that the traffic between the application and the database is reduced when the application wants to access that data again. Both the access time and the traffic between the application and the database are reduced.
That said, caches can sometimes become a burden and actually slow down the server. When you have high load, the algorithm for what is cached and what is not might not fit the requests coming in... what you get is a cache that starts to operate like a FIFO over time. This begins to make itself known when the table behind the cache has significantly more records than will ever be cached in memory.
A good trade-off is to cluster the data you want to cache. Have a main server which pumps updates to the clusters; the time at which to send/pump the updates should be tailorable per table depending on TTL (time to live) settings.
Your logic and data on the user node can then sit on the same server, which opens up in-memory databases, or if it does have to fetch data, you could set it up to use a pipe instead of a network call.
This takes some thought about how you want to use the data, and if/when you cluster, you have to be aware of distributed transactions (transactions over more than one database). But if the data being cached is updated on its own, without links into other DB spaces, then you can get away with this.
The problem with ORM caching is that if the database is updated independently through another application, the ORM cache can become out of date. It can also get tricky if you do an update to a set: the update might touch something that is in your cache, and you need some sort of algorithm to identify which records have to be removed or updated in memory (slowing down the update!?), and that algorithm becomes incredibly tricky and bug-prone.
If you use ORM caching, then keep to a simple rule: cache simple objects that hardly ever change (user/role details, for example), that are small in size, and that are hit many times in a request. If your use falls outside of this, then I suggest clustering the data for performance.