I am a Java developer and my application runs on iOS and Android. I have created a web service for it using the Restlet framework, with JDBC for DB connectivity.
My problem is that I have three types of data, grouped into what I call intersections: current, past, and future. Each intersection contains a list of users. A single web service returns all the users in a device owner's intersections. I have implemented pagination, but the server still has to process all of the user's intersections and only then return the (start-end) slice to the device. I did this because a past user may also turn up in current. That is the whole logic.
But as the intersections in a profile grow, the server has to process every user, so it becomes slow, which is to be expected. On top of that, the device calls this web service every 5 minutes.
Please suggest a better way to handle this scenario.
Thanks in advance.
Ketul Rathod
It's a little hard to follow your logic, but it sounds like you can probably benefit from caching your results on the server.
If it makes sense, every time you process the user data on the server, save the results (to a file, to a database table, whatever). Then, 5 minutes later, if there are no changes, simply return the same results. If there were changes, retrieve the previous results from the cache (optionally invalidating the cache in the process), append the changes to what was cached, re-save the results in the cache, and return them.
If this is applicable to your workflow, your server-side processing time will be significantly less.
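For illustration, here is a minimal sketch of such a cache in Java. IntersectionService and UserRecord are hypothetical stand-ins for your own types, and the TTL matches the 5-minute polling interval you described:

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a per-user result cache with a time-to-live.
// IntersectionService and UserRecord are hypothetical stand-ins.
public class IntersectionCache {
    private static final long TTL_MILLIS = 5 * 60 * 1000; // the device's polling interval

    private static final class Entry {
        final List<UserRecord> users;
        final long computedAt;
        Entry(List<UserRecord> users, long computedAt) {
            this.users = users;
            this.computedAt = computedAt;
        }
    }

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final IntersectionService service; // the expensive current+past+future merge

    public IntersectionCache(IntersectionService service) {
        this.service = service;
    }

    // Serves the requested page, recomputing the full result only when the
    // cached copy is older than the TTL.
    public List<UserRecord> getPage(String profileId, int start, int end) {
        long now = System.currentTimeMillis();
        Entry entry = cache.compute(profileId, (id, old) ->
                (old != null && now - old.computedAt < TTL_MILLIS)
                        ? old
                        : new Entry(service.computeIntersections(id), now));
        List<UserRecord> all = entry.users;
        return all.subList(Math.min(start, all.size()), Math.min(end, all.size()));
    }
}

interface IntersectionService {
    List<UserRecord> computeIntersections(String profileId);
}

class UserRecord { /* id, name, intersection type, ... */ }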
Related
I want to know whether it is useful to use a ConcurrentHashMap for user data. I have the user data saved in a MySQL database and retrieve it when a user logs in (or when someone edits the user). Every time the user goes to another page, this user data is refreshed. Should I use a map and apply changes from my application there, keeping the database in the background, or should I read directly from the DB? I want to make the application as performant as possible.
What you are describing is a cache. Suppose the calls to the database cost a lot because there is a lot of data to load, or because the query used to extract the data is complex and takes a long time. This is where the cache data structure comes into play. It is basically in-memory storage, which is much faster than querying the database, precisely because the data is already loaded in memory.
Filling the cache takes about the same time as querying the DB for the data (generally a bit more, but of the same order), so a cache only makes sense if it saves time overall. There is a trade-off, though: speed vs. freshness of data. Depending on your use case you must find the right compromise between the two, and only then will you find out whether it is really worthwhile.
For what you describe, i.e. user updates that need to be saved and displayed, using a cache seems a bit of an overkill IMO, unless you have a lot of registered users and many of them use the system simultaneously. If you decide to use one, keep in mind the concurrency issues that may arise. ConcurrentHashMap saves you from many hazards, but at some cost in performance.
If performance is the priority, I think you should keep the logged-in users in memory.
That way, the read requests would be fast, as you would not need to query the database. However, you would need to update the map whenever any of the logged-in users is edited.
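A minimal sketch of that approach, assuming a hypothetical UserDao for the MySQL reads:

import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch: keep logged-in users in memory, refresh an entry on edit.
// UserDao and User are hypothetical stand-ins for your own types.
public class LoggedInUsers {
    private final ConcurrentHashMap<Long, User> byId = new ConcurrentHashMap<>();
    private final UserDao dao;

    public LoggedInUsers(UserDao dao) { this.dao = dao; }

    public void onLogin(long userId) {
        byId.put(userId, dao.load(userId)); // one DB read at login
    }

    public void onLogout(long userId) {
        byId.remove(userId);
    }

    // Called by whatever code path edits a user, so the map never serves stale data.
    public void onUserEdited(long userId) {
        byId.computeIfPresent(userId, (id, old) -> dao.load(id));
    }

    // Page views read from memory instead of hitting MySQL.
    public User get(long userId) {
        return byId.get(userId);
    }
}

interface UserDao { User load(long id); }
class User { /* fields omitted */ }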
A human cannot tell the difference between a 1ms delay and a 50ms delay. So it is overkill to optimize beyond "good enough".
MySQL already does a flavor of caching; your addition of another cache may actually slow down the response time.
Given this method:
public void process(RequestObject request, Callback<RequestObject> callback);
where Callback is invoked when the request has been processed. One client has to set the status of the request in the database. The client fetches some items, passes them to the method above, and the callback sets the status.
This was working fine with a small number of items and slower IO. But now the IO has sped up and the status is written to the database very frequently. This causes my database (MySQL) to make a huge number of disk read/write calls, and my disk usage goes through the roof.
I was thinking of aggregating the status updates, but power is not reliable here, so that is not a plausible solution. How should I redesign this?
EDIT
When the process is started I insert a value, and when there is an update I fetch the item and update it. @user2612030, your question leads me to believe that using Hibernate might be what is causing more reads than necessary.
I can upgrade my disk drive to an SSD, but that would only do so much. I want a solution that scales.
An SSD is a good starting point, and more RAM for MySQL should also help. It can't get rid of the writes, but with enough RAM (and MySQL configured to use it!) there should be few physical reads. If you are using the default configuration, tune it. See for example https://www.percona.com/blog/2016/10/12/mysql-5-7-performance-tuning-immediately-after-installation/ or just search for MySQL memory configuration.
You could also add disks and spread the writes across multiple disks with multiple controllers. That should also help a bit.
It is hard to give good advice without knowing how you record the status values. Inserts or updates? How many records are there? What does the data model look like? To really scale, however, you need to shard the data somehow, so that one server handles data in one range, another server handles data in another range, and so on.
For write-heavy applications that is non-trivial to set up with MySQL unless you do the sharding in the application code. Most replication-based solutions work best for read-mostly applications. You may want to look into a NoSQL database, for example MongoDB, which was designed to distribute writes from the outset. MongoDB has other challenges (eventual consistency), but it can deliver scalable writes.
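If you do end up sharding in application code, a minimal sketch of range-based routing might look like this. The two DataSources, the table, and the id boundary are assumptions for illustration:

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Minimal sketch of range-based sharding done in application code.
public class ShardedStatusWriter {
    private final DataSource shardA; // assumed to hold ids 0 .. 4,999,999
    private final DataSource shardB; // assumed to hold ids 5,000,000 and up

    public ShardedStatusWriter(DataSource shardA, DataSource shardB) {
        this.shardA = shardA;
        this.shardB = shardB;
    }

    private DataSource route(long itemId) {
        return itemId < 5_000_000L ? shardA : shardB;
    }

    public void writeStatus(long itemId, String status) throws SQLException {
        try (Connection c = route(itemId).getConnection();
             PreparedStatement ps = c.prepareStatement(
                     "UPDATE item SET status = ? WHERE id = ?")) {
            ps.setString(1, status);
            ps.setLong(2, itemId);
            ps.executeUpdate();
        }
    }
}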
I am calling a web service which can return a very high volume of data (>100K records, roughly 200 MB). I also have to insert this data into SQL Server. I have the following questions.
1. I know it depends on the server resources, but is there any ballpark advice on how much data I should hold in a Java structure (a Collection whose items each have 4-5 String members of length < 255) at run time? I am already using 50,000 records per call (and I am not sure how much memory that takes).

2. I then upload this data to the database in batches of 1000 using JDBC. Is this the correct approach? Would there be any benefit to using JPA instead of JDBC?

3. Is there any standard design for handling this? I can think of breaking the web service calls into pages of limited size and then using Java threads to handle them. Is this the right direction?
Thanks
First of all, "a web service which can return a very high volume of data" is not enough information. Knowing whether it returns that volume ALWAYS, ONCE IN A WHILE, X% of the time, etc. can help in designing a better system.
It is not advisable to use web services to exchange such a large quantity of data, because it also puts a strain on the physical network infrastructure, but I guess that service is not part of your system.
Your application will be very unreliable with that amount of data per hit, and you will need a very fast network to move that much data.
Now, coming to your points:
1. You have guessed it right: it all depends on server resources. There are applications that are comfortable with a million records in a collection, and in other places a few thousand might be too much. You have to keep heap space and the limits imposed by the OS in mind. All in all, this is very specific to each application.
The purpose of the collection plays a role too: is it for lookups, or just temporary storage to pass data around? How frequently does it get cleaned up? Is it on the stack or an object field? Is it loaded once and cleared before the next load, or does it keep growing?
2. JDBC batching is the correct approach, not JPA; a sketch follows after this list.
3. If reading data from a web service and storing it in the DB is the main flow of the job, the Spring Batch API might fit your design better.
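As promised in point 2, here is a minimal sketch of JDBC batching with a batch size of 1000. The table, the columns, and the DataRow type are assumptions for illustration:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Minimal sketch of JDBC batch inserts, committing once per batch.
public class BatchInserter {
    private static final int BATCH_SIZE = 1000;

    public void insertAll(Connection conn, List<DataRow> rows) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // commit per batch, not per row
        String sql = "INSERT INTO record (col_a, col_b) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int count = 0;
            for (DataRow r : rows) {
                ps.setString(1, r.colA);
                ps.setString(2, r.colB);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();
                    conn.commit();
                }
            }
            ps.executeBatch(); // flush the final partial batch
            conn.commit();
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}

class DataRow { String colA; String colB; }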
Hope it helps!
I have a database with a lot of web pages stored.
I will need to process all the data I have, so I have two options: pull the data into the program, or process it directly in the database with some functions I will create.
What I want to know is:
Is doing some of the processing in the database, rather than in the application, a good idea?
When is this recommended, and when not?
Are there pros and cons?
Is it possible to extend the database language with new features (external APIs/libraries)?
I tried retrieving the content into the application (it worked), but it was too slow and messy. My worry was that I cannot do in the database what I can do in Java, but I don't know if this is true.
ONLY an example: I have a table called Token. At the moment it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know whether a word between two tokens classified as `Proper Name` is part of a name or not.
I will need to process all the data. In this case, is doing it directly in the database better than retrieving it into the application?
My worry was that I cannot do in the database what I can do in Java, but I don't know if this is true.
No, that is not a correct assumption. There are valid circumstances for processing data in the database. For example, if the work involves a lot of disparate SQL statements that can be combined into a stored procedure, you should do the processing in the stored procedure and call the stored proc from your Java application. This way you avoid making several network trips to the database server.
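For illustration, a minimal sketch of calling such a stored procedure from Java via JDBC. The procedure name and its parameters are assumptions:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;

// Minimal sketch: push the multi-statement work into a stored procedure and
// call it once, paying for a single network round trip.
public class TokenProcessor {
    public int processTokens(Connection conn, long batchId) throws SQLException {
        // process_tokens is a hypothetical procedure with one IN and one OUT parameter.
        try (CallableStatement cs = conn.prepareCall("{call process_tokens(?, ?)}")) {
            cs.setLong(1, batchId);
            cs.registerOutParameter(2, Types.INTEGER);
            cs.execute();
            return cs.getInt(2); // e.g. number of rows processed
        }
    }
}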
I do not know what you are processing, though. Are you parsing XML data stored in your database? Then perhaps you should use XQuery; a lot of modern databases support it.
ONLY an example: I have a table called Token. At the moment it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know whether a word between two tokens classified as `Proper Name` is part of a name or not.
Is there some indicator in the data that tells you it's a proper name? Fetching 10 million rows (highly susceptible to an OutOfMemoryError) and then going through them is not a good idea. If there are parameters of the data that can be put into a WHERE clause to limit how much is fetched, that is the way to go in my opinion. You will surely need to run EXPLAIN on your SQL, check that the correct indices are in place, and check the index clustering ratio and the type of index; all of that will make a difference. If you can't fully eliminate all "improper names" in SQL, get rid of as many as you can there and process the rest in your application. I am assuming this is a batch application, right? If it is a web application, then you definitely want to create a batch application to stage the data before the web application queries it.
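A minimal sketch of that idea: narrow the candidate set in the WHERE clause and stream the remainder instead of loading 10 million rows at once. The table and column names are assumptions:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Minimal sketch: filter in SQL, then stream the rest through the application.
public class ProperNameScanner {
    public void scan(Connection conn) throws SQLException {
        String sql = "SELECT id, word FROM token "
                   + "WHERE classification = 'PROPER_NAME' ORDER BY id";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            // A hint to stream in chunks rather than buffer everything;
            // some drivers need extra configuration to honor it.
            ps.setFetchSize(1000);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Apply the name/not-name logic that cannot be expressed in SQL.
                    handle(rs.getLong("id"), rs.getString("word"));
                }
            }
        }
    }

    private void handle(long id, String word) { /* application-side check */ }
}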
I hope my explanation makes sense. Please let me know if you have questions.
Directly interacting with the DB for every single operation is tedious and hurts performance. There are several ways around this: you can use indexing, caching, or tools such as Hibernate, which can cache data in memory so that you don't need to query the DB for every operation. There are also tools such as the Lucene indexer that are very popular and could solve your problem of hitting the DB every time.
I am developing an application which will be integrated with thousands of sensors sending information every 15 minutes. Let's assume that the data format is the same for all sensors. What is the best strategy for storing this data so that everything is archived (and accessible) and the growing volume of data does not have a negative impact?
The question relates to general database design, I suppose, but I would like to mention that I am using Hibernate (with Spring Roo), so perhaps there is already something out there addressing it.
Edit: the sensors are dumb, off-the-shelf devices. It is not possible to extend them. In the case of a network outage, all information is lost. Since the sensors work over GPRS, this scenario is somewhat unlikely (the GPRS provider is a rather good one here in Sweden, but yes, it can go down, and one can do nothing about it).
A queuing mechanism was foremost in consideration, and Spring Roo provides easy-to-work-with prototype code based on ActiveMQ.
I'd have a couple of concerns about this design:
Hibernate is an ORM tool. It demands an object model on one side and a relational one on the other. Do you have an object representation? If not, I'd say that Hibernate isn't the way to go. If it's a simple table mapping mechanism you'll be fine.
Your situation sounds like war: long periods of boredom punctuated by instants of sheer terror. I don't know if your design uses asynchronous mechanisms between receipt of the sensor data and the back end, but I'd want some kind of persistent queuing mechanism to guarantee delivery of all the data and to give it an orderly line while waiting to be persisted. As long as you don't need to access the data in real time, a queue will guarantee delivery and make sure you don't have thousands of requests showing up at a bottleneck at the same time.
How are you time stamping the sensor items as they come in? You might want to use a column that goes down to nanoseconds to get these right.
Are the sensors event-driven or timed?
Sounds like a great problem. Good luck.
Let's assume you have 10,000 sensors sending information every 15 minutes. To get better performance on the database side you may have to partition your database, possibly by date/time, sensor type or category, or some other factor. This also depends on how you will query your data.
http://en.wikipedia.org/wiki/Partition_(database)
The other bottleneck would be your Java/Java EE application itself. This depends on your workload; for example, are all 10,000 sensors going to send information at the same time? It also depends on what architecture your Java application follows. You will have to read articles on high scalability and performance.
Here is my recommendation for Java/Java EE solution.
Instead of a single application, have a cluster of applications receiving the data.
Have a controller application that controls which sensor sends data to which application instance in the cluster. An application instance may pull data from a sensor, or a sensor may push data to an application instance, but the controller is the one that decides which instance is linked to which set of sensors. This controller must be dynamic, so that sensors can be added, removed, or updated, and application instances can join or leave the cluster at any time. Make sure you build some failover capability into your controller.
So if you have 10,000 sensors and 10 application instances in the cluster, you have 1,000 sensors linked to each instance at any given time. If you want better performance still, you can run, say, 20 instances in the cluster, and each instance will then be linked to 500 sensors.
Application instances can be hosted on one machine or several, so that both vertical and horizontal scalability are achieved. Each application instance will be multi-threaded and have local persistence. This avoids a bottleneck on the main database server and decreases your transaction response time. The local persistence can be SAN file(s), a local RDBMS (like Java DB), or even an MQ. If you persist locally to a database, you can use Hibernate for it.
Asynchronously move the data from local persistence to the main database. How you do this depends on how the data was persisted locally:
If you use file-based persistence, you need a separate thread that reads data from the file and inserts it into the main database repository.
If you use a local database, this thread can use Hibernate to read the data locally and insert it into the main database repository (a sketch of this variant follows after this list).
If you use an MQ, you can have a thread or a separate application move the data from the queue to the main database repository.
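For the local-database variant, here is a minimal sketch of such a mover thread, using plain JDBC for brevity where the answer suggests Hibernate. The table names and the LIMIT syntax are assumptions, and real code would need transactional care so a crash between the insert and the delete cannot lose or duplicate rows:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Minimal sketch: drain the local buffer table into the main repository.
public class LocalToMainMover implements Runnable {
    private final Connection local;
    private final Connection main;

    public LocalToMainMover(Connection local, Connection main) {
        this.local = local;
        this.main = main;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                moveBatch();
                Thread.sleep(1000); // the small lag is the accepted drawback
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (SQLException e) {
            e.printStackTrace(); // real code would retry or alert
        }
    }

    private void moveBatch() throws SQLException {
        try (Statement read = local.createStatement();
             ResultSet rs = read.executeQuery(
                     "SELECT id, sensor_id, reading, ts FROM buffer LIMIT 500");
             PreparedStatement insert = main.prepareStatement(
                     "INSERT INTO sensordata (sensor_id, reading, ts) VALUES (?, ?, ?)");
             PreparedStatement delete = local.prepareStatement(
                     "DELETE FROM buffer WHERE id = ?")) {
            while (rs.next()) {
                insert.setLong(1, rs.getLong("sensor_id"));
                insert.setDouble(2, rs.getDouble("reading"));
                insert.setTimestamp(3, rs.getTimestamp("ts"));
                insert.addBatch();
                delete.setLong(1, rs.getLong("id"));
                delete.addBatch();
            }
            insert.executeBatch();
            delete.executeBatch();
        }
    }
}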
The drawback of this solution is that there will be some lag between a sensor reporting data and that data appearing in the main database.
The advantage is that it gives you high performance, scalability, and failover.
This means you are going to get about 1 record per second for every thousand sensors you have, or about 2.5 million rows per month for every thousand sensors.
Postgres has inheritance and partitioning. That would make it practical to have tables like:
sensordata_current
sensordata_2010_01
sensordata_2009_12
sensordata_2009_11
sensordata_2009_10
...
with each table containing measurements for one month. A parent table sensordata can then be created that "consists of" these child tables, meaning queries against sensordata automatically go through the child tables, but only the ones the planner deduces can contain data for that query. So if you partition your data by month (a date range), express that with a date constraint on each child table, and query by date range, then the planner, based on the child-table constraints, will exclude from the query execution those child tables that cannot contain rows satisfying the date range.
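For illustration, a minimal sketch of setting up this inheritance-based scheme over JDBC. The column names are assumptions, and the trigger or rule that routes inserts into the correct child table is omitted; see the Postgres docs for those details:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Minimal sketch of the parent table plus one constrained child partition.
public class PartitionSetup {
    public void createTables(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE sensordata ("
                     + " sensor_id bigint NOT NULL,"
                     + " reading double precision NOT NULL,"
                     + " ts timestamp NOT NULL)");
            // The CHECK constraint is what lets the planner exclude this child
            // table for queries whose date range falls outside January 2010.
            st.execute("CREATE TABLE sensordata_2010_01 ("
                     + " CHECK (ts >= DATE '2010-01-01' AND ts < DATE '2010-02-01')"
                     + ") INHERITS (sensordata)");
            st.execute("CREATE TABLE sensordata_current () INHERITS (sensordata)");
        }
    }
}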
When a month completes (say 2010 Jan just turned into 2010 Feb), you would rename sensordata_current to the just-completed month (2010_01), create a new sensordata_current, move any rows with a February timestamp from 2010_01 into the newly created sensordata_current, and finally add a constraint to 2010_01 expressing that it only holds data from 2010 Jan. Also drop any unneeded indices on 2010_01. In Postgres this can all be made atomic by wrapping it in a transaction.
Alternatively, you might leave _current alone, create a new 2010_01, and move all the January rows into it from _current (then optionally vacuum _current to immediately reclaim the space, though if your rows are of constant size there is not much point in doing that with recent Postgres versions). Your move (SELECT INTO / DELETE) will take longer in this case, but you won't have to write code to recreate indices, and it also preserves other details (referential integrity, etc.).
With this setup, removing old data is as quick and efficient as dropping child tables. Migrating old data away is efficient too, since the child tables are also accessible directly.
For more details see Postgres data partitioning.
Is it a requirement that these sensors connect directly to an application to upload their data, and that this application is responsible for writing the data to the database?
I would consider having the sensors write data to a message queue instead, and having your "write to DB" application be responsible for picking up new data from the queue and writing it to the database. This seems like a pretty classic example of "message producers" and "message consumers", i.e. the sensors and the application, respectively.
This way, the sensors are not affected if your "write to DB" application has any downtime, or if it has any performance issues or slowdowns from the database, etc. This would also allow you to scale up the number of message consumers in the future without affecting the sensors at all.
It might seem like this type of solution simply moves the possible point of failure from your consumer application to the message queue, but there are several options for making the queue fault-tolerant: clustering, persistent message storage, etc.
Apache ActiveMQ is a popular message-queue system in the Java world.
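For illustration, a minimal sketch of the producer/consumer split over ActiveMQ using the JMS API. The broker URL, queue name, and message format are assumptions:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.JMSException;
import javax.jms.MessageConsumer;
import javax.jms.MessageListener;
import javax.jms.MessageProducer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

// Minimal sketch: the sensor-facing side produces messages, and the
// "write to DB" application consumes them at its own pace.
public class SensorQueue {
    private static final String BROKER_URL = "tcp://localhost:61616";
    private static final String QUEUE = "sensor.readings";

    // Producer side: called once per sensor reading as it arrives.
    public void publish(String reading) throws JMSException {
        ConnectionFactory factory = new ActiveMQConnectionFactory(BROKER_URL);
        Connection conn = factory.createConnection();
        try {
            conn.start();
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createQueue(QUEUE));
            producer.setDeliveryMode(DeliveryMode.PERSISTENT); // survive broker restarts
            producer.send(session.createTextMessage(reading));
        } finally {
            conn.close();
        }
    }

    // Consumer side: the "write to DB" application registers a listener that
    // parses each message and inserts it into the database.
    public void consume(MessageListener writeToDb) throws JMSException {
        ConnectionFactory factory = new ActiveMQConnectionFactory(BROKER_URL);
        Connection conn = factory.createConnection();
        conn.start();
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(session.createQueue(QUEUE));
        consumer.setMessageListener(writeToDb);
    }
}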