Part of my project requires that we maintain stats of our customers' products. More or less, we want to show our customers how often their products have been viewed on the site.
Therefore we want to create some form of Product Impressions Counter. I do not just mean a counter for when we land on the specific product page, but for when the product appears in search results and in our product directory lists.
I was thinking that after calling the DB I would extract the specific product ids and pass them to a service that would then insert them into the stats tables. Another option is some form of singleton buffered writer which writes to the DB after it reaches a certain size.
Has anyone ever encountered this in their projects and have any ideas that they would like to share?
And / or does anyone know of any framework or tools that could aid this development?
Any input would be really appreciated.
As long as you don't have performance problems, do not over-engineer your design. On the other hand, depending on how big the site is, it seems you are going to have performance problems due to the huge amount of writes.
I think real-time updates will have a huge performance impact. Also, it is very likely that you will update the same data multiple times in a short period of time. Another thing is that, although interesting, storing these statistics is not mission-critical, and it shouldn't affect normal system operation. Final thought: inconsistencies and minor inaccuracies are IMHO acceptable in this use case.
Taking all this into account, I would temporarily hold the statistics in memory and flush them periodically, as you've suggested. This has the additional benefit of merging events for the same product - if between two flushes some product was visited 10 times, you will only perform one update, not 10.
Technically, you can use a properly synchronized singleton with a background thread (a lot of handcrafting) or some intelligent cache with write-behind support.
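A minimal sketch of that buffer-and-merge idea using only the JDK; `ImpressionBuffer`, `recordView`, and `flush` are illustrative names, and the actual DB write is left out:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Impressions are accumulated in memory and merged per product,
// so 10 views between flushes become a single DB update.
public class ImpressionBuffer {
    private final ConcurrentHashMap<Long, LongAdder> counts = new ConcurrentHashMap<>();

    // Called from search results, directory lists, product pages, etc.
    public void recordView(long productId) {
        counts.computeIfAbsent(productId, id -> new LongAdder()).increment();
    }

    // Called periodically, or when the buffer grows past a threshold;
    // returns the merged deltas to write to the stats table in one batch.
    public Map<Long, Long> flush() {
        Map<Long, Long> snapshot = new HashMap<>();
        counts.forEach((id, adder) -> {
            long n = adder.sumThenReset();
            if (n > 0) snapshot.put(id, n);
        });
        return snapshot;
    }
}
```

A ScheduledExecutorService (or a size check inside `recordView`) would trigger the flush, and the returned map would be written to the stats table as one batched statement.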
I want to know whether it is useful to use ConcurrentHashMap for user data. I have the user data saved in a MySQL database and retrieve it when a user logs in (or when someone edits the user). Every time the user goes to another page, this user data is refreshed. Should I keep the data in a map in my application and save changes there, with the database in the background, or should I fetch it directly from the db each time? I want to make the application as performant as possible.
What you are describing is a cache. Suppose the calls to the database cost a lot because there is a lot of info to load, or the query used to extract the data is complex and takes a lot of time. This is where the cache data structure comes into play. It is basically in-memory storage, which is much faster than querying the database, because the data is already loaded in memory.
The process of filling the cache takes about the same time as querying the db for the data (generally more, but in the same order). So it makes sense to use a cache only if it brings a benefit in time. There is a compromise, though: speed vs freshness of data. Depending on your use-case you must find the right compromise between those two, and you should then work out whether it is really worthwhile.
As you describe it, i.e. user updates that need to be saved and displayed, using a cache seems a bit of an overkill IMO, unless you have a lot of registered users and many of them are using the system simultaneously. If you decide to use one, keep in mind the concurrency issues that may arise. ConcurrentHashMap saves you from many hazards, but with a performance compromise.
If the performance is the priority I think you should keep the logged users in memory.
That way, the read requests would be fast as you would not need to query the database. However, you would need to update the map if any of the logged users would be somehow edited.
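A rough sketch of that approach, with a String standing in for the full user record (names are illustrative, and the DB calls are omitted):

```java
import java.util.concurrent.ConcurrentHashMap;

// Logged-in users kept in memory: reads never hit the database, and an
// edit both updates the DB (elsewhere) and refreshes the in-memory copy.
public class LoggedUsers {
    // user id -> display name (stands in for the full user record)
    private final ConcurrentHashMap<Integer, String> online = new ConcurrentHashMap<>();

    public void login(int id, String name) { online.put(id, name); }   // loaded from DB on login
    public void logout(int id)             { online.remove(id); }
    public String get(int id)              { return online.get(id); }  // fast read, no DB hit

    // After persisting an edit to the DB, refresh the in-memory copy
    // only if that user is currently logged in.
    public void onUserEdited(int id, String newName) {
        online.computeIfPresent(id, (k, old) -> newName);
    }
}
```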
A human cannot tell the difference between a 1ms delay and a 50ms delay. So it is overkill to optimize beyond "good enough".
MySQL already does a flavor of caching; your addition of another cache may actually slow down the response time.
In Java code I am trying to fetch 3500 rows from the DB (Oracle). It takes almost 15 seconds to load the data. I have also tried storing the result in a cache and retrieving it from there. I am using a simple SELECT statement and displaying 8 columns from a single table (no joins). I use a List to hold the data from the DB and use it as the source for a DataTable. I have also considered the hardware side, such as RAM capacity, storage, network speed, etc.; it comfortably exceeds the minimum requirements. Can you help me make it quicker (it shouldn't take more than 3 seconds)?
Have you implemented proper indexing on your tables? I don't like to ask this, since it is a very basic way of optimizing tables for queries and you mention that you have already tried several things. One workaround that works for me: if the purpose of the query is to display the results, the code can be designed so that it immediately displays the initial data while still loading more. This implies using a separate thread for loading and a separate thread for displaying.
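The loading/displaying split could be sketched like this, with the DB fetch simulated by a loop (in real code the loader thread would iterate a JDBC ResultSet); class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// One thread streams rows in; another drains them for display as they
// arrive, so the user sees the first rows before the full load finishes.
public class StreamingLoad {
    private static final String POISON = "__END__";  // end-of-data marker

    public static List<String> loadAndDisplay(int totalRows) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        Thread loader = new Thread(() -> {
            try {
                for (int i = 0; i < totalRows; i++) {
                    queue.put("row-" + i);   // rs.next() + row mapping in real code
                }
                queue.put(POISON);           // signal end of data
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        loader.start();

        // "Display" side: rows can be rendered as soon as they arrive,
        // instead of waiting for the full 3500-row list.
        List<String> displayed = new ArrayList<>();
        for (String row = queue.take(); !row.equals(POISON); row = queue.take()) {
            displayed.add(row);
        }
        loader.join();
        return displayed;
    }
}
```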
It is most likely that the core problem is that you have one or more of the following:
a poorly designed schema,
a poorly designed query,
a badly overloaded database, and / or
a badly overloaded / underprovisioned network connection between the database and your client.
No amount of changing the client side (Java) code is likely to make a significant difference (i.e. a 5-fold increase) ... unless you are doing something crazy in the way you are building the list, or the bottleneck is in the display code not the retrieval.
You need to use some client-side and server-side performance tools to figure out whether the real bottleneck is the client, the server or the network. Then use those results to decide where to focus your attention.
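On the client side, even a crude timing wrapper around each phase (query execution, row fetching, list building, display) will show where the 15 seconds actually go; the runnable phases here are stand-ins for your real code:

```java
// Wrap each phase with System.nanoTime to attribute the elapsed time:
//   long queryMs = PhaseTimer.timeMillis(() -> runQuery());
//   long buildMs = PhaseTimer.timeMillis(() -> buildList());
// If the query dominates, look at the server/network; if list building
// dominates, look at the client code.
public class PhaseTimer {
    public static long timeMillis(Runnable phase) {
        long start = System.nanoTime();
        phase.run();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```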
I'm on a project asking for high performance... and I was told to use as few database calls as possible, and to keep more objects in JVM memory. Right.
So... It didn't shock me at first, but now I'm questioning the approach.
How can I know which is best ?
On the one hand I would have :
- static Map<id1, id2>
- static Map<id2, ObjectX>

ObjectX:
- id2
- Map<id1, ObjectY>

ObjectY:
- id1
So basically, this data structure would help me to get an ObjectY from an id1. And I would be able to send back the whole ObjectX as well when needed.
Note that the structure is filled by a service call (A). Then, updates to ObjectY objects can happen through another service (B). Finally, another service can send back an ObjectX (C). That makes three services using the data.
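For concreteness, the in-memory structure above might be sketched like this (class and field names follow the question; everything else, including String ids, is illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// id1 -> id2, and id2 -> ObjectX, where each ObjectX holds its own
// id1 -> ObjectY map, so an ObjectY can be reached from an id1 alone.
public class InMemoryStore {
    public static class ObjectY {
        public final String id1;
        public String value;
        public ObjectY(String id1, String value) { this.id1 = id1; this.value = value; }
    }

    public static class ObjectX {
        public final String id2;
        public final Map<String, ObjectY> ys = new ConcurrentHashMap<>();
        public ObjectX(String id2) { this.id2 = id2; }
    }

    private static final Map<String, String>  ID1_TO_ID2 = new ConcurrentHashMap<>();
    private static final Map<String, ObjectX> ID2_TO_X   = new ConcurrentHashMap<>();

    // Service A: fill the structure
    public static void put(ObjectX x, ObjectY y) {
        ID2_TO_X.put(x.id2, x);
        x.ys.put(y.id1, y);
        ID1_TO_ID2.put(y.id1, x.id2);
    }

    // Service B: reach an ObjectY via its id1 (to update it)
    public static ObjectY findY(String id1) {
        String id2 = ID1_TO_ID2.get(id1);
        if (id2 == null) return null;
        ObjectX x = ID2_TO_X.get(id2);
        return x == null ? null : x.ys.get(id1);
    }

    // Service C: send back the whole ObjectX
    public static ObjectX findX(String id2) { return ID2_TO_X.get(id2); }
}
```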
On the other hand, I could have :
- db table for ObjectY T1
- db join table associating id1s and id2s T2
- db table for Object X T3
Service A would make an insert in the tables.
Service B would make an update in table T1
Service C would make a join between T2 and T1 to get all ObjectY objects for an ObjectX
In my opinion, the db version is more flexible... I am unsure about the performance, but I would say the db version shouldn't be slower than the "memory" version. Finally, doesn't the "memory" version carry some risks?
I hope it's obvious to some of you which version I should choose and why... I'm not hoping for a debate; I'm looking for ways to know what's quicker...
Retrieving an object stored in memory will take on the order of hundreds of nanoseconds (less if it has been accessed recently and so is in a CPU cache). Of course this latency will vary based on your platform, but it is a ballpark figure for comparison. Retrieving the same information from a database - again it depends on many factors, such as whether the database is on the same machine - will take on the order of milliseconds at least, i.e. tens of thousands of times slower.
Which is quicker - you will need to be more specific, which operations will you be measuring for speed? But the in-memory version will be faster in pretty much all cases. The database version gives different advantages - persistence, access from different machines, transactional commit / rollback - but speed is not one of them, not compared with an in-memory calculation.
Yes, the in-memory version has risks - basically if the machine is powered down (or the process exits for whatever reason...memory corruption, uncaught exception) then the data will be lost (i.e. in-memory solution does not have 'persistence' unlike a database).
What you are doing is building a cache. And it's a hugely popular and proven technique, with many implementations ranging from simple Map usage to full vendor products, support for caching across servers, and all sorts of bells and whistles.
And, done well, you should indeed get all sorts of performance improvements. But the main challenge in caching: how do you know when your cache entry is "stale", i.e. the DB has content that has changed, but your cache doesn't know about it?
You might have an obvious answer here. You might be caching stuff that actually won't change. Cache invalidation is the proper term here - when to refresh it because you know it's stale and you need fresh content.
I think all the trade offs that you rightly recognise are ones you personally need to weigh up, with the extra confidence that you're not "missing something".
One final thought - will you have enough memory to cache everything? Maybe you need to limit it, e.g. to the top 100,000 most-requested objects. Looking at 3rd-party caching tools like EHCache or Guava could be useful:
https://code.google.com/p/guava-libraries/wiki/CachesExplained
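If a full caching library is overkill, a size-bounded LRU cache can be sketched with just the JDK's LinkedHashMap; EHCache and Guava offer richer versions of the same idea (expiry, statistics, loaders):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Size-bounded cache: once maxEntries is exceeded, the least recently
// accessed entry is evicted (accessOrder=true enables LRU ordering).
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true);   // accessOrder=true -> LRU, not insertion order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

Note this is not thread-safe on its own; it would need external synchronization (e.g. `Collections.synchronizedMap`) in a web application.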
I'm in the early stages of a web project which will require working with arrays containing around 500 elements of a custom object type. Objects will likely contain between 10 and 40 fields (based on user input), mostly booleans, strings and floats. I'm gonna use PHP for this project, but I'm also interested to know how to treat this problem in Java.
I know that "premature optimization is the root of all evil", but I think I need to decide now how I handle those arrays. Do I keep them in the Session object, or do I store them in the database (MySQL) and keep just a minimal set of keys in the session? Keeping data in the session would make the application faster, but as visitor numbers grow I risk using up too much memory. On the other hand, reading from and writing to the database all the time will degrade performance.
I'd like to know where the line is between those two approaches. How do I decide when it's too much data to keep inside session?
When I face a problem like this, I try to estimate the size of the per-user data that I want fast access to.
In your case, suppose for example you have 500 elements with 40 fields, each field averaging 50 bytes (averaging across texts, numbers, dates, etc.). That means keeping about 1MB in memory per user for this storage, so about 1GB for every 1000 users just for this cache.
Depending on your server resource availability you may find bottlenecks: 1000 users consume CPU, memory, DB, disk accesses; so in this scenario, is 1GB the problem? If yes, keep the data in the DB; if not, keep it in memory.
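The estimate above as a quick back-of-envelope calculation (all figures are the assumed averages from this answer, not measurements):

```java
// 500 elements x 40 fields x ~50 bytes per field, then scaled per user count.
public class SessionSizeEstimate {
    public static long bytesPerUser(int elements, int fields, int bytesPerField) {
        return (long) elements * fields * bytesPerField;
    }

    public static long bytesForUsers(int users, long perUser) {
        return users * perUser;
    }
}
```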
Another option is to use an in-memory DB or a distributed cache solution that does it all for you, at some cost:
architectural complexity
possibly licence costs
I would be surprised if you had that amount of unique data for each user. Ideally, some of this data would be shared across users, and you could have some kind of application-level cache that stores the most recently used entries, and transparently fetches them from the database if they're missing.
This kind of design is relatively straightforward to implement in Java, but somewhat more involved (and possibly less efficient) with PHP since it doesn't have built-in support for application state.
I am currently thinking about Caching Strategies and more importantly avoiding any duplication of data inside the cache. My query is kind of language agnostic but very much programming related.
My question is regarding the efficient caching of paged or filtered data, but more than that, distributed caching. For the latter I have decided to go with memcached, more specifically a .NET port of it. I have seen another commercial option in the form of NCache, but memcached seems perfectly acceptable to me and is apparently used on Facebook, MySpace, etc...
My query then concerns a strategy by which you can keep objects in the cache and also reference them from paged data. If I have 100 items and I page them, I could cache the ids of products 1-10 and cache each product separately. If I were to sort the items descending, items 1-10 would be different products, so I would not want to store the actual objects each time the paging/sorting/filtering changed, but instead store the ids of the objects so I could then perform a transactional lookup in the database for any that do not already exist in the cache or are invalid.
My initial idea was this for a cache key.
paged_<pageNumber><pageSize><sort><sortDirection>[<filter>]
I would then iterate through the cache keys and remove any which start with "paged_". My question ultimately is whether anyone knows of any patterns or ideas about strategies for caching such kinds of data, such as paged data, while also making sure that objects are not cached more than once.
memcached is native code and would not have a problem clearing the cache in the way I have stated above, but it is an obvious fact that the more items in the cache, the more time it would take. I am interested to know if anyone has a solution or theory for this type of problem that is currently being employed. I am sure there will be one. Thank you for your time.
TIA
Andrew
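One way to sketch the two-level scheme described in the question: page keys hold only ids, each object is cached once under its own key, and invalidation drops only the page entries. Plain maps stand in for memcached here, the names are illustrative, and underscores were added to the key format for readability:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PagedCache {
    private final Map<String, List<Integer>> pageIds = new HashMap<>();
    private final Map<Integer, String> products = new HashMap<>();  // id -> product (String for brevity)

    public static String pageKey(int page, int size, String sort, String dir, String filter) {
        return "paged_" + page + "_" + size + "_" + sort + "_" + dir
                + (filter == null ? "" : "_" + filter);
    }

    public void cachePage(String key, List<Integer> ids, Map<Integer, String> loaded) {
        pageIds.put(key, ids);
        loaded.forEach(products::putIfAbsent);   // each product cached at most once
    }

    // Ids on the page missing from the object cache: these would need
    // one batched database lookup, not one query per product.
    public List<Integer> missingIds(String key) {
        List<Integer> missing = new ArrayList<>();
        for (int id : pageIds.getOrDefault(key, List.of())) {
            if (!products.containsKey(id)) missing.add(id);
        }
        return missing;
    }

    // Invalidation: drop every "paged_" entry, keep the objects.
    public void clearPages() {
        pageIds.keySet().removeIf(k -> k.startsWith("paged_"));
    }
}
```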
I once tried what I think is a similar caching strategy and found it unwieldy. I eventually ended up just caching the objects that make up the pages and generating the pages for every request. 10 cache hits to construct a page is going to give (hopefully) sub-second response time, pretty much instant to the users of your service.
If you must cache entire pages (I think of them as result sets) then perhaps you could run the user request through a hash and use that as your cache key. It's a hard problem to illustrate with a concrete example or code (for me at least).
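The hash idea could look something like this: normalize the request parameters into one string and use a digest of it as the result-set cache key (the parameter set, prefix, and names are illustrative; SHA-256 comes from the standard MessageDigest API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Identical requests hash to the same key, so a cached result set can
// be reused; any change in paging/sorting/filtering yields a new key.
public class RequestKey {
    public static String of(int page, int size, String sort, String dir, String filter) {
        String normalized = page + "|" + size + "|" + sort + "|" + dir + "|"
                + (filter == null ? "" : filter);
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("resultset_");
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);   // SHA-256 is always available
        }
    }
}
```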