Storing in Hashtables - java

I am working on an application that might potentially receive thousands and thousands of messages (perhaps millions). I want to store these messages in a hashtable for easy lookup, since each message has an id. Is this a good idea? If not, what's the best data structure or way to go about this? Thank you.

Is this a good idea?
Keeping an unbounded amount of data in an in-memory data structure is a bad idea. You will eventually run out of memory, and your application will crash.
If you are able to discard old "messages" so that you can place a reasonable bound on the amount of memory the application needs, then this could be a viable solution.
However, as the comments point out, there are other solutions (distributed memory caches, SQL databases, NoSQL databases, etcetera) that could well be better, depending on how much data there is and how fast access really needs to be.

Using a Map (data stored in main memory) is simple, but it should be the least preferable and least realistic option, as you need to implement/reinvent the logic for data expiration, clustering, etc. yourself (see the sketch after this list).
Using caching frameworks (data stored in main memory): choose this only if you have an idea of how much data there is and how long it needs to reside in the cache (i.e., when it can expire and be removed); this option limits the data size to the maximum size of the JVM heap space.
Using a database (data stored on disk): this is the ideal option for holding millions of records, but it comes at a cost, as disk operations take more time than in-memory operations.
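To illustrate the first point, here is a minimal, hypothetical sketch (not from the original answers) of what "reinventing expiration yourself" looks like with a plain Map; the Message payload type and the TTL policy are assumptions for the example:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the "plain Map" option: expiration has to be hand-rolled.
class ExpiringMessageMap {
    // Hypothetical message payload plus the time it was inserted.
    private record Entry(String payload, Instant insertedAt) {}

    private final Map<String, Entry> byId = new ConcurrentHashMap<>();
    private final Duration ttl;

    ExpiringMessageMap(Duration ttl) {
        this.ttl = ttl;
    }

    void put(String id, String payload) {
        byId.put(id, new Entry(payload, Instant.now()));
    }

    String get(String id) {
        Entry e = byId.get(id);
        return e == null ? null : e.payload();
    }

    // This housekeeping is what caching frameworks and databases give you
    // for free; here it has to be scheduled, tuned, and tested by hand.
    void evictExpired() {
        Instant cutoff = Instant.now().minus(ttl);
        for (Iterator<Entry> it = byId.values().iterator(); it.hasNext(); ) {
            if (it.next().insertedAt().isBefore(cutoff)) {
                it.remove();
            }
        }
    }
}
```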


Pure Java alternative to database / cache for storing records

I have created an application sold to customers, some of which are hardware manufacturers with fixed constraints (slow CPU). The app has to be in Java, so that it can be easily installed as a single package.
The application is multithreaded and maintains audio records. In this particular case all we have is INSERT SOMEDATA FOR RECORD, each record representing an audio file (and this can be done by different threads), and then later on we have SELECT SOMEDATA WHERE IDS IN (x, y, z) issued by a single thread; the third step is that we actually DELETE all the data in this table.
The main constraint is CPU: a slow, single CPU. Memory is also a constraint, but only in that the application is designed to process an unlimited number of files, so even with lots of memory it would eventually run out if everything were stored in memory rather than on disk.
In my Java application I started off using the H2 database to store all my data. But the software has to run on some slow single-CPU servers, so I want to reduce the CPU cycles used, and one area I want to look at again is the database.
In many cases I am inserting data into the database simply to keep it off the heap (otherwise I would run out of memory); later on we retrieve the data, and we never have to UPDATE it.
So I considered using a cache like ehCache but that has two problems:
It doesn't guarantee the data will not be thrown away (If the cache gets full)
I can only retrieve records one at a time, whereas with a relational database I can retrieve a batch of records; this looks like a potential bottleneck.
What is an alternative that solves these issues ?
You want to retrieve records fast and in batches, you don't want to lose any data, you don't need optimized queries or updates, and you want to use CPU and memory resources as effectively as possible:
Why don't you simply store your records in a file? The operating system uses any free memory for caching. So when you access your file frequently, the OS will do its best to keep as much of its content as possible in memory. The OS does this job anyway, so this type of caching costs you no additional CPU and not a single line of code.
The only scenarios where it could make sense to invest more in optimization would be:
a) Your process or other processes make heavy use of the file system and pollute the file cache
b) Serialization / deserialization is too expensive
In case of a):
Define your priorities. An explicit cache (on-heap or off-heap) can help you keep some content of selected files in memory. But this memory will no longer be available for the OS's file cache. So while you speed up access to one file, you potentially slow down access to other files.
In case of b):
Measure performance first, before you optimize anything. Usually disk access is the bottleneck - that's something you cannot change without replacing hardware. If you still want to optimize (e.g. because GC eats up CPU due to a very high number of temporarily created objects - I guess with only one core the serial GC will be in use) then I suggest taking a closer look at Google FlatBuffers.
You started with the most complex solution for your problem, a database. I suggest starting at the other end of the spectrum and keeping it as simple as possible.
UPDATE:
The question has been edited in the meanwhile and requirements have changed. A new requirement is now that it has to be possible to read selected records by IDs.
Possible extensions:
Store each record in its own file and use the key as the file name (sketched below)
Store all records in one file and use a file-based HashMap implementation such as MapDB's HTreeMap
Independent from the chosen extension, the operating system's file cache will do its best to hold as much content as possible in main memory.
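As an illustration of the first extension (not part of the original answer), here is a minimal sketch of a file-per-record store using only java.nio.file; the directory layout and method names are assumptions for the example:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Each record is written to its own file named after its ID.
// The operating system's file cache keeps frequently read files in memory.
class FileRecordStore {
    private final Path dir;

    FileRecordStore(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    // The "INSERT SOMEDATA FOR RECORD" step.
    void put(String id, byte[] data) throws IOException {
        Files.write(dir.resolve(id), data);
    }

    // Batch read by IDs, matching the "SELECT ... WHERE IDS IN (x, y, z)" step.
    List<byte[]> getAll(List<String> ids) throws IOException {
        List<byte[]> result = new ArrayList<>();
        for (String id : ids) {
            result.add(Files.readAllBytes(dir.resolve(id)));
        }
        return result;
    }

    // The final "DELETE all" step: remove every record file.
    void clear() throws IOException {
        try (var files = Files.list(dir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                Files.delete(p);
            }
        }
    }
}
```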
Some ideas that might help:
You say that you're running on a single CPU and want to check a substitute for H2. So H2 "consumes" a lot of CPU power and the application is said to be "slow". But what if it's because of a slow disk, not the CPU? After all, databases store their data on disk, and disks can be slow. If you want to check this theory, map the database directory to a RAM-backed drive (in Linux it's an easy task) and measure again with the same CPU.
If you come to the conclusion that H2 is indeed CPU-intensive for your use cases, it may be worth investing some time in optimizing the queries; this is much cheaper than replacing the database.
Now, if you can't stay with H2, consider Lucene, which is really optimized for this "append-only" use case (I understand that you have an append-only flow because you said "later on we retrieve the data, we never have to UPDATE the data"). Having said that, Lucene also has its own threads that handle indexing, so some CPU overhead is expected anyway. However, chances are that Lucene will be faster for this use case. The price is that you won't get "easy" queries, because Lucene doesn't implement the relational model (maybe partially because of that it should be faster); in particular you won't have JOINs or transaction management. It's still possible to query by conditions from a single table as in an RDBMS, and you don't have to fetch "top hits" as in a typical search use case.
From your question and the comments made on Mark Bramnik's answer I understood this:
CPU constraint: very slow cpu, solution should not be cpu intensive
Memory constraint: Not all data can be in memory
Disk constraint: very slow disk, solution should not read/write lots of data from disk
These are very strict constraints. Usually you "trade" CPU vs memory or memory vs disk. In your case all three are constrained. You mentioned you looked at ehCache; however, I think this solution (and possibly others such as memcached) is not more lightweight than H2.
One solution you could try is MappedByteBuffer. This class makes it possible to have parts of a file in memory and will swap those parts in and out when needed. But this comes at a cost: it is not an easy beast to tame. You will need to write your own algorithm to locate the data you need. Please consider how much time it will take you to get it working vs the additional cost of a bigger machine. Sometimes better hardware is the solution.
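For illustration only (not part of the original answer), a minimal sketch of memory-mapping a file via FileChannel and MappedByteBuffer; the file name, record size, and offset scheme are assumptions, and locating records within the mapped region remains your responsibility:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedRecords {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("records.bin");   // hypothetical data file
        long size = 64L * 1024 * 1024;        // 64 MB mapped region, for example

        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {

            // Map the file into memory; the OS pages parts in and out as needed.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, size);

            // Write a fixed-size record at a computed offset...
            int recordSize = 128;
            int recordId = 42;
            buffer.position(recordId * recordSize);
            buffer.putLong(recordId);

            // ...and read it back later by seeking to the same offset.
            buffer.position(recordId * recordSize);
            long storedId = buffer.getLong();
            System.out.println("record id = " + storedId);
        }
    }
}
```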
Relational databases like Oracle are decades old (41 years); can you imagine how many CPU cycles were available back then? They are based on research from 1970 and are well understood by professionals, tested, documented, reliable, consistent (checksums), maintainable (backups with zero data loss), performant if used correctly (all kinds of indexes), securely accessible over the network, scalable, etc., but apparently Not Invented Here.
Nowadays there are many free open-source databases like PostgreSQL that have very modest requirements, leave room to implement new requirements in the future (which are hard to predict), and with some effort are interchangeable with other databases (JDBC, JPA).
Yes, there is some overhead, but hardware is typically cheaper than changing your architecture late in the project, and CPU cycles are not an expensive resource anymore (think Raspberry Pi, smartphones, etc.).

Web Application Database or Maps for performance

I want to know whether it is useful to use ConcurrentHashMaps for user data. I have the user data saved in a MySQL database and retrieve it when a user logs in (or someone edits the user). Every time the user goes to another page, this user data is refreshed. Should I use a map and save changes from my application there, keeping the database in the background, or should I read the data directly from the DB each time? I want to make the application as performant as possible.
What you are describing is a cache. Suppose the calls to the database cost a lot because there is a lot of info to load, or the query used to extract the data is complex and takes a lot of time. This is where the cache data structure comes into play. It is basically in-memory storage, which is much faster than querying the database, because the data is already loaded in memory.
Filling the cache takes about the same time as querying the DB for the data (generally a bit more, but of the same order). So it makes sense to use a cache only if it saves time overall. There is a compromise, though: speed vs freshness of data. Depending on your use case you must find the right balance between the two, and only then will you know whether a cache is really worthwhile.
As you describe it, i.e. user updates that need to be saved and displayed, using a cache seems a bit of an overkill IMO, unless you have a lot of registered users and many of them use the system simultaneously. If you decide to use one, keep in mind the concurrency issues that may arise. Concurrent hash maps save you from many hazards, but at some performance cost.
If performance is the priority, I think you should keep the logged-in users in memory.
That way, read requests would be fast, as you would not need to query the database. However, you would need to update the map whenever any of the logged-in users is edited.
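A minimal sketch of that idea, assuming a hypothetical User type and a cache keyed by user id (an illustration, not code from the answer):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical user type; stands in for whatever is stored per user.
record User(long id, String name) {}

class LoggedInUserCache {
    private final ConcurrentMap<Long, User> byId = new ConcurrentHashMap<>();

    // On login: load once from the database and keep in memory.
    void onLogin(User user) {
        byId.put(user.id(), user);
    }

    // On page view: serve from memory, no database query.
    User get(long userId) {
        return byId.get(userId);
    }

    // On edit: write to the database first, then refresh the cached copy.
    void onEdit(User updated) {
        byId.put(updated.id(), updated);
    }

    // On logout: drop the entry so the map does not grow without bound.
    void onLogout(long userId) {
        byId.remove(userId);
    }
}
```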
A human cannot tell the difference between a 1ms delay and a 50ms delay. So it is overkill to optimize beyond "good enough".
MySQL already does a flavor of caching; your addition of another cache may actually slow down the response time.

Keeping data in database or in session

I'm in the early stages of a web project which will require working with arrays containing around 500 elements of a custom object type. Objects will likely contain between 10 and 40 fields (based on user input), mostly booleans, strings and floats. I'm going to use PHP for this project, but I'm also interested in how to treat this problem in Java.
I know that "premature optimization is the root of all evil", but I think I need to decide now how to handle those arrays. Do I keep them in the session object, or do I store them in the database (MySQL) and keep just a minimal set of keys in the session? Keeping data in the session would make the application faster, but when visitor numbers start growing I risk using up too much memory. On the other hand, reading from and writing to the database all the time will degrade performance.
I'd like to know where the line is between those two approaches. How do I decide when it's too much data to keep inside session?
When I face a problem like this I try to estimate the size of per user data that I want to keep fast.
In your case, suppose for example that you have 500 elements with 40 fields each, every field averaging 50 bytes (averaging over texts, numbers, dates, etc.). That is about 1 MB per user for this storage, so you will need about 1 GB for every 1,000 users just for this cache.
Depending on your server's resources you can find the bottleneck: 1,000 users consume CPU, memory, DB capacity and disk accesses; so, in this scenario, is 1 GB the problem? If yes, keep the data in the DB; if not, keep it in memory.
Another option is to use an in-memory DB or a distributed cache solution that does it all for you, at some cost:
architectural complexity
possibly licence costs
I would be surprised if you had that amount of unique data for each user. Ideally, some of this data would be shared across users, and you could have some kind of application-level cache that stores the most recently used entries, and transparently fetches them from the database if they're missing.
This kind of design is relatively straightforward to implement in Java, but somewhat more involved (and possibly less efficient) with PHP since it doesn't have built-in support for application state.
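As a rough illustration of that design in Java (not the answer's code): a small LRU cache that keeps the most recently used entries and loads misses through a supplied function, e.g. a database query; the loadFromDatabase name in the usage comment is hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Simple LRU cache: keeps at most maxEntries values, evicting the least
// recently accessed one; misses are loaded through the supplied loader
// (e.g. a database query) and cached transparently.
class LruCache<K, V> {
    private final Map<K, V> map;
    private final Function<K, V> loader;

    LruCache(int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order follow recency of access.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized V get(K key) {
        return map.computeIfAbsent(key, loader);
    }
}

// Usage sketch: loadFromDatabase is a hypothetical DB lookup.
// LruCache<Long, UserData> cache = new LruCache<>(10_000, id -> loadFromDatabase(id));
// UserData data = cache.get(42L);
```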

When is BIG, big enough for a database?

I'm developing a Java application that has performance at its core. I have a list of some 40,000 "final" objects, i.e., initialization input data of 40,000 vectors. This data is unchanged throughout the program's run. I am always performing lookups against a single ID property to retrieve the proper vectors. Currently I am using a HashMap over a sub-sample of 1,000 vectors, but I'm not sure it will scale to production.
When is BIG actually big enough for the use of a DB?
One more thing: an SQLite DB is a viable option, as no concurrency is involved, so I guess the "threshold" for DB use is perhaps lower.
I think you're asking whether a HashMap with 40,000 entries will be okay. The answer is yes - unless you really don't have enough memory, that should be absolutely fine. If you're writing a performance-sensitive app, then putting a large amount of fast memory in the machine running the app is likely to be an efficient way of boosting performance anyway.
There won't be very much overhead for each HashMap entry, so if you've got enough space to store the objects themselves in memory, it's unlikely that the overhead of the map would cause a problem.
Is there any reason why you can't just test this with a reasonable amount of data?
If you really have no more requirements than:
Read data at start-up
Put data in a map by a single ID (no need for joins, queries against different fields, substring matches etc)
Fetch data from map
... then using a full-blown database would be a huge amount of overkill, IMO.
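As a minimal sketch of those three steps (DataVector and the loading step are hypothetical stand-ins, not from the question):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical immutable domain object holding one of the 40,000 vectors.
record DataVector(long id, double[] values) {}

class VectorLookup {
    private final Map<Long, DataVector> byId = new HashMap<>();

    // Read data at start-up and index it by the single ID property.
    VectorLookup(List<DataVector> allVectors) {
        for (DataVector v : allVectors) {
            byId.put(v.id(), v);
        }
    }

    // Fetch data from the map: O(1) on average, no database involved.
    DataVector lookup(long id) {
        return byId.get(id);
    }
}
```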
As long as you're loading the data set into memory at the beginning of the program, keeping it there, and you don't have any complex queries, some sort of serialization/deserialization seems more feasible than a full-blown database.
You could start using a DB with as few as 100 entries (or fewer). There is no general rule for when the amount of data is large enough to store in a database. It's more a question of whether storing the data in a database gives you any benefit (performance boost, easier programming, more flexible options for your users).
When the benefits are greater than the cost of implementation, put it in a database.
There is no set size for a Collection vs a Database. It depends heavily on what you want to do with the data. Size is less important.
You can have a Map with a billion entries.
There's no such thing as 'big enough for a database'. The question is whether there are enough advantages in using a database to overcome the costs.
Having said that, 40,000 isn't 'big' ;-) Unless the objects are huge or you have complex query requirements I would start with an in-memory implementation. But if you expect to scale this number up over time it might be better to use the database from the beginning.
One option that you might want to consider is the Oracle Berkeley DB Java Edition library. It's a simple JAR file that can read/write data to persistent storage. Because of its small footprint and ease of use, it's used for applications running on small to very large data sets. It's designed to be linked into the application, so that it's embedded and doesn't require a complex client/server installation or protocol stack.
What's even better is that it's extremely scalable (which works well if you end up with larger data sets than you expect), is very fast, and supports both a Java Collections API and a Direct Persistence Layer API (POJO-like). So you can use it seamlessly with Java Collections.
Berkeley DB Java Edition was designed specifically with Java application developers in mind. It's designed to be simple to use, lightweight in terms of resources required, but very fast, scalable and reliable.
You can find information more about Oracle Berkeley DB Java Edition here
Regards,
Dave

Why use your application-level cache if database already provides caching?

Modern databases provide caching support. Most ORM frameworks cache retrieved data too. Why is this duplication necessary?
Because to get the data from the database's cache, you still have to:
Generate the SQL from the ORM's "native" query format
Do a network round-trip to the database server
Parse the SQL
Fetch the data from the cache
Serialise the data to the database's over-the-wire format
Deserialize the data into the database client library's format
Convert the database client library's format into language-level objects (i.e. a collection of whatevers)
By caching at the application level, you don't have to do any of that. Typically, it's a simple lookup of an in-memory hashtable. Sometimes (if caching with memcache) there's still a network round-trip, but all of the other stuff no longer happens.
Here are a couple of reasons why you may want this:
An application caches just what it needs so you should get a better cache hit ratio
Accessing a local cache will probably be a couple of orders of magnitude faster than accessing the database due to network latency - even with a fast network
Scaling read-write transactions using a strongly consistent cache
Scaling read-only transactions can be done fairly easily by adding more Replica nodes.
However, that does not work for the Primary node, since that can only be scaled vertically.
And that's where a cache comes into play. For read-write database transactions that need to be executed on the Primary node, the cache can help you reduce the query load by directing it to a strongly consistent cache, like the Hibernate second-level cache.
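As a hedged illustration (the exact setup depends on the Hibernate version and the cache provider you plug in), enabling the second-level cache usually comes down to marking entities as cacheable, roughly along these lines; the Post entity here is just an example:

```java
// jakarta.persistence on Hibernate 6+; older versions use javax.persistence.
import jakarta.persistence.Cacheable;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// Sketch only: these annotations mark the entity as eligible for the
// second-level cache; a cache provider and the properties
// hibernate.cache.use_second_level_cache / hibernate.cache.region.factory_class
// still have to be configured for the specific Hibernate version in use.
@Entity
@Cacheable
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class Post {

    @Id
    private Long id;

    private String title;

    // getters and setters omitted for brevity
}
```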
Using a distributed cache
Storing an application-level cache in the memory of the application is problematic for several reasons.
First, the application memory is limited, so the volume of data that can be cached is limited as well.
Second, when traffic increases and we want to start new application nodes to handle the extra traffic, the new nodes would start with a cold cache, making the problem even worse as they incur a spike in database load until the cache is populated with data.
To address this issue, it's better to have the cache running as a distributed system, like Redis. This way, the amount of cached data is not limited by the memory size on a single node since sharding can be used to split the data among multiple nodes.
And, when a new application node is added by the auto-scaler, the new node will load data from the same distributed cache. Hence, there's no cold cache issue anymore.
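For illustration, a minimal look-aside lookup against Redis using the Jedis client; the key format, TTL, and loadFromDatabase call are hypothetical assumptions, not part of the answer:

```java
import redis.clients.jedis.Jedis;

public class DistributedCacheExample {

    public static void main(String[] args) {
        // Any application node talks to the same Redis instance (or cluster),
        // so a freshly started node never begins with a cold local cache.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String key = "user:42";            // hypothetical cache key
            String value = jedis.get(key);

            if (value == null) {
                value = loadFromDatabase(42);  // hypothetical DB lookup on a miss
                jedis.setex(key, 3600, value); // cache the result for one hour
            }
            System.out.println(value);
        }
    }

    // Stand-in for the real database query.
    private static String loadFromDatabase(long id) {
        return "{\"id\":" + id + ",\"name\":\"example\"}";
    }
}
```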
Even if a database engine caches data, indexes, or query result sets, it still takes a round-trip to the database for your application to benefit from that cache.
An ORM framework runs in the same space as your application. So there's no round-trip. It's just a memory access, which is generally a lot faster.
The framework can also decide to keep data in cache as long as it needs it. The database may decide to expire cached data at unpredictable times, when other concurrent clients make requests that utilize the cache.
Your application-side ORM framework may also cache data in a form that the database can't return. E.g. in the form of a collection of java objects instead of a stream of raw data. If you rely on database caching, your ORM has to repeat that transformation into objects, which adds to overhead and decreases the benefit of the cache.
Also, the database's cache might not be as practical as one thinks. I copied this from http://highscalability.com/bunch-great-strategies-using-memcached-and-mysql-better-together -- it's MySQL specific, though.
Given that MySQL has a cache, why is memcached needed at all?
The MySQL cache is associated with just one instance. This limits the cache to the maximum memory of one server. If your system is larger than the memory of one server then using the MySQL cache won't work. And if the same object is read from another instance, it's not cached.
The query cache invalidates on writes. You build up all that cache and it goes away when someone writes to it. Your cache may not be much of a cache at all depending on usage patterns.
The query cache is row based. Memcached can cache any type of data you want and it isn't limited to caching database rows. Memcached can cache complex objects that are directly usable without a join.
The performance considerations related to the network roundtrips have correctly been pointed out.
To that, it must be added that caching data anywhere other than in the DBMS (NOT the "database") creates the problem of potentially obsolete data still being presented as "up to date".
Giving in to the temptation of a performance improvement comes at the expense of losing the (watertight, or at least close to it) guarantee of reliably correct and consistent data.
Consider this whenever accuracy and consistency are crucial.
A lot of good answers here. I'll add one other point: I know my access pattern, the database doesn't.
Depending on what I'm doing, I know that if the data ends up stale, that's not really a problem. The DB doesn't, and would have to reload the cache with the new data.
I know that I'll come back to a piece of data a few times over the next while, so it's important to keep it around. The DB has to guess at what to keep in the cache; it doesn't have the information I do. So if I fetch it from the DB over and over, it may not be in the cache if the server is busy, and I could get a cache miss. With my cache, I can be sure I get a hit. This is especially true for data that is non-trivial to get (i.e. a few joins, some group functions) as opposed to just a single row. Getting a row with primary key 7 is easy for the DB, but if it has to do some real work, the cost of a cache miss is much higher.
No doubt modern databases provide caching facilities, but when your site has more traffic and you need to perform many database transactions, you will not get high performance. To increase performance in this case, the Hibernate cache helps by optimizing database access: the cache stores data already loaded from the database, so the traffic between the application and the database is reduced when the application wants to access that data again, and access time drops accordingly.
That said, caches can sometimes become a burden and actually slow down the server. When you have high load, the algorithm for what is cached and what is not might not fit the requests coming in, and what you get is a cache that starts to operate like a FIFO over time. This begins to make itself known when the table behind the cache has significantly more records than will ever be cached in memory.
A good trade-off would be to cluster the data you want to cache. Have a main server that pumps updates to the clusters; when to send/pump the updates should be tunable per table, depending on TTL (time to live) settings.
Your logic and data on the user node can then sit on the same server, which opens up in-memory databases; or, if it does have to fetch data, you could set it up to use a pipe instead of a network call.
This takes some thought about how you want to use the data, and if you cluster you have to be aware of distributed transactions (transactions over more than one database); but if the data being cached is updated on its own, without links into other DB spaces, then you can get away with this.
The problem with ORM caching is that if the database is updated independently through another application, the ORM cache can become out of date. It can also get tricky if you do an update to a set: the update might touch something that is in your cache, and you need some sort of algorithm to identify which records need to be removed/updated in memory (slowing down the update!), and that algorithm can become incredibly tricky and bug-prone.
If you use ORM caching, keep to a simple rule: cache simple objects that hardly ever change (user/role details, for example), are small in size, and are hit many times in a request; outside of that, I suggest clustering the data for performance.
