H2 performance recommendations - Java

I'm currently working with a fairly large database, and though I have no specific issues, I would like some recommendations, if anyone has any.
The database is 2.2 GB (after recreation/compacting). It contains about 50 tables. One of those tables contains a blob plus some metadata. It currently has about 22,000 rows. If I remove the blobs from the table (UPDATE table SET blob = null), the database size is reduced to about 200 MB (after recreation/compacting). The metadata is accessed a lot; the blobs, however, are not needed that often.
The database URL I currently use is:
jdbc:h2:D:/data;AUTO_SERVER=true;MVCC=true;CACHE_SIZE=524288
It runs in our Java VM which has 4GB max heap.
Some things I was wondering:
Would running H2 in a separate process have any impact on performance (for better or for worse)?
Would it help to have the blobs in a separate table with a 1-1 relation to the metadata? I could imagine it would help with caching, not having the blobs in the way.
The internet seems divided on whether to include blobs in a database or write them to files on a filesystem with a link in the DB. Any H2-specific advice here?

The answer for you depends on the growth rate of your blob data. If, for example, your data set is going to grow at 10% per week, then there is little point in trying to extend the use of H2 to store blob data (as it will quickly outpace the available heap memory). If instead the blob data is the biggest it will ever be, then attempting to use H2 might make sense.
To answer your questions about H2:
1) Running H2 in a separate process will allow H2 to claim the majority of the heap space, making the available heap space for H2 much easier to control. However, you'll also be adding the maintenance overhead of having a separate process to maintain and monitor. So the answer is "it depends on your operating environment and goals". If you have the people and time, running H2 in a separate process might make sense. But if that's true, then you should probably consider just running an appropriate blob storage platform instead.
2) Yes, you're correct that storing the blobs in a separate table would help with caching, given that you don't often need the blobs (see the sketch after this list). It should also help with retrieval times, as H2 won't have to read past the blobs to find the metadata.
3) Note that "the internet" represents many thousands of people with almost as many different specific use cases. You'll need to filter down your use case into requirements, and then apply the logic you glean from others.
4) My personal advice: if you're trying to build a scalable and maintainable platform, use the right tools. H2, or any other relational database, is most often not the right tool for storing many large blobs. I'd recommend that you investigate using a key/value store.
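To illustrate point 2, here is a minimal sketch of what splitting the blobs into their own table could look like. The table and column names (ITEM, ITEM_BLOB) are invented for illustration, and the connection URL is the one from the question; adapt both to your actual schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SplitBlobTable {
    public static void main(String[] args) throws Exception {
        // Same URL as in the question; the schema below is made up for illustration.
        try (Connection con = DriverManager.getConnection(
                "jdbc:h2:D:/data;AUTO_SERVER=true;MVCC=true;CACHE_SIZE=524288");
             Statement st = con.createStatement()) {

            // Metadata stays in a narrow, frequently read table.
            st.execute("CREATE TABLE ITEM (ID BIGINT PRIMARY KEY, NAME VARCHAR(255), CREATED TIMESTAMP)");

            // Blobs live in their own table, joined 1-1 via the primary key,
            // so scans over the metadata never have to touch the blob data.
            st.execute("CREATE TABLE ITEM_BLOB (ITEM_ID BIGINT PRIMARY KEY, DATA BLOB, "
                    + "FOREIGN KEY (ITEM_ID) REFERENCES ITEM(ID))");
        }
    }
}
```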

Related

Pure Java alternative to database / cache for storing records

I have created an application sold to customers, some of whom are hardware manufacturers with fixed constraints (slow CPU). The app has to be in Java so that it can be easily installed as a single package.
The application is multithreaded and maintains audio records. In this particular case all we have is INSERT SOMEDATA FOR RECORD, each record representing an audio file (and this can be done by different threads); later on we have SELECT SOMEDATA WHERE IDS IN (x, y, z) from a single thread; the third step is that we actually DELETE all the data in this table.
The main constraint is the CPU: a slow, single CPU. Memory is also a constraint, but only in that the application is designed to process an unlimited number of files, so even with lots of memory it would eventually run out if everything were kept in memory rather than on disk.
In my Java application I started off using the H2 database to store all my data. But the software has to run on some slow single-CPU servers, so I want to reduce the CPU cycles used, and one area I want to look at again is the database.
In many cases I am inserting data into the database simply to keep it off the heap (otherwise I would run out of memory); later on we retrieve the data, and we never have to UPDATE it.
So I considered using a cache like ehCache, but that has two problems:
It doesn't guarantee the data will not be thrown away (if the cache gets full)
I can only retrieve records one at a time, whereas with a relational database I can retrieve a batch of records; this looks like a potential bottleneck.
What is an alternative that solves these issues ?
You want to retrieve records in batches quickly and not lose any data, but you don't need optimized queries or updates, and you want to use CPU and memory resources as effectively as possible:
Why don't you simply store your records in a file? The operating system uses any free memory for caching. So when you access your file frequently, the OS will do its best to keep as much of its content as possible in memory. The OS does this job anyway, so this type of caching costs you no additional CPU and not a single line of code.
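As a rough illustration of the "just use a file and let the OS cache it" idea, here is a minimal sketch that appends serialized records to a single file and reads them all back in one pass. The Record class and file name are invented for illustration:

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class RecordFile {
    // Hypothetical record: some data about one audio file.
    static class Record implements Serializable {
        final long id;
        final String path;
        Record(long id, String path) { this.id = id; this.path = path; }
    }

    public static void main(String[] args) throws Exception {
        File file = new File("records.bin");

        // Append-only writes; the OS file cache keeps the hot parts of the file in memory.
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(new Record(1, "/audio/a.wav"));
            out.writeObject(new Record(2, "/audio/b.wav"));
        }

        // Later: read everything back in one sequential pass.
        List<Record> records = new ArrayList<>();
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) {
                try {
                    records.add((Record) in.readObject());
                } catch (EOFException endOfFile) {
                    break;
                }
            }
        }
        System.out.println(records.size() + " records read");
    }
}
```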
The only scenarios where it could make sense to invest more in optimization would be:
a) Your process or other processes make heavy use of the file system and pollute the file cache
b) Serialization / deserialization is too expensive
In case of a):
Define your priorities. An explicit cache (in heap or off-heap) can help you keep some content of selected files in memory. But this memory will no longer be available for the OS's file cache. So while you speed up access to one file, you potentially slow down access to other files.
In case of b):
Measure performance first, before you optimize anything. Usually disk access is the bottleneck; that's something you cannot change without replacing hardware. If you still want to optimize (e.g. because GC eats up CPU due to a very high number of temporarily created objects - I guess with only one core the serial GC will be in use), then I suggest taking a closer look at Google FlatBuffers.
You started with the most complex solution for your problem, a database. I suggest starting at the other end of the spectrum and keeping it as simple as possible.
UPDATE:
The question has been edited in the meantime and the requirements have changed. A new requirement is that it must be possible to read selected records by ID.
Possible extensions:
Store each record in its own file and use the key as the file name
Store all records in one file and use a file-based HashMap implementation such as MapDB's HTreeMap (see the sketch below)
Independent of the chosen extension, the operating system's file cache will do its best to hold as much content as possible in main memory.
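For the second extension, a minimal sketch of a file-backed map, assuming MapDB (3.x) is on the classpath; the file name, map name and value type are chosen only for illustration:

```java
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.HTreeMap;
import org.mapdb.Serializer;

public class FileBackedMap {
    public static void main(String[] args) {
        // One file on disk acts as the store; the OS file cache keeps hot pages in RAM.
        DB db = DBMaker.fileDB("records.db").make();

        HTreeMap<Long, byte[]> records = db
                .hashMap("records", Serializer.LONG, Serializer.BYTE_ARRAY)
                .createOrOpen();

        records.put(42L, new byte[]{1, 2, 3}); // INSERT
        byte[] data = records.get(42L);        // SELECT ... WHERE ID = 42
        records.clear();                       // DELETE everything afterwards

        db.close();
    }
}
```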
Some ideas that might help:
You say that you're running on a single CPU and want to evaluate a substitute for H2. So H2 "consumes" a lot of CPU power and the application is claimed to be "slow". But what if it's because of a slow disk, not the CPU? After all, databases store their data on disks, and disks can be slow. If you want to check this theory, map the database directory to a RAM-backed drive (in Linux this is an easy task) and measure again with the same CPU.
If you come to the conclusion that H2 is indeed CPU-intensive for your use cases, maybe it's worth investing some time in optimizing the queries; this is much cheaper than replacing the database.
Now, if you can't stay with H2, consider Lucene, which is really optimized for this "append-only" use case (I understand that you have an "append-only" flow because you said "later on we retrieve the data, we never have to UPDATE the data"). Having said that, Lucene also has its own threads that handle indexing, so some CPU overhead is expected anyway. However, the chances are that Lucene will be faster for this use case. The price is that you won't get "easy" queries, because Lucene doesn't implement the relational model (well, maybe partially because of that it should be faster); in particular you won't have JOINs or transaction management. It's still possible to query by conditions on a single table as in an RDBMS, and you don't have to fetch "top hits" as you describe.
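A rough sketch of that append-only flow with Lucene; the field names and index path are invented, and the classes shown (IndexWriter, IndexSearcher, TermQuery) should be checked against the Lucene version you actually use:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AudioRecordIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("audio-index"));

        // Step 1: INSERT SOMEDATA FOR RECORD, possibly from several threads (IndexWriter is thread-safe).
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "42", Field.Store.YES)); // exact-match key
            doc.add(new StoredField("payload", "some metadata"));  // stored only, never searched
            writer.addDocument(doc);                               // append-only: never updated
        }

        // Step 2: SELECT SOMEDATA WHERE ID IN (...), from a single thread.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("id", "42")), 10);
            System.out.println(hits.totalHits + " hit(s)");
        }
    }
}
```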
From your question and the comments made on Mark Bramnik's answer I understood this:
CPU constraint: very slow CPU, the solution should not be CPU-intensive
Memory constraint: Not all data can be in memory
Disk constraint: very slow disk, solution should not read/write lots of data from disk
These are very strict constraints. Usually you "trade" CPU vs memory or memory vs disk. In your case all three are constrained. You mentioned you looked at ehCache; however, I think this solution (and possibly others such as memcached) is no more lightweight than H2.
One solution you could try is MappedByteBuffer. This class makes it possible to have parts of a file in memory and will swap those parts in and out when needed. But this comes at a cost: it is not an easy beast to tame. You will need to write your own algorithm to locate the data you need. Please consider how much time it will take you to get it working vs the additional cost of a bigger machine. Sometimes better hardware is the solution.
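A minimal sketch of the idea using plain java.nio; the fixed-slot layout, record size and file name are assumptions, and a real implementation would need its own index and free-space handling:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedStore {
    static final int RECORD_SIZE = 256; // assumed fixed slot size per record

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("store.bin", "rw");
             FileChannel channel = raf.getChannel()) {

            // Map 16 MB of the file; the OS pages parts of it in and out of RAM as needed.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 16L * 1024 * 1024);

            long id = 42;
            int offset = (int) (id * RECORD_SIZE); // record i lives in slot i (your own scheme)

            buf.putLong(offset, id);             // write a field at an absolute position
            long readBack = buf.getLong(offset); // read it back without loading the whole file on the heap
            System.out.println(readBack);
        }
    }
}
```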
Relational databases like Oracle are decades old (41 years); can you imagine how many CPU cycles were available back then? They are based on research from 1970, well understood by professionals, tested, documented, reliable, consistent (checksums), maintainable (backups with zero data loss), performant if used correctly (all kinds of indexes), securely accessible over the network, scalable, etc. But apparently: Not Invented Here.
Nowadays there are even many free open-source databases like PostgreSQL that have very modest requirements, give you the potential to easily implement new requirements in the future (which is hard to predict), and are, with some effort, interchangeable with other databases (JDBC, JPA).
Yes, there is some overhead, but typically hardware is cheaper than changing your architecture late in the project, and CPU cycles are not an expensive resource anymore (think Raspberry Pi, smartphones, etc.).

Why is file system storage faster than SQL databases

Extending this thread - I would just like to know why it's faster to retrieve files from a file system, rather than a MySQL database. If one were to benchmark the two to see which would retrieve the most data (multiple types of data) over 10 minutes - which one would win?
If a file system is truly faster, then why not just store everything in a file system and replace a database with csv or xml?
EDIT 1:
I found a good resource for alternate storage options for Java
EDIT 2:
I'm looking for a Java API/Jar that has the functionality of a SQL Database Server Engine (or at least some of it) that uses XML for data storage (preferably). If you know of something, please leave a comment below.
At the end of the day the database does just store the data in the file system. It's all the useful stuff on top of just the raw data that makes you decide to use a database.
If you can replicate the functionality, scalability, robustness, integrity, etc, etc of a database system using CSV and still make it perform faster than a relational database then yes I'd suggest doing it your way.
It'd take you a few years to get there though.
Of course, relational systems are not the only way to store data. There are object-oriented database systems (db4o, InterSystems Cache) and document-based systems (RavenDB).
Performance is also relative to the style and volume of data you are working with and what you intend to do with it - I'm not going to even try and discuss that, it's too open ended.
I will also not start the follow on discussion: if memory is truly faster than the file system, why not just store everything in memory? :-)
This also seems similar to another question I answered a long while ago:
Is C# really slower than say C++?
Basically stuff isn't always done just for performance.
MySQL uses the file system the same as everything else on a computer. To retrieve a single piece of data, or a table of data, there is no faster way than directly from the file system. MySQL would just be a small bit of overhead added to that file system pull.
If you need to do some intelligent selecting, match some rows, or filter that data, MySQL is going to do that faster than most other options. The database server provides you calculation and data manipulation power that a filesystem can't.
When you have mixed/structured data, a DBMS is the only solution. For example, try to get the name, surname and country of all the customers stored in your DB, but only those born in 1981 and living in Rome (a sketch of such a query follows below). If you have this data in files on the filesystem, how do you easily get only the required data without scanning all your files, and how do you join the returned data?
A DBMS gives you much more than that.
Many DBMS store data into files.
This abstraction layer lets you retrieve data in a very easy, standard and structured way.
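To make the Rome example concrete, here is a minimal JDBC sketch; the table and column names are invented, and the H2 URL is just a stand-in for whatever DBMS you use:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CustomerQuery {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:./customers");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT name, surname, country FROM customer "
                   + "WHERE YEAR(birth_date) = ? AND city = ?")) {
            ps.setInt(1, 1981);
            ps.setString(2, "Rome");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // The DBMS has already filtered and located the rows for us.
                    System.out.println(rs.getString("name") + " " + rs.getString("surname")
                            + " " + rs.getString("country"));
                }
            }
        }
    }
}
```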
The difference is in how the desired data is located.
In a file system, locating the desired data means searching through all existing data until you find it.
Databases provide indexing, which locates the desired data almost immediately: on the order of log n comparisons, so only a few dozen even for millions of rows, instead of a scan through everything.
What we want is an indexed file system - lucky for us, we have them. They are called databases.

Options when storing all data in memory doesn't scale

I've written a Java application that users install on their desktop. It crawls websites, storing the data about each page in a LinkedList. The application allows users to view all the pages crawled in a JTable.
This works great for small sites, but doesn't scale very well. Currently users have to allocate more memory (which translates to a -Xmx flag when starting Java) for larger crawls.
My current thinking is to move to storing all the data in a database, possibly using something like HSQLDB.
Are there any other approaches I should be considering?
A relational DB is not a good place to store web page data. Could you save the pages on disk? If you want to search the crawl results, try the Apache Lucene search engine. Loading all the results into memory at once is not reasonable. You can paginate the JTable model and use soft references to cache some results during pagination.
A relational database is probably the right approach for this case. Reasons:
It'll enable you to handle larger-than-memory crawls.
If you keep the link data in separate tables from the considerably larger volumes of page data, you may still be able to fit all your links in memory, which will be pretty important from a performance and searching perspective
It will give you an easy way of persisting crawled data (in case this is needed in the future)
It's pretty well known / standard technology
There are good open source database implementations available (H2 or JavaDB would probably be my first choices as they are embeddable and written in pure Java)
The relational features could turn out to be useful, for example queries on link data
It doesn't sound like you have the data volumes or availability requirements that might push you towards a NoSQL-type solution
You have basically 4 options:
Store the data in flat files
Store the data in a database
Somehow transmit the data to "the cloud" (I have no idea how)
Somehow "pare" the data down to the essentials, knowing that you can re-extract the full info when needed
You can also do a variant of 4 to gain some space: rather than a "rich" object structure, compress each distinct datum into a single String or byte[] or some such that you keep in an array or ArrayList rather than a linked list. This can reduce your storage requirements by 2x or more. Less "object oriented", but sometimes reality intervenes.
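A rough sketch of that variant, assuming plain Java serialization plus GZIP is acceptable; the placeholder content stands in for whatever per-page data the crawler keeps:

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class PackedPages {
    // Compress one crawled record down to a byte[] instead of keeping a rich object graph.
    static byte[] pack(Serializable page) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(page);
        }
        return bytes.toByteArray();
    }

    // Re-extract the full info only when it is actually needed.
    static Object unpack(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> pages = new ArrayList<>(); // ArrayList instead of a LinkedList of rich objects
        pages.add(pack("url=http://example.com;title=Example;status=200")); // placeholder datum
        System.out.println(unpack(pages.get(0)));
    }
}
```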
Try storing the page data in db4o http://community.versant.com , an object database. Object databases handle complex objects (e.g. with lots of siblings) better than relational databases.

When is BIG, big enough for a database?

I'm developing a Java application that has performance at its core.
I have a list of some 40,000 "final" objects,
i.e., I have initialization input data of 40,000 vectors.
This data is unchanged throughout the program's run.
I am always performing lookups against a single ID property to retrieve the proper vectors.
Currently I am using a HashMap over a sub-sample of 1,000 vectors, but I'm not sure it will scale to production.
When is BIG actually big enough for the use of a DB?
One more thing: an SQLite DB is a viable option as no concurrency is involved,
so I guess the "threshold" for DB use is perhaps lower.
I think you're asking whether a HashMap with 40,000 entries in it will be okay. The answer is yes - unless you really don't have enough memory, that should be absolutely fine. If you're writing a performance-sensitive app, then putting a large amount of fast memory in the machine running the app is likely to be an efficient way of boosting performance anyway.
There won't be very much overhead for each HashMap entry, so if you've got enough space to store the objects themselves in memory, it's unlikely that the overhead of the map would cause a problem.
Is there any reason why you can't just test this with a reasonable amount of data?
If you really have no more requirements than:
Read data at start-up
Put data in a map by a single ID (no need for joins, queries against different fields, substring matches etc)
Fetch data from map
... then using a full-blown database would be a huge amount of overkill, IMO.
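A minimal sketch of that pattern, with invented names for the vector type and the loader; the point is just load once, then look up by ID:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VectorLookup {
    // Hypothetical immutable vector: an ID plus its data.
    static class DataVector {
        final long id;
        final double[] values;
        DataVector(long id, double[] values) { this.id = id; this.values = values; }
    }

    public static void main(String[] args) {
        // loadVectors() stands in for however the 40,000 vectors are read at start-up.
        List<DataVector> input = loadVectors();

        Map<Long, DataVector> byId = new HashMap<>(input.size() * 2); // pre-size to avoid rehashing
        for (DataVector v : input) {
            byId.put(v.id, v);
        }

        DataVector hit = byId.get(7L); // O(1) lookup by the single ID property
        System.out.println(hit == null ? "not found" : "found vector " + hit.id);
    }

    static List<DataVector> loadVectors() {
        // Placeholder: real code would parse the initialization input here.
        return List.of(new DataVector(7L, new double[]{1.0, 2.0}),
                       new DataVector(8L, new double[]{3.0, 4.0}));
    }
}
```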
As long as you're loading the data set into memory at the beginning of the program, keeping it in memory, and you don't have any complex queries, some sort of serialization/deserialization seems more feasible than a full-blown database.
You could start using a DB with as few as 100 records (or fewer). There is no general rule for when the amount of data is large enough to store in a database. It's more about whether storing this data in a database gives you any benefit (a performance boost, easier programming, more flexible options for your users).
When the benefits are greater than the cost of implementation put it in a database.
There is no set size for a Collection vs a Database. It depends heavily on what you want to do with the data. Size is less important.
You can have a Map with a billion entries.
There's no such thing as 'big enough for a database'. The question is whether there are enough advantages in using a database to overcome the costs.
Having said that, 40,000 isn't 'big' ;-) Unless the objects are huge or you have complex query requirements I would start with an in-memory implementation. But if you expect to scale this number up over time it might be better to use the database from the beginning.
One option that you might want to consider is the Oracle Berkeley DB Java Edition library. It's a simple JAR file that can read/write data to persistent storage. Because of its small footprint and ease of use, it's used for applications running on small to very large data sets. It's designed to be linked into the application, so that it's embedded and doesn't require a complex client/server installation or protocol stacks.
What's even better is that it's extremely scalable (which works well if you end up with larger data sets than you expect), is very fast, and supports both a Java Collections API and a Direct Persistence Layer API (POJO-like). So you can use it seamlessly with Java Collections.
Berkeley DB Java Edition was designed specifically with Java application developers in mind. It's designed to be simple to use, lightweight in terms of the resources required, but very fast, scalable and reliable.
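As a taste of the Direct Persistence Layer mentioned above, a minimal sketch; the entity class and directory name are invented, and the com.sleepycat.persist calls should be verified against the JE version you pick up:

```java
import java.io.File;

import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.persist.EntityStore;
import com.sleepycat.persist.PrimaryIndex;
import com.sleepycat.persist.StoreConfig;
import com.sleepycat.persist.model.Entity;
import com.sleepycat.persist.model.PrimaryKey;

public class BdbJeExample {
    @Entity
    static class VectorEntity {
        @PrimaryKey
        long id;
        double[] values;
    }

    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("bdb-env"), envConfig); // directory must already exist

        StoreConfig storeConfig = new StoreConfig();
        storeConfig.setAllowCreate(true);
        EntityStore store = new EntityStore(env, "vectors", storeConfig);

        PrimaryIndex<Long, VectorEntity> byId = store.getPrimaryIndex(Long.class, VectorEntity.class);

        VectorEntity v = new VectorEntity();
        v.id = 7L;
        v.values = new double[]{1.0, 2.0};
        byId.put(v);                      // persist the POJO
        VectorEntity back = byId.get(7L); // look it up again by primary key

        store.close();
        env.close();
    }
}
```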
You can find more information about Oracle Berkeley DB Java Edition here
Regards,
Dave

Java handling large amounts of data

I have a Java application that needs to display large amounts of data (on the order of 1 million data points). The data doesn't all need to be displayed at the same time but rather only when requested by a user. The app is a desktop app that is not running with an app server or hitting any centralized database.
My thought was to run a database on the machine and load the data in there. The DB will be read-only most of the time, so I should be able to add indexes to help optimize queries. If I'm running on a local system, I'm not sure if I should try to implement some caching (I'm not sure how fast the queries will run; I'm currently working on them).
Is this a logical way to approach the problem, or would there be better approaches?
Thanks,
Jeff
Display and data are two different things.
You don't give any details about either, but it could be possible to generate the display in the background, bringing in the data one slice at a time, and then displaying when it's ready. Lots of anything could cause memory issues, so you'll need to be careful. The database will help persist things, but it won't help you get ten pounds of data into your five pound memory bag.
UPDATE: If individuals are only reading a few points at a time, and display isn't an issue, then I'd say that any database will be able to handle it if you index the table appropriately. One million rows isn't a lot for a capable database.
Embedded DB seems reasonable. Check out JavaDB/Derby or H2 or HSQLDB.
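For instance, a minimal embedded H2 sketch that stores points in a single local file, indexes them, and fetches only the slice a user asks for; the table, columns and file path are made up:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedH2Demo {
    public static void main(String[] args) throws Exception {
        // A single local file; no server process needed.
        try (Connection con = DriverManager.getConnection("jdbc:h2:./points")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS POINT (ID BIGINT PRIMARY KEY, X DOUBLE, Y DOUBLE)");
                st.execute("CREATE INDEX IF NOT EXISTS IDX_POINT_X ON POINT(X)"); // index for the read-mostly queries
            }

            try (PreparedStatement ins = con.prepareStatement("INSERT INTO POINT VALUES (?, ?, ?)")) {
                ins.setLong(1, 1);
                ins.setDouble(2, 0.5);
                ins.setDouble(3, 1.5);
                ins.executeUpdate();
            }

            // Only fetch the range the user requested, not the whole million rows.
            try (PreparedStatement q = con.prepareStatement("SELECT ID, X, Y FROM POINT WHERE X BETWEEN ? AND ?")) {
                q.setDouble(1, 0.0);
                q.setDouble(2, 1.0);
                try (ResultSet rs = q.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("ID") + ": " + rs.getDouble("X") + ", " + rs.getDouble("Y"));
                    }
                }
            }
        }
    }
}
```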
SQLite with a Java wrapper is fine too.
It really depends on your data. Do multiple instances request the data? If not, it is definitely worth looking at a simple SQLite database for storage. It is just a single file on your file system. No need to set up a server.
Well, it depends on the data size. One million integers, for example, isn't that much, but one million data structures/classes of, let's say, 1000 bytes each is a lot.
For small data: keep it in memory
For large data: I think using the DB would be good.
Just my opinion :)
Edit:
Of course it also depends on the speed you want to achieve. If you really need high speed and the data is big, you could also cache some of it in memory and leave the rest in the DB.
